Testing and documenting DBT models

Published at: 2/24/2024
Categories: dbt, testing, dataengineering
Author: cmcrawford2

In this video we learned how to test and document dbt models. We also learned about the codegen library. This is part of Week 4 of the data engineering zoomcamp by DataTalksClub.

We want to identify errors quickly, and we don't want to build on top of data that is full of errors. Tests are assumptions we make about the data. These assumptions are compiled to SQL, which returns the number and type of failing records. They are defined on table columns in the .yml file.

dbt provides built-in tests to check that column values are unique, not null, within a set of accepted values, or match existing foreign keys in another table. We can also create custom tests as queries.
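
A custom test is just a SQL file in the tests/ folder that selects the rows that break the assumption; if the query returns any rows, the test fails (or warns). Here's a small sketch of one, assuming a hypothetical total_amount column in the staging model:

-- tests/assert_no_negative_total_amount.sql (hypothetical example)
-- dbt treats every row returned by this query as a failing record.
select *
from {{ ref('stg_green_tripdata') }}
where total_amount < 0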

To do this quickly, I installed a package called "codegen", which generates code that I can copy and paste into my files. I used it to generate the column entries for a YAML file. First I added codegen to packages.yml:

  - package: dbt-labs/codegen
    version: 0.12.1

Then I ran dbt deps in the command line area. Next, I opened a scratch file (+ on the upper right of the tabs), and pasted the following:

{% set models_to_generate = codegen.get_models(directory='staging', prefix='stg') %}
{{ codegen.generate_model_yaml(
    model_names = models_to_generate
) }}

Then I hit "compile selection". This generated the column names for all of the models that start with "stg", along with their data types and a blank "description" field. I copied all of it into schema.yml under the "models" key. Here's a partial view:

version: 2

sources:
  - name: staging
    database: data-engineering-2024-411821
    # For postgres:
    # database: production
    schema: trips_data_all

    # loaded_at_field: record_loaded_at
    tables:
      - name: green_tripdata
      - name: yellow_tripdata
      - name: fhv_tripdata

models:
  - name: stg_green_tripdata
    description: ""
    columns:
      - name: tripid
        data_type: string
        description: ""
        tests:
          - unique:
              severity: warn
          - not_null:
              severity: warn

      - name: vendorid
        data_type: int64
        description: ""

      - name: ratecodeid
        data_type: int64
        description: ""

You can see where I added tests for tripid. I wanted tripid to be unique and not null; these are two of the tests provided by dbt. The severity is "warn" because, for now, I don't actually want to stop the build.
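
For intuition, the unique test compiles to SQL roughly like this (a simplified sketch, not the exact query dbt generates):

-- Simplified sketch of what the unique test checks for tripid:
-- any tripid that appears more than once is a failing record.
select tripid
from {{ ref('stg_green_tripdata') }}
where tripid is not null
group by tripid
having count(*) > 1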

Another test is relationships. For this one, I needed to identify the field and the table whose values the column is expected to match.

      - name: pickup_locationid
        data_type: int64
        description: ""
        tests:
          - relationships:
              field: locationid
              to: ref('taxi_zone_lookup')
              severity: warn

      - name: dropoff_locationid
        data_type: int64
        description: ""
        tests:
          - relationships:
              field: locationid
              to: ref('taxi_zone_lookup')
              severity: warn

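Under the hood, a relationships test just looks for child values with no matching parent. Roughly (a simplified sketch, not dbt's exact generated SQL):

-- pickup_locationid values that have no match in taxi_zone_lookup
select child.pickup_locationid
from {{ ref('stg_green_tripdata') }} as child
left join {{ ref('taxi_zone_lookup') }} as parent
  on child.pickup_locationid = parent.locationid
where child.pickup_locationid is not null
  and parent.locationid is null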

Finally, I added an accepted-values test to payment_type. I know there are six payment types, so I defined the variable "payment_type_values" in dbt_project.yml:

vars:
  payment_type_values: [1, 2, 3, 4, 5]

Since I wanted to test my tests, I left out "6". Back in schema.yml, I added the accepted_values test to payment_type, using the "var" macro to insert the list. quote: false is apparently something we need for BigQuery; it wasn't explained in the video, but it tells dbt not to quote the values in the compiled SQL, which matters because payment_type is an integer column.

      - name: payment_type
        data_type: int64
        description: ""
        tests:
          - accepted_values:
              values: "{{ var('payment_type_values') }}"
              severity: warn
              quote: false

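With quote: false, the list from the var ends up unquoted in the compiled SQL; with the default quoting, the values would be compared as strings, which doesn't work against BigQuery's integer column. Roughly, the test checks something like this (simplified sketch):

-- Simplified sketch of the accepted_values check with quote: false:
select payment_type
from {{ ref('stg_green_tripdata') }}
where payment_type not in (1, 2, 3, 4, 5)
-- with quoting enabled it would compare against ('1', '2', ...) instead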

I ran the tests and got a warning that some payment_type values aren't among the accepted ones, which is exactly what I expected after leaving out "6". This is actually very good: we can test our data, which consists of millions of rows, and make sure there's nothing wrong with it.
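
For reference, tests can also be run from the command line, for example (assuming the staging models live in a models/staging folder):

# run only the tests defined on one model
dbt test --select stg_green_tripdata

# or build the staging models and run their tests in one go
dbt build --select staging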
