Testing and documenting DBT models

Published at: 2/24/2024
Categories: dbt, testing, dataengineering
Author: cmcrawford2

In this video we learned how to test and document dbt models. We also learned about the codegen library. This is part of Week 4 of the data engineering zoomcamp by DataTalksClub.

We want to identify errors quickly, and we don't want to build on top of data that is full of errors. Tests are assumptions we make about the data. These assumptions are compiled to SQL, which returns the number and type of failing records. They are defined on table columns in the .yml file.

dbt provides built-in tests to check that column values are unique, not null, within a set of accepted values, or match existing foreign keys in another table. We can also create custom tests as queries.
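
A custom test is just a SQL file in the tests/ folder that selects the rows that break the assumption; if the query returns any rows, the test fails (or warns). Here's a small sketch of one, assuming a hypothetical total_amount column in the staging model:

-- tests/assert_no_negative_total_amount.sql (hypothetical example)
-- dbt treats every row returned by this query as a failing record.
select *
from {{ ref('stg_green_tripdata') }}
where total_amount < 0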

To do this quickly, I installed a package called "codegen", which generates code that I can copy and paste into my files. I used it to generate the column entries for a YAML file. First I added codegen to packages.yml:

  - package: dbt-labs/codegen
    version: 0.12.1

Then I ran dbt deps in the command line area. Next, I opened a scratch file (+ on the upper right of the tabs), and pasted the following:

{% set models_to_generate = codegen.get_models(directory='staging', prefix='stg') %}
{{ codegen.generate_model_yaml(
    model_names = models_to_generate
) }}

Then I hit "compile selection". This generated the column names for all of the models that start with "stg", along with their data types and a blank "description" field. I copied all of it into schema.yml under the "models" key. Here's a partial view:

version: 2

sources:
  - name: staging
    database: data-engineering-2024-411821
    # For postgres:
    # database: production
    schema: trips_data_all

    # loaded_at_field: record_loaded_at
    tables:
      - name: green_tripdata
      - name: yellow_tripdata
      - name: fhv_tripdata

models:
  - name: stg_green_tripdata
    description: ""
    columns:
      - name: tripid
        data_type: string
        description: ""
        tests:
          - unique:
              severity: warn
          - not_null:
              severity: warn

      - name: vendorid
        data_type: int64
        description: ""

      - name: ratecodeid
        data_type: int64
        description: ""

You can see where I added tests for tripid. I wanted tripid to be unique and not null; these are two of the tests provided by dbt. The severity is "warn" because, for now, I don't actually want to stop the build.
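
For intuition, the unique test compiles to SQL roughly like this (a simplified sketch, not the exact query dbt generates):

-- Simplified sketch of what the unique test checks for tripid:
-- any tripid that appears more than once is a failing record.
select tripid
from {{ ref('stg_green_tripdata') }}
where tripid is not null
group by tripid
having count(*) > 1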

Another test is relationships. For this one, I needed to identify the field and the table whose values the column is expected to match.

      - name: pickup_locationid
        data_type: int64
        description: ""
        tests:
          - relationships:
              field: locationid
              to: ref('taxi_zone_lookup')
              severity: warn

      - name: dropoff_locationid
        data_type: int64
        description: ""
        tests:
          - relationships:
              field: locationid
              to: ref('taxi_zone_lookup')
              severity: warn

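Under the hood, a relationships test just looks for child values with no matching parent. Roughly (a simplified sketch, not dbt's exact generated SQL):

-- pickup_locationid values that have no match in taxi_zone_lookup
select child.pickup_locationid
from {{ ref('stg_green_tripdata') }} as child
left join {{ ref('taxi_zone_lookup') }} as parent
  on child.pickup_locationid = parent.locationid
where child.pickup_locationid is not null
  and parent.locationid is null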

Finally, I added an accepted-values test to payment_type. I know there are six payment types, so I defined the variable "payment_type_values" in dbt_project.yml:

vars:
  payment_type_values: [1, 2, 3, 4, 5]

Since I wanted to test my tests, I left out "6". Back in schema.yml, I added the accepted_values test to payment_type, using the "var" macro to insert the list. quote: false is apparently something we need for BigQuery; it wasn't explained in the video, but it tells dbt not to quote the values in the compiled SQL, which matters because payment_type is an integer column.

      - name: payment_type
        data_type: int64
        description: ""
        tests:
          - accepted_values:
              values: "{{ var('payment_type_values') }}"
              severity: warn
              quote: false

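With quote: false, the list from the var ends up unquoted in the compiled SQL; with the default quoting, the values would be compared as strings, which doesn't work against BigQuery's integer column. Roughly, the test checks something like this (simplified sketch):

-- Simplified sketch of the accepted_values check with quote: false:
select payment_type
from {{ ref('stg_green_tripdata') }}
where payment_type not in (1, 2, 3, 4, 5)
-- with quoting enabled it would compare against ('1', '2', ...) instead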

I ran the tests and got a warning that some payment_type values aren't among the accepted ones, which is exactly what I expected after leaving out "6". This is actually very good: we can test our data, which consists of millions of rows, and make sure there's nothing wrong with it.
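
For reference, tests can also be run from the command line, for example (assuming the staging models live in a models/staging folder):

# run only the tests defined on one model
dbt test --select stg_green_tripdata

# or build the staging models and run their tests in one go
dbt build --select staging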
