
Production and CI/CD in dbt

Published: 3/13/2024
Categories: dbt, dataengineering, github
Author: cmcrawford2

This is what I've learned about taking a dbt project into production and running continuous integration. It comes from DataTalks.Club's data engineering zoomcamp, week 4. You can watch this video for the original instruction.

First, log in to dbt Cloud. Go to cloud.getdbt.com and log in if you aren't already. Select "Develop" and choose "Cloud IDE". You'll see the familiar environment from the last dbt lesson.

Deployment consists of creating a pull request and merging the code into the main branch, which affects our production environment. For instance, in development we limited the data, but in production we want all of it. Also, not everyone will access the data in the same way, and we might put the data in a different place.
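
As one way to make the dev-versus-prod difference concrete, here's a minimal sketch of a model that limits rows in development but processes the full table in production, keying off the target name. The source and model names are hypothetical, not from the course:

    -- models/stg_trips.sql (hypothetical model)
    select *
    from {{ source('raw', 'trips') }}
    -- only process a small sample while developing;
    -- the production target gets the full table
    {% if target.name == 'dev' %}
    limit 100
    {% endif %}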

The main branch and the developer branches don't know about each other. But when we're done developing, we open a pull request. Ideally, a maintainer will check our code and merge it. We will also see how this process can be automated.

Then we can run the models in production, and we can schedule them to run daily, hourly, etc., to keep our models up to date.

dbt Cloud includes a scheduler for creating jobs that run in production. These jobs can be triggered manually or on a schedule. A job can also generate documentation, which can be viewed under the run information, and it can check for source freshness.

I created a new environment called "Production": I selected Deploy, chose "Environments" from the drop-down menu, and created the environment, calling the dataset "prod". Then I created a deploy job called "nightly". It came with "dbt build" as the default command, but we could change that, and we could run multiple commands in the same job. The job will generate metadata that we can use afterwards. I left the command as it was.
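
Since a job runs its commands in order, a multi-command job could look like this minimal sketch (all three are standard dbt commands, though the course job only used the default):

    dbt deps
    dbt seed
    dbt build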

I also checked the boxes for "Generate docs on run" and "Run source freshness". Then I scheduled it to run at hour 12 each day and unchecked Saturday and Sunday.
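
In cron notation, which dbt Cloud's scheduler also accepts as a custom schedule, that weekday-only schedule would look like this:

    0 12 * * 1-5    # at minute 0 of hour 12, Monday through Friday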

So I've scheduled a run, and I can also run the job manually. There's a third way to run: triggering a run with an API call. We could have a pipeline where we load the data, transform it, and save it, then trigger a run once the data is ready.
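
As a sketch of what an API trigger could look like, here's a minimal Python example against dbt Cloud's job-run endpoint; the account ID, job ID, and token are placeholders you'd fill in from your own account:

    import requests

    ACCOUNT_ID = 12345        # placeholder: your dbt Cloud account ID
    JOB_ID = 67890            # placeholder: the job to trigger
    API_TOKEN = "your-token"  # placeholder: a dbt Cloud API token

    # Trigger a run of the job via dbt Cloud's v2 Administrative API
    response = requests.post(
        f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
        headers={"Authorization": f"Token {API_TOKEN}"},
        json={"cause": "Triggered from the data pipeline"},
    )
    response.raise_for_status()
    print(response.json()["data"]["id"])  # the ID of the new run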

When the run is done, we can look at it. Select the run, and there we can find the documentation. On the lower right, we can select "Lineage". Here's a picture of my run:

[Image: lineage graph of my dbt run]

Next I created a continuous integration job. Continuous integration (CI) is the practice of regularly merging development branches into a central repository, after which automated builds and tests are run. dbt allows us to enable CI on pull requests. When a PR is ready to be merged, a webhook is received in dbt Cloud that starts a new run of the specified job. The run executes against a temporary schema, and the PR is not merged until the run completes successfully.
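
A common command for this kind of CI job is a "slim" build that only rebuilds the models changed in the PR plus everything downstream of them; a sketch, assuming dbt Cloud is configured to supply the production state to compare against:

    dbt build --select state:modified+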

The way this works is we make a change in a file, commit it, and create a pull request using the large green button on the left. This will trigger the CI run. Once the run succeeds, someone can manually merge the PR on GitHub.

It's also important to know how to stop these runs; that wasn't covered in the video. If you don't stop them, they will run forever, or until, like me, you take the class again and discover you have a dbt job that's still doing nightly runs from two years ago. Go to the job you want to delete, select "Settings" in the upper right, choose "Edit", and scroll down to the bottom. You should see a delete button. Select it, confirm that you really want to delete, and you're done!
