Production and CI/CD in dbt
This is what I've learned about taking a dbt project into production and running Continuous Integration (CI). It comes from week 4 of DataTalksClub's Data Engineering Zoomcamp. You can watch this video for the original lesson.
First, log in to dbt Cloud. Go to cloud.getdbt.com and log in if you aren't logged in already. Select "Develop" and choose "Cloud IDE". You'll see the familiar environment from the last dbt lesson.
Deployment consists of creating a pull request and merging the code into the main branch. This affects our production environment, which differs from development in a few ways. In development, we limited the data; in production, we want all of it. Not everyone will have the same access to the production data. And we could write the production data to a different place.
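To make the "limited data in development, full data in production" idea concrete, here is a minimal Python sketch of the switch. In a real dbt project this logic would live in the model's SQL rather than in Python; the function name `build_query`, the `is_test_run` flag, the `row_limit` parameter, and the table `stg_trips` are all placeholders I made up for illustration.

```python
def build_query(base_sql: str, is_test_run: bool, row_limit: int = 100) -> str:
    """Return the query to run, capped to a few rows when this is a test run."""
    if is_test_run:
        # Development: keep runs fast and cheap by limiting the rows.
        return f"{base_sql}\nlimit {row_limit}"
    # Production: run against all of the data.
    return base_sql


# Hypothetical staging table name, used only for this example.
dev_sql = build_query("select * from stg_trips", is_test_run=True)
prod_sql = build_query("select * from stg_trips", is_test_run=False)
print(dev_sql)
print(prod_sql)
```

The point is only that the same model definition can produce a capped query in development and an uncapped one in production, depending on a single flag.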
The main branch and the developer branches don't know about each other. But when we're done developing, we open a pull request. Ideally, a maintainer will check our code and merge it. We will also see how this process can be automated.
Then we can run the models in production. We can also schedule them to run daily, hourly, and so on, to keep our models up to date.
dbt Cloud includes a scheduler for creating jobs that run in production. These jobs can be triggered manually or on a schedule. A job can also generate documentation, which can be viewed under the run information, and it can check source freshness.
I created a new environment called "Production". I selected deploy, chose "environments" from the drop-down menu, and created a new environment. I called the dataset "prod". Then I created a deploy job called "nightly". It came with "dbt build" by default as a command, but we could change that. We could run multiple commands in the same job. The job will generate metadata that we can use afterwards. I left the command as it was.
I also checked the boxes for "Generate docs on run" and "Run source freshness". Then I scheduled it to run at specific hours of the day, entered "12", and unchecked Saturday and Sunday so that it only runs on weekdays.
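The schedule I picked (hour 12, Monday through Friday) can also be written as a standard five-field cron expression. Here is a small sketch that builds one. The helper `cron_expression` is my own, not part of dbt, and it assumes the common cron convention where 0 is Sunday and 1 to 5 are Monday to Friday.

```python
def cron_expression(hours, weekdays, minute=0):
    """Build a five-field cron string: minute hour day-of-month month day-of-week."""
    if not hours or not weekdays:
        raise ValueError("need at least one hour and one weekday")
    if any(h not in range(24) for h in hours):
        raise ValueError("hours must be between 0 and 23")
    if any(d not in range(7) for d in weekdays):
        raise ValueError("weekdays must be between 0 (Sunday) and 6 (Saturday)")
    hour_field = ",".join(str(h) for h in sorted(set(hours)))
    day_field = ",".join(str(d) for d in sorted(set(weekdays)))
    # Day-of-month and month are left as "*" so only hour and weekday restrict the run.
    return f"{minute} {hour_field} * * {day_field}"


MONDAY_TO_FRIDAY = [1, 2, 3, 4, 5]
print(cron_expression([12], MONDAY_TO_FRIDAY))  # 0 12 * * 1,2,3,4,5
```

Reading the result left to right: minute 0, hour 12, any day of the month, any month, Monday through Friday.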
So I've scheduled a run now, and I can also run the job manually. There's a third way to run a job, which is to trigger it through the API. We could have a pipeline that loads the data, transforms it, and saves it, then triggers the dbt job once everything is ready.
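Here is a minimal sketch of what triggering a job through the API could look like from Python, using only the standard library. It assumes the dbt Cloud v2 "trigger job run" endpoint on the default `cloud.getdbt.com` host and a token sent in the `Authorization` header; the account ID, job ID, and token below are placeholders, so check the API docs for your own account before relying on it.

```python
import json
import urllib.request

DBT_CLOUD_HOST = "https://cloud.getdbt.com"  # assumption: default dbt Cloud host


def build_trigger_request(account_id, job_id, token, cause):
    """Build (but do not send) the POST request that asks dbt Cloud to run a job."""
    url = f"{DBT_CLOUD_HOST}/api/v2/accounts/{account_id}/jobs/{job_id}/run/"
    body = json.dumps({"cause": cause}).encode("utf-8")  # "cause" describes why the run started
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
    )


def trigger_job(account_id, job_id, token, cause="Triggered by upstream pipeline"):
    """Send the request and return the decoded JSON response."""
    request = build_trigger_request(account_id, job_id, token, cause)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder IDs and token; this only builds the request, it does not call the API.
example = build_trigger_request(12345, 67890, "YOUR_API_TOKEN", "Data load finished")
print(example.get_method(), example.full_url)
```

In a pipeline, the last step after loading and saving the data would call `trigger_job(...)` with real values, ideally reading the token from an environment variable rather than hard-coding it.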
When the run is done, we can look at it. Select the run to find its documentation. On the lower right, we can select "Lineage". Here is a picture of my run:
Next I created a continuous integration job. Continuous Integration (CI) is the practice of regularly merging development branches into a central repository, after which automated builds and tests are run. dbt allows us to enable CI on pull requests. When a PR is ready to be merged, dbt Cloud receives a webhook that starts a new run of the specified job. The run builds into a temporary schema, and the PR should not be merged until the run has completed successfully.
The way this works is we make a change in a file, then submit it and create a pull request using the large green button on the left. This will trigger the CI run. Once the run succeeds, someone will manually merge the PR on GitHub.
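The CI flow above can be sketched as a tiny model in Python: each pull request gets its own run in a throwaway schema, and the PR is only considered mergeable once that run succeeds. The class `CiRun`, the function `can_merge`, and the IDs are hypothetical. The schema naming pattern `dbt_cloud_pr_<job_id>_<pr_id>` is what I understand dbt Cloud to use for CI runs, but treat it as an assumption.

```python
from dataclasses import dataclass


@dataclass
class CiRun:
    job_id: int
    pr_number: int
    succeeded: bool

    @property
    def schema(self) -> str:
        # Assumed naming pattern for the temporary CI schema; verify against your own runs.
        return f"dbt_cloud_pr_{self.job_id}_{self.pr_number}"


def can_merge(run: CiRun) -> bool:
    """A PR should only be merged once its CI run has finished successfully."""
    return run.succeeded


run = CiRun(job_id=101, pr_number=7, succeeded=True)
print(run.schema)      # dbt_cloud_pr_101_7
print(can_merge(run))  # True
```

Because every PR builds into its own schema, a failing CI run never touches the "prod" dataset that the nightly job writes to.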
It's also important to know how to stop these runs. That wasn't covered in the video. If you don't stop them, they will run forever or until, like me, you take the class again and discover you have a dbt job that's still conducting nightly runs from two years ago. Go to the job you want to delete. Select "settings" in the upper right. Choose "edit" and scroll down to the bottom. You should see a delete button. Select this and tell it you really want to delete, and you're done!