Logo

dev-resources.site

for different kinds of informations.

Final project part 5

Published at
4/3/2024
Categories
dataengineering
dbt
lookerstudio
Author
cmcrawford2
Categories
3 categories in total
dataengineering
open
dbt
open
lookerstudio
open
Author
11 person written this
cmcrawford2
open
Final project part 5

dbt is the main part of my data engineering project for Data Talks Club's data engineering zoomcamp. After a few frustrating errors on my part, I finally figured out how to make models, where to put the staging models and where to put the core models, how to compile a seed file, and how to join it to the main file in order to produce data for visualization. I also used the git interface to continually upgrade my repository. This was extremely convenient and helpful.

First, I needed to join the voter activity file with the names of the parties that the voters were enrolled in. I had a .csv file provided by the Secretary of State's office that had a list of all the codes for the party designations together with the actual names of the parties. In the .csv file, these codes were three characters, but in my large table, I had converted them to strings of some standard size. So I had to trim them before joining in the first iteration of the voter activity model.

Here's the sql code that does this step:

    SELECT se.*, vd.party_name
    FROM state_elections se
    JOIN voter_designations vd ON trim(se.party_affiliation) = trim(vd.code)
Enter fullscreen mode Exit fullscreen mode

I also wanted to just look at the major elections - state elections in early November. In Massachusetts, we have presidential elections at the same time as the rest of the country, in years divisible by 4. We have elections for all statewide candidates in the even-numbered years in between. So I created a filter:

    SELECT *
    FROM {{ ref('staging_voter_data') }}
    WHERE election_type = 'STATE ELECTION'
      AND MOD(EXTRACT(YEAR FROM formatted_datetime), 2) = 0
      AND EXTRACT(MONTH FROM formatted_datetime) = 11
      AND EXTRACT(DAY FROM formatted_datetime) < 10
Enter fullscreen mode Exit fullscreen mode

It seems now like most of my errors were formatting errors in the sql commands, for example, putting a semicolon at the end of the whole command, when I had an option to limit the table to 100 records. These errors were frustrating, but I managed to solve them all.

I created two more models from the voter activity table. The first had an additional field of third party voters, and the second used this to create a percentage of third party voters in each state election. This last file was very small. I used it to create a chart in Google looker studio that listed the percentage of third party voters in each election. Google looker studio has a difficult user interface, so this was the only way I could figure out how to present this data.

Also, because I was still learning how to use dbt, I probably could have made these last two files into one file. It was mostly my struggles with Google looker studio that motivated the way I made these tables.

Finally, I made a schema file for reading the raw data into the staging files, which checked for null fields. The voter registration table did not have any null fields, which is a testimony to my ability to transform the file from a CD into a table. The first file had some null records. I managed to catch some of them early on and delete the voter and election, which didn't really matter since there were so many records (over 80,000,000).

dbt Article's
30 articles in total
Favicon
Parte 1: Introdução ao dbt
Favicon
Explorer l'API de 360Learning : de l'agilité de Power Query à la robustesse de la Modern Data Stack
Favicon
Cross-Project Dependencies Handling with DBT in AWS MWAA
Favicon
Building a User-Friendly, Budget-Friendly Alternative to dbt Cloud
Favicon
Working with Gigantic Google BigQuery Partitioned Tables in DBT
Favicon
dbt (Data Build Tool). Data Engineering Student's point of view.
Favicon
An End-to-End Guide to dbt (Data Build Tool) with a Use Case Example
Favicon
Avoid These Top 10 Mistakes When Using Apache Spark
Favicon
DBT and Software Engineering
Favicon
Analyzing Svenskalag Data using DBT and DuckDB
Favicon
Becoming an Analytics Engineer I
Favicon
Final project part 5
Favicon
Visualization in dbt
Favicon
Building a project in DBT
Favicon
DBT (Data Build Tool)
Favicon
Production and CI/CD in dbt
Favicon
Comparing Snowflake Dynamic Tables with dbt
Favicon
A 2024 Explainer dbt Core vs dbt Cloud (Enterprise)
Favicon
Testing and documenting DBT models
Favicon
Introduction to dbt
Favicon
Creating a Self-Service Data Model
Favicon
Simplifying Data Transformation in Redshift: An Approach with DBT and Airflow
Favicon
Avoiding the DBT Monolith
Favicon
How Starburst’s data engineering team builds resilient telemetry data pipelines
Favicon
Building ETL/ELT Pipelines For Data Engineers.
Favicon
Running Transformations on BigQuery using dbt Cloud: step by step
Favicon
Managing UDFs in dbt
Favicon
Building a Modern Data Pipeline: A Deep Dive into Terraform, AWS Lambda and S3, Snowflake, DBT, Mage AI, and Dash
Favicon
End-to-End Data Ingestion, Transformation and Orchestration with Airbyte, dbt and Kestra
Favicon
Build an Open Source LakeHouse with minimun code effort (Spark + Hudi + DBT+ Hivemetastore + Trino)

Featured ones: