dev-resources.site
for different kinds of informations.
Final project part 5
dbt is the main part of my data engineering project for Data Talks Club's data engineering zoomcamp. After a few frustrating errors on my part, I finally figured out how to make models, where to put the staging models and where to put the core models, how to compile a seed file, and how to join it to the main file in order to produce data for visualization. I also used the git interface to continually upgrade my repository. This was extremely convenient and helpful.
First, I needed to join the voter activity file with the names of the parties that the voters were enrolled in. I had a .csv file provided by the Secretary of State's office that had a list of all the codes for the party designations together with the actual names of the parties. In the .csv file, these codes were three characters, but in my large table, I had converted them to strings of some standard size. So I had to trim them before joining in the first iteration of the voter activity model.
Here's the sql code that does this step:
SELECT se.*, vd.party_name
FROM state_elections se
JOIN voter_designations vd ON trim(se.party_affiliation) = trim(vd.code)
I also wanted to just look at the major elections - state elections in early November. In Massachusetts, we have presidential elections at the same time as the rest of the country, in years divisible by 4. We have elections for all statewide candidates in the even-numbered years in between. So I created a filter:
SELECT *
FROM {{ ref('staging_voter_data') }}
WHERE election_type = 'STATE ELECTION'
AND MOD(EXTRACT(YEAR FROM formatted_datetime), 2) = 0
AND EXTRACT(MONTH FROM formatted_datetime) = 11
AND EXTRACT(DAY FROM formatted_datetime) < 10
It seems now like most of my errors were formatting errors in the sql commands, for example, putting a semicolon at the end of the whole command, when I had an option to limit the table to 100 records. These errors were frustrating, but I managed to solve them all.
I created two more models from the voter activity table. The first had an additional field of third party voters, and the second used this to create a percentage of third party voters in each state election. This last file was very small. I used it to create a chart in Google looker studio that listed the percentage of third party voters in each election. Google looker studio has a difficult user interface, so this was the only way I could figure out how to present this data.
Also, because I was still learning how to use dbt, I probably could have made these last two files into one file. It was mostly my struggles with Google looker studio that motivated the way I made these tables.
Finally, I made a schema file for reading the raw data into the staging files, which checked for null fields. The voter registration table did not have any null fields, which is a testimony to my ability to transform the file from a CD into a table. The first file had some null records. I managed to catch some of them early on and delete the voter and election, which didn't really matter since there were so many records (over 80,000,000).
Featured ones: