I built a data pipeline tool in Go

Published at: 12/23/2024
Categories: data, datascience, dataengineering, golang
Author: burakkarakan

Over the past few years, the data world has convinced itself that it needs many different tools to extract insights:

  • one tool to ingest data
  • another one to transform it
  • another one to check the quality
  • another one to orchestrate them all
  • another one for cataloging
  • another one for governance

The result? A very fragile, expensive, and rigid infrastructure with a terrible user experience. Teams have to build a lot of glue between these systems and get their different parts to talk to each other, all while trying to onboard analytical teams onto them.

Does it work? No.

Are we ready to have that conversation? I hope so.

Obsessing over impact

The engineering work behind building a ticking machine is very satisfying: small pieces that each do their part, and they work like a clock. It feels like an engineering marvel:

It is simple: you push code to the main branch, the backend automatically pulls the branch, pulls the DAGs and uploads them to S3. The sync sidecars on Airflow containers automatically pull from S3, which will then update the DAGs. When the DAG runs, for data ingestion jobs it will connect to our Airbyte deployment and trigger the ingestion from Airflow, then we create a sensor that waits until the ingestion is done. Then we connect to dbt Cloud to initiate some parts of the transformation jobs from the analytical team, if anything fails Airflow connects to our notification system to find the right team if they are defined on the catalog, if not we check our AD users and try to find a matching org to send a notification. Once the transformation is done then we execute our custom Python operators that do X and Y, then we provision a pod in our Kubernetes cluster to run quality checks. Our Kafka sinks are ingesting CDC data from the internal Postgres with Debezium in the meantime, then load them to the data lake in Parquet format, then we register them as Glue tables so that they can be queried, then the sensors in our Airflow clusters keep track of these states to run SQL transformations with the internal framework, and…

Sounds ridiculous, doesn’t it? It certainly does to me, yet this is a very common response when we ask engineering teams what their data infrastructure looks like. The joy they get out of building a house of cards is treated as far more important than the business impact being delivered. In the meantime, the analytics teams, data analysts, data scientists, and the business teams are waiting for their questions to be answered, trying to understand why it takes 6 weeks to get a new chart on their sales dashboard.

I am not sure if this is due to ZIRP or not, but it is pretty easy to spot organizations where highly inefficient engineering teams, coupled with engineering leaders who don’t know what their teams are doing, are ruling the game, while the people who create real value with data are left alone. They have to jump through a billion different tools, trying to figure out why their dashboard didn’t update, and wait for a response to their ticket from the central data team.

These are data analysts in business teams, a growth hacker running marketing campaigns over 5 different platforms, or an all-rounder data scientist who is trying to predict LTV. They are trying to create a real impact, but their progress is being heavily hindered by the internal toys.

We are building Bruin for these people: simpler data tooling for impact-obsessed teams.

Bruin CLI & VS Code Extension

Bruin CLI is an end-to-end data pipeline tool that brings together data ingestion, data transformation with SQL and Python, and data quality in a single framework.

Bruin is batteries-included:

  • 📥 ingest data with ingestr / Python
  • ✨ run SQL & Python transformations on many platforms
  • 📐 table/view materializations, incremental tables
  • 🐍 run Python in isolated environments using uv
  • 💅 built-in data quality checks
  • 🚀 Jinja templating to avoid repetition
  • ✅ validate pipelines end-to-end via dry-run
  • 👷 run on your local machine, an EC2 instance, or GitHub Actions
  • 🔒 secrets injection via environment variables
  • VS Code extension for a better developer experience
  • ⚡ written in Golang
  • 📦 easy to install and use

This means that, using Bruin, teams can build end-to-end workflows without having to resort to a bunch of different tools. It is extensible through SQL and Python, while its opinionated approach guides users toward building maintainable data pipelines.

One of the things that accompany Bruin CLI is our open-source Visual Studio Code extension:

Bruin VS Code Demo

The extension does a few things that make it pretty unique:

  • While everything in Bruin is driven with code, the extension adds a UI layer on top of it, which means you get:
    • visual documentation
    • rendered queries
    • column & quality checks
    • lineage
    • the ability to validate code & run backfills
    • syntax highlighting
  • everything happens locally, which means there are no external servers or systems that can access any of your data
  • the extension visualizes many of the configuration options, which makes it trivial to run backfills, validations, and more.

This is a good example of our design principles: everything is version-controlled, while also giving a better experience through a thoughtful UI.

The extension is a first-class citizen of the Bruin ecosystem, and we intend to expand its functionality further to make it the easiest platform to build data workloads out there.

Supported Platforms

Bruin supports many cloud data platforms out of the box at launch:

  • AWS Athena
  • Databricks
  • DuckDB
  • Google BigQuery
  • Microsoft SQL Server
  • Postgres
  • Redshift
  • Snowflake
  • Synapse

The list of platforms we support will grow over time. We always look forward to hearing community feedback on these, so feel free to share your thoughts with us in our Slack community.

Bruin Cloud

We are building Bruin for those obsessed with impact. You can go from zero to full data pipelines in minutes, and we are dedicated to making this experience even better. Using all of our open-source tooling, you can build and run all of your data workloads locally, on GitHub Actions, in Airflow, or anywhere else.

While we do believe there are many useful deployment options for Bruin CLI across different infrastructures, we are also obsessed with building the best managed experience for building and running Bruin workloads in production. That’s why we are building Bruin Cloud:
Lineage view on Bruin Cloud

It has quite a few niceties:

  • managed environment for ingestion, transformation, and ML workloads
  • column-level lineage
  • governance & cost reporting
  • team management
  • cross-pipeline dependencies
  • multi-repo “mesh”

and quite a few more. Feel free to drop your email to get a demo.

Share your thoughts

We are very excited to share Bruin CLI & VS Code Extension with the world, and we would love to hear from the community. We’d appreciate it if you shared your thoughts on what would make Bruin more useful for your needs.

https://github.com/bruin-data/bruin
