I built a data pipeline tool in Go
Over the past few years, the data world has convinced itself that it needs many different tools to extract insights:
- one tool to ingest data
- another one to transform it
- another one to check the quality
- another one to orchestrate them all
- another one for cataloging
- another one for governance
The result? A fragile, expensive, and rigid infrastructure that is terrible to work with. Teams have to build a lot of glue to get these systems talking to each other, all while trying to onboard analytical teams onto them.
Does it work? No.
Are we ready to have that conversation? I hope so.
Obsessing over impact
The engineering work behind building a ticking machine is very satisfying: small pieces that each do their part, and they work like a clock. It feels like an engineering marvel:
It is simple: you push code to the main branch, the backend automatically pulls the branch, pulls the DAGs and uploads them to S3. The sync sidecars on Airflow containers automatically pull from S3, which will then update the DAGs. When the DAG runs, for data ingestion jobs it will connect to our Airbyte deployment and trigger the ingestion from Airflow, then we create a sensor that waits until the ingestion is done. Then we connect to dbt Cloud to initiate some parts of the transformation jobs from the analytical team, if anything fails Airflow connects to our notification system to find the right team if they are defined on the catalog, if not we check our AD users and try to find a matching org to send a notification. Once the transformation is done then we execute our custom Python operators that do X and Y, then we provision a pod in our Kubernetes cluster to run quality checks. Our Kafka sinks are ingesting CDC data from the internal Postgres with Debezium in the meantime, then load them to the data lake in Parquet format, then we register them as Glue tables so that they can be queried, then the sensors in our Airflow clusters keep track of these states to run SQL transformations with the internal framework, and…
Sounds ridiculous, doesn't it? It certainly does to me, yet this is a very common response when we ask engineering teams what their data infrastructure looks like. The joy they get out of building a house of cards seems to matter more than the business impact being delivered. In the meantime, the analytics teams, data analysts, data scientists, and business teams are waiting for their questions to be answered, trying to understand why it takes 6 weeks to get a new chart on their sales dashboard.
I am not sure if this is due to ZIRP or not, but it is pretty easy to spot organizations where highly inefficient engineering teams, coupled with engineering leaders who don't know what their teams are doing, are ruling the game, while the people who create real value with data are left alone. They have to jump between a billion different tools, trying to figure out why their dashboard didn't update, and waiting for a response to their ticket from the central data team.
These are data analysts in business teams, a growth hacker running marketing campaigns over 5 different platforms, or an all-rounder data scientist who is trying to predict LTV. They are trying to create a real impact, but their progress is being heavily hindered by the internal toys.
We are building Bruin for these people: simpler data tooling for impact-obsessed teams.
Bruin CLI & VS Code Extension
Bruin CLI is an end-to-end data pipeline tool that brings together data ingestion, data transformation with SQL and Python, and data quality in a single framework.
Bruin is batteries-included:
- ingest data with ingestr / Python
- run SQL & Python transformations on many platforms
- table/view materializations, incremental tables
- run Python in isolated environments using uv
- built-in data quality checks
- Jinja templating to avoid repetition
- validate pipelines end-to-end via dry-run
- run on your local machine, an EC2 instance, or GitHub Actions
- secrets injection via environment variables
- VS Code extension for a better developer experience
- written in Golang
- easy to install and use
This means that, using Bruin, teams can build end-to-end workflows without having to resort to a bunch of different tools. It is extensible through SQL and Python, while its opinionated approach guides users toward building maintainable data pipelines.
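To make the idea of column-level quality checks concrete, here is a minimal sketch in plain Python. It is not Bruin's implementation or API: the function names, the `CHECKS` registry, and the sample `orders` rows are all hypothetical, and the only assumption is that a check takes some rows and a column name and reports what violates the rule.

```python
from collections import Counter
from typing import Any

Row = dict[str, Any]


def check_not_null(rows: list[Row], column: str) -> list[int]:
    """Return the indexes of rows where `column` is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_unique(rows: list[Row], column: str) -> list[Any]:
    """Return the non-null values that appear more than once in `column`."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return [value for value, n in counts.items() if n > 1]


# Hypothetical registry mapping a check name to the function that runs it.
CHECKS = {"not_null": check_not_null, "unique": check_unique}


def run_checks(rows: list[Row], column_checks: dict[str, list[str]]) -> dict[str, list]:
    """Run the named checks per column and collect only the ones that fail."""
    failures: dict[str, list] = {}
    for column, names in column_checks.items():
        for name in names:
            offending = CHECKS[name](rows, column)
            if offending:
                failures[f"{column}.{name}"] = offending
    return failures


# Illustrative sample data, not taken from any real pipeline.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 5.5},
]

failures = run_checks(
    orders,
    {"order_id": ["not_null", "unique"], "amount": ["not_null"]},
)
print(failures)
```

Declaring checks next to the columns they guard, rather than in a separate tool, is the kind of consolidation the list above describes: the same definition that produces a table can also say what a valid table looks like.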
One of the things that accompany Bruin CLI is our open-source Visual Studio Code extension:
The extension does a few things that make it pretty unique:
- While everything in Bruin is driven with code, the extension adds a UI layer on top of it, which means you get:
  - visual documentation
  - rendered queries
  - column & quality checks
  - lineage
  - the ability to validate code & run backfills
  - syntax highlighting
- everything happens locally, which means there are no external servers or systems that can access any of your data
- the extension visualizes a lot of the configuration options, which makes it trivial to run backfills, validations, and more.
This is a good example of our design principles: everything is version-controlled, while also giving a better experience through a thoughtful UI.
The extension is a first-class citizen of the Bruin ecosystem, and we intend to expand its functionality further to make it the easiest platform to build data workloads out there.
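Lineage, one of the views listed above, boils down to a dependency graph between assets. The sketch below is a generic illustration rather than how Bruin or the extension computes it: the asset names and the `depends_on` mapping are hypothetical, and it relies only on Python's standard `graphlib` module to derive a valid execution order and the upstream set of an asset.

```python
from graphlib import TopologicalSorter

# Hypothetical assets; each one maps to the set of assets it reads from.
depends_on = {
    "raw.orders": set(),
    "raw.customers": set(),
    "staging.orders": {"raw.orders"},
    "marts.customer_ltv": {"staging.orders", "raw.customers"},
}


def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return an order in which every asset comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def upstream(graph: dict[str, set[str]], asset: str) -> set[str]:
    """Return every asset that `asset` depends on, directly or transitively."""
    seen: set[str] = set()
    stack = list(graph.get(asset, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


order = execution_order(depends_on)
print(order)
print(upstream(depends_on, "marts.customer_ltv"))
```

Because the graph is derived from what each asset declares, the same information can drive both the visual lineage and the order in which a run or backfill executes.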
Supported Platforms
Bruin supports many cloud data platforms out of the box at launch:
- AWS Athena
- Databricks
- DuckDB
- Google BigQuery
- Microsoft SQL Server
- Postgres
- Redshift
- Snowflake
- Synapse
The list of platforms we support will grow over time. We always look forward to hearing community feedback on these, so feel free to share your thoughts with us in our Slack community.
Bruin Cloud
We are building Bruin for those obsessed with impact. You can go from zero to full data pipelines in minutes, and we are dedicated to making this experience even better. Using all of our open-source tooling, you can build and run all of your data workloads locally, on GitHub Actions, in Airflow, or anywhere else.
While we do believe there are many useful deployment options for Bruin CLI across different infrastructures, we are also obsessed with building the best managed experience for building and running Bruin workloads in production. That's why we are building Bruin Cloud:
Lineage view on Bruin Cloud
It has quite a few niceties:
- managed environment for ingestion, transformation, and ML workloads
- column-level lineage
- governance & cost reporting
- team management
- cross-pipeline dependencies
- multi-repo "mesh"
and quite a few more. Feel free to drop your email to get a demo.
Share your thoughts
We are very excited to share Bruin CLI & VS Code Extension with the world, and we would love to hear from the community. We'd appreciate it if you shared your thoughts on what would make Bruin more useful for your needs.