Using Apache Parquet to Optimize Data Handling in a Real-Time Ad Exchange Platform

Published at: 1/7/2025
Categories: bigdata, dataengineering, datascience, machinelearning
Author: mshidlov

A few years back, while working at cignal.io, I led the development of a real-time bidding platform for ad opportunities. This smart ad exchange managed a process called Real-Time Bidding (RTB). RTB is an automated system where advertisers bid in real time for the chance to display their ads to specific users visiting websites. When a partner sent an ad opportunity, our platform processed it through a series of real-time machine learning (ML) models to predict which advertising partner should receive the opportunity to bid. These models performed tasks like fraud detection, auction-winning prediction, matching advertising partners based on buying patterns, and identifying repeating opportunities. Ultimately, this system ensured that the highest bidder's ad was displayed, optimizing efficiency and relevance for advertisers and users alike.

The scale of the platform was staggering: it handled 100,000 to 150,000 ad opportunities per second, each represented as a JSON object of roughly 2-3 KB. Not every opportunity received a bid; in fact, around 40-50% were filtered out by predictive models and never sent forward. For the remaining opportunities, if a bid was placed and won the auction, a notification was generated. This activity resulted in over 1 TB of data every hour. The sheer volume posed significant challenges for training ML models, especially since more than 90% of the data consisted of opportunities without bids.

Initial Steps to Manage Data Volume

To address the data explosion, we implemented a selective data writing approach. Only a small percentage of the ad opportunities were written to storage, focusing primarily on those that resulted in bids. For these, we added a flag to indicate whether the opportunity was part of the reduced write set. This allowed us to maintain balanced statistical information—for example, the number of ad opportunities originating from New York—while significantly reducing the volume of stored data.
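A minimal sketch of that selective-write rule is shown below; the field names and the sampling rate are placeholders for illustration, since the article does not give the real ones:

    import random
    from typing import Optional

    # Hypothetical sampling rate for no-bid opportunities; the real value,
    # like the field names below, is not given in the article.
    NO_BID_SAMPLE_RATE = 0.05

    def select_for_storage(opportunity: dict, received_bid: bool) -> Optional[dict]:
        """Return an enriched record to persist, or None to drop the opportunity."""
        if received_bid:
            # Opportunities that led to a bid are always written.
            return {**opportunity, "sampled": False, "sample_rate": 1.0}
        if random.random() < NO_BID_SAMPLE_RATE:
            # A small, flagged fraction of no-bid traffic is kept so that aggregate
            # statistics (e.g. opportunities originating from New York) can be
            # reweighted by 1 / sample_rate later.
            return {**opportunity, "sampled": True, "sample_rate": NO_BID_SAMPLE_RATE}
        return None  # dropped: never reaches storage

Carrying the sampling rate inside each stored record keeps downstream counts reweightable without any separate bookkeeping.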

This strategy improved the preprocessing workflow for Spark, which we used to join the data fragments and prepare them for ML tasks. However, as the platform scaled, the demands on the Spark clusters grew and processing times increased. Delays in updating the models with new data degraded the quality of real-time predictions, and the rising resource costs reduced the platform's return on investment (ROI).
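For context, that preprocessing join could look roughly like the following PySpark sketch; the paths, column names, and the left join against win notifications are assumptions for illustration, not the platform's actual schema:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("rtb-preprocessing").getOrCreate()

    # Hypothetical input locations; the stored opportunities carry the
    # sampling flag described above.
    opportunities = spark.read.json("s3://ad-exchange/opportunities/dt=2025-01-07/")
    win_notifications = spark.read.json("s3://ad-exchange/win-notifications/dt=2025-01-07/")

    # Join each stored opportunity with its (possible) win notification to
    # label whether the bid ultimately won the auction.
    training_set = (
        opportunities.alias("opp")
        .join(
            win_notifications.alias("win"),
            F.col("opp.opportunity_id") == F.col("win.opportunity_id"),
            "left",
        )
        .withColumn("won_auction", F.col("win.opportunity_id").isNotNull())
    )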

Transitioning to Apache Parquet

To solve these issues, we transitioned to storing all our data in Apache Parquet. Parquet is an open-source, columnar storage file format optimized for large-scale data processing and analytics. Developed collaboratively by Twitter and Cloudera and inspired by Google’s Dremel paper, Parquet became a top-level Apache project in 2015. Its columnar structure and support for efficient compression and encoding schemes made it an ideal choice for our use case.

We chose Snappy as Parquet's compression codec, which balanced speed and compression efficiency. Parquet's columnar format stores the values of each column together, so similar data types sit next to each other, significantly improving compression ratios and reducing storage requirements. Because compression is applied within each column chunk rather than across the whole file, the Snappy-compressed files remained splittable at row-group boundaries and could be processed in parallel across our large Spark clusters. The columnar design also enabled selective reading of only the relevant columns during query execution, drastically reducing I/O and speeding up data processing.
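In Spark this amounts to little more than choosing the codec at write time and selecting columns at read time. The snippet below is a sketch under the same assumed paths and column names as above; Snappy is also Spark's default Parquet codec, so the option is shown only for explicitness:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("rtb-parquet-io").getOrCreate()

    # Hypothetical source: the joined training set produced by the preprocessing job.
    training_set = (
        spark.read.json("s3://ad-exchange/joined-training-set/dt=2025-01-07/")
        .withColumn("dt", F.lit("2025-01-07"))  # partition column for the example
    )

    # Write Snappy-compressed Parquet, partitioned so downstream jobs can prune files.
    (
        training_set.write
        .option("compression", "snappy")
        .partitionBy("dt")
        .mode("append")
        .parquet("s3://ad-exchange/training-data/")
    )

    # Read back only the columns a given model needs; Parquet's columnar layout
    # means untouched columns are never fetched from storage.
    features = spark.read.parquet("s3://ad-exchange/training-data/").select(
        "opportunity_id", "geo_city", "won_auction"
    )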

Benefits of Using Parquet

The switch to Parquet had a transformative impact on our platform:

  1. Reduced Resource Usage: The improved storage efficiency and compression reduced the amount of hardware and computational resources required for data processing.

  2. Faster Data Processing: By storing data in Parquet, we dramatically decreased the processing time for Spark jobs. This allowed us to update ML models more frequently, improving their real-time prediction accuracy.

  3. Enhanced Scalability: As our data flow grew, Parquet’s efficient format allowed us to handle increased volumes without proportional increases in infrastructure costs.

  4. Empowered Data Scientists: The ability to process larger volumes of data during research and testing enabled our data scientists to refine and enhance all of our ML models. Parquet's schema evolution support also allowed us to update data structures without breaking existing workflows, as sketched after this list.
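As one example of the schema evolution mentioned in the last point, Spark can reconcile Parquet files written before and after a column was added; the path below is illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rtb-schema-evolution").getOrCreate()

    # Older partitions were written without a newer feature column, newer ones with it.
    # mergeSchema reconciles both layouts into a single DataFrame, filling the missing
    # column with nulls for the older files.
    history = (
        spark.read
        .option("mergeSchema", "true")
        .parquet("s3://ad-exchange/training-data/")
    )
    history.printSchema()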

Conclusion

By adopting Apache Parquet and following its best practices, we not only overcame the challenges of scaling our ad exchange platform but also improved the overall efficiency and quality of our ML models. The shift to Parquet enhanced our ability to react to real-time changes in data, optimized resource usage, and provided our data science team with the tools to innovate further. This experience underscored the value of choosing the right data storage format for high-scale, data-intensive applications.

bigdata articles
30 articles in total

Rethinking distributed systems: Composability, scalability
When to use Apache Xtable or Delta Lake Uniform for Data Lakehouse Interoperability
Using Apache Parquet to Optimize Data Handling in a Real-Time Ad Exchange Platform
The Columnar Approach: A Deep Dive into Efficient Data Storage for Analytics 🚀
Building an Application with Change Data Capture (CDC) Using Debezium, Kafka, and NiFi
[Boost]
Please read out this article
Goodbye Kafka: Build a Low-Cost User Analysis System
MapReduce - A Simplified Approach to Big Data Processing
Query 1B Rows in PostgreSQL >25x Faster with Squirrels!
Introduction to Hadoop :)
Big Data Trends That Will Impact Your Business In 2025
The Heart of DolphinScheduler: In-Depth Analysis of the Quartz Scheduling Framework
SQL Filtering and Sorting with Real-life Examples
Platform to practice PySpark Questions
Big Data
Introduction to Data lakes: The future of big data storage
5 Effective Methods to Extract Images from Web Pages
The Apache Iceberg™ Small File Problem
System Design 09 - Data Partitioning: Dividing to Conquer Big Data
Understanding Star Schema vs. Snowflake Schema
How IoT and Big Data Work Together: A Powerful Synergy
Why Pangaea X is the Go-To Freelance Platform for Data Analysts
Introduction to Messaging Systems with Kafka
Best Practices for Data Security in Big Data Projects
🚀 Unlock the Power of ORC File Format 📊
🚀 Real-time YouTube Comment Sentiment Analysis with Kafka, Spark, Docker, and Streamlit 🚀
Bird Species
SeaTunnel-Powered Data Integration: How 58 Group Handles Over 500 Billion+ Data Points Daily
5 Big Data Use Cases that Retailers Fail to Use for Actionable Insights
