SeaTunnel-Powered Data Integration: How 58 Group Handles Over 500 Billion Data Points Daily

Published: 11/20/2024
Categories: datascience, apacheseatunnel, opensource, bigdata
Author: seatunnel

Introduction

In the digital age, data has become one of the most valuable assets for businesses. As a leading lifestyle service platform in China, 58 Group has been continuously exploring and innovating in the construction of its data integration platform. This article will detail the architectural evolution, optimization strategies, and future plans of 58 Group's data integration platform based on Apache SeaTunnel.

Challenges of the Data Integration Platform

Business Background

58 Group operates a wide range of businesses, and as they have grown rapidly, the volume of data from areas such as recruitment, real estate, second-hand housing, second-hand goods markets, local services, and information security has increased significantly. 58 Group needs to move and converge data between different data sources to achieve unified management, circulation, and sharing of data. This involves not only data collection, distribution, and storage but also applications such as offline computing, cross-cluster synchronization, and user profiling.

Currently, 58 Group processes over 500 billion messages daily, with peak message throughput exceeding 20 million and more than 1,600 integration tasks in operation. Handling such a massive volume of data presents significant challenges.

Challenges

In facilitating the flow and convergence of data between different sources and achieving unified management, circulation, and sharing of data, 58 Group faces challenges including:

  • High Reliability: Ensuring data is not lost under various fault conditions, maintaining data consistency, and keeping tasks running stably.
  • High Throughput: Handling large-scale data streams to achieve high concurrency and bulk data transfer.
  • Low Latency: Meeting the business needs for real-time data processing and rapid response.
  • Ease of Maintenance: Simplifying configuration and automating monitoring to reduce maintenance burdens, facilitate quick fault detection and resolution, and ensure long-term system availability.

The Evolution of Architecture

The architecture of 58 Group's data integration platform has undergone multiple evolutions to adapt to changing business needs and technological developments.

Early Architecture Overview

  • 2017: Used Flume for platform integration management.
  • 2018: Introduced Kafka Connect 1.0.
  • 2020: Upgraded to Kafka Connect 2.4, adding incremental load balancing and CDC (Change Data Capture).
  • 2023: Introduced Apache SeaTunnel, integrated into the real-time computing platform, and expanded various Source/Sink.

From 2017 to 2018, 58 Group's data integration platform adopted the Kafka Connect architecture, with data integration built around Kafka. It scaled horizontally through distributed processing, supporting Workers and Tasks running across multiple nodes; when a Worker failed, its tasks were automatically redistributed to other Workers, providing high availability. The architecture also supported automatic offset management and task and configuration management through a REST API.
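
For context, task and configuration management in this architecture goes through the Connect REST API. The snippet below is a minimal sketch (not 58 Group's actual tooling) of registering a sink connector by POSTing its configuration to a worker; the host, connector class, and topic names are placeholders.

```python
# Minimal sketch: register a sink connector with a Kafka Connect worker
# via its REST API. Host, connector class, and topic are placeholders.
import json
import urllib.request

connector = {
    "name": "demo-sink",                                      # hypothetical task name
    "config": {
        "connector.class": "com.example.DemoSinkConnector",   # placeholder class
        "tasks.max": "4",                                      # parallelism for this connector
        "topics": "demo_topic",                                # topic(s) to drain
    },
}

req = urllib.request.Request(
    "http://connect-worker:8083/connectors",                  # default Connect REST port
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))
```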

However, with the expansion of business volume and diversification of scenarios, this architecture encountered bottlenecks:

  1. Architectural Limitations
    • Inability to achieve end-to-end data integration.
  2. Coordinator Bottleneck Issues
    • Heartbeat Timeout: Worker-to-coordinator heartbeat timeouts trigger task rebalancing, causing temporary task interruptions.
    • Heartbeat Pressure: The coordinator must track worker states and manage a large amount of task metadata as Workers synchronize with it.
    • Coordinator Failure: Coordinator downtime affects task allocation and reallocation, causing task failures and decreased processing efficiency.
  3. Impact of Task Rebalance
    • Task Pause and Resume: Each rebalance pauses tasks, then reallocates them, leading to brief task interruptions.
    • Rebalance Storms: If multiple worker nodes frequently join or exit the cluster, or if network jitter causes heartbeat timeouts, frequent Rebalance can significantly affect task processing efficiency, leading to delays.

Given these shortcomings, 58 Group introduced Apache SeaTunnel in 2023, integrating it into the real-time computing platform so that Source and Sink connectors could be extended freely.

Current Architecture

Currently, 58 Group's data integration platform is built on the Apache SeaTunnel engine. It reads from Source connectors (Kafka, Pulsar, WMB, Hive, etc.), processes the data through SeaTunnel's built-in Transform features, and writes through Sink connectors to destinations such as Hive, HDFS, Kafka, Pulsar, WMB, MySQL, SR, Redis, HBase, Wtable, and MongoDB, while providing efficient task management, status management, task monitoring, intelligent diagnostics, and more.
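
To make the Source → Transform → Sink flow concrete, the sketch below writes a minimal SeaTunnel job configuration and submits it through the engine's command-line launcher. This is an illustrative example only: connector option names vary across SeaTunnel versions, the Console sink stands in for production destinations such as Hive or Redis, and the hosts and topics are placeholders.

```python
# Illustrative sketch: define a minimal SeaTunnel job config and submit it.
# Option names, hosts, and topics are placeholders, not production settings.
import pathlib
import subprocess

JOB_CONF = """
env {
  parallelism = 2
  job.mode = "STREAMING"
}

source {
  Kafka {
    bootstrap.servers = "kafka-broker:9092"   # placeholder brokers
    topic = "demo_events"                     # placeholder topic
    format = json
  }
}

# Built-in Transform plugins (field mapping, filtering, etc.) would go here.

sink {
  Console {}    # stand-in for production sinks such as Hive, HDFS, or Redis
}
"""

conf_path = pathlib.Path("demo_job.conf")
conf_path.write_text(JOB_CONF)

# Assumes the launcher script shipped with the SeaTunnel engine; adjust the
# path and engine flags for the actual deployment.
subprocess.run(["bin/seatunnel.sh", "--config", str(conf_path)], check=True)
```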

Image description

Smooth Migration and Performance Tuning

Smooth Migration

When introducing Apache SeaTunnel, 58 Group needed to migrate the data integration platform smoothly, minimizing the impact on users and the business while ensuring data consistency: formats and paths had to remain unchanged, and no data could be lost.

This goal brought its own costs and risks: the data source format of every task had to be understood and confirmed, and the migration itself involved multiple steps, making it complex and time-consuming.

To address this, 58 Group took the following measures:

  1. For sources, add a RawDeserializationSchema to stay compatible with unstructured data.
  2. For destinations, for example using the HDFS sink to write Hive data, stay compatible with the existing partition loading and paths.
  3. Develop automatic migration tools (see the sketch after this list):
    • Automatically generate SeaTunnel task configurations from the corresponding Kafka Connect task configurations.
    • Take down the original tasks, reset offsets, and start the new tasks.
    • Verify and check the results.
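
A minimal sketch of what the configuration-generation step might look like, assuming a hypothetical mapping from Kafka Connect connector settings to a SeaTunnel job file (the field names, template, and mapping are illustrative, not 58 Group's actual migration tool):

```python
# Hypothetical translation of a Kafka Connect connector config into a
# SeaTunnel-style job config. Field names and the template are illustrative.
def connect_to_seatunnel(connect_config: dict) -> str:
    """Render a SeaTunnel-style job config from a Kafka Connect config dict."""
    cfg = connect_config["config"]
    topic = cfg["topics"]                                    # topic drained by the old task
    servers = cfg.get("bootstrap.servers", "kafka-broker:9092")
    return f"""
env {{
  parallelism = {cfg.get('tasks.max', '1')}
  job.mode = "STREAMING"
}}
source {{
  Kafka {{
    bootstrap.servers = "{servers}"
    topic = "{topic}"
    format = json
  }}
}}
sink {{
  Console {{}}   # a real migration would map this to the original sink type
}}
"""


if __name__ == "__main__":
    old_task = {
        "name": "demo-sink",
        "config": {"topics": "demo_events", "tasks.max": "4"},
    }
    print(connect_to_seatunnel(old_task))
```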

Performance Tuning

58 Group also carried out several performance optimizations on the data integration platform, including:

  • Adding Pulsar Sink Connector: To increase throughput.
  • Supporting Array Data: Enhancing HbaseSink compatibility.
  • Supporting Expiration Time Setting: Optimizing RedisSink.
  • Increasing PulsarSource Throughput: Optimizing the compression method of file connectors.
  • Fixing KafkaSource Parsing Issues: Enhancing the configuration flexibility of Kafka clients.

Monitoring and Operations Automation

Additionally, 58 Group improved the stability and efficiency of the data integration platform through monitoring and operations automation:

  • Task Monitoring: Real-time monitoring of task status to detect and resolve failures quickly (a simplified polling sketch follows below).
  • Operations Automation: Reducing manual intervention through automated tools to increase operational efficiency.
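
As a simplified illustration of such task monitoring, the sketch below periodically polls a job-status HTTP endpoint and flags failed jobs. The endpoint URL and response fields are hypothetical placeholders rather than a documented SeaTunnel or 58 Group API; a real deployment would use whatever status interface or metrics its engine and platform expose.

```python
# Hypothetical task-status poller. The URL and JSON fields are placeholders,
# not a documented SeaTunnel or 58 Group API.
import json
import time
import urllib.request

STATUS_URL = "http://integration-platform.example.com/api/jobs"  # placeholder endpoint

def poll_once() -> None:
    """Fetch the job list once and report any job marked as failed."""
    with urllib.request.urlopen(STATUS_URL) as resp:
        jobs = json.loads(resp.read().decode("utf-8"))
    for job in jobs:
        if job.get("status") == "FAILED":             # hypothetical status field
            # A real system would page on-call staff or trigger an auto-restart.
            print(f"ALERT: job {job.get('name')} failed")

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60)    # polling interval; tune to the desired detection latency
```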

Future Plans

58 Group has clear plans for the future development of the data integration platform:

  • Continuously Improve Intelligent Diagnostics: Enhance fault diagnosis accuracy and efficiency through machine learning and artificial intelligence technologies.
  • Cloud and Containerization Upgrade: Migrate the data integration platform to the cloud environment and implement containerized deployment to improve resource utilization and flexibility.

Conclusion

The architectural evolution and optimization of 58 Group's data integration platform is a continuous process of iteration and innovation. Through ongoing technological exploration and practice, 58 Group has successfully built an efficient, stable, and scalable data integration platform based on Apache SeaTunnel, providing strong data support for business development. In the future, 58 Group will continue to delve deeper into the field of data integration to provide better services for its users.
