Goodbye Kafka: Build a Low-Cost User Analysis System

Published at: 12/5/2024
Categories: database, kafka, bigdata
Author: ksanaka

User behavior data is a vital source for data warehouses and a key asset for businesses. It typically includes two main sources: behavior logs and upstream relational databases (e.g., MySQL). These data enable user growth analysis, behavior research, and precise troubleshooting of user issues.

Challenges in User Behavior Data Analysis

The unique characteristics of user behavior data analysis make building a scalable, flexible, and cost-effective architecture challenging. Key difficulties include:

  • High Traffic and Large Volume: Massive data generation requires robust storage and analysis capabilities.
  • Diverse Analysis Needs: The system must support both static BI reporting and flexible ad-hoc queries.
  • Varied Data Formats: Includes both structured and semi-structured data (e.g., JSON).
  • Real-Time Requirements: Rapid responses to user behavior for timely feedback.

Because of these complexities, most startups and small-to-medium businesses start with general-purpose tracking systems such as Google Analytics or Mixpanel. These systems collect and upload tracking data automatically through JavaScript snippets embedded in websites or SDKs embedded in apps, and they generate metrics such as visits, session duration, and conversion funnels.

While general-purpose tracking systems are simple and easy to use, they have the following drawbacks:

  • Lack of Detailed Data: These systems typically don't provide detailed access logs, limiting users to predefined reports in the UI.
  • Limited Custom Querying: Without a standard SQL interface, data scientists find it difficult to write complex ad-hoc queries.
  • Rapidly Rising Costs: With tiered pricing models, costs can double at higher tiers. As traffic grows, querying larger datasets leads to significant expense increases.

Complexities of Building a Self-Hosted User Behavior Analysis System

To overcome the limitations of general tracking systems, many businesses choose to build their own user behavior analysis systems as they scale. Traditional self-hosted architectures are often based on the Hadoop ecosystem, with a typical workflow as follows:

  1. Embed SDKs in clients (apps or websites) to collect user activity logs.
  2. Use an activity gateway to gather logs from clients and forward them to the Kafka message bus.
  3. Load logs from Kafka into processing engines such as Hive or Spark.
  4. Import data into a data warehouse using ETL tools to generate user behavior analysis reports.

While this architecture meets functional requirements, it is highly complex and costly to maintain:

  • Kafka has traditionally relied on ZooKeeper and needs SSDs for good performance.
  • Kafka Connect is needed to move data from Kafka to the data warehouse.
  • Spark runs on YARN, and ETL processes require Airflow to manage them.
  • When Hive's metadata outgrows MySQL, MySQL may need to be replaced with a distributed database such as TiDB.

This architecture demands significant technical team resources and greatly increases operational burdens. In a business environment focused on cost reduction and efficiency, traditional Hadoop architectures are no longer suitable for simple, efficient use cases.

New Option: Lightweight User Behavior Analysis with Databend Cloud

With technological advancements, businesses now have a new option when designing user behavior tracking architectures. Databend Cloud offers an efficient and cost-effective solution for user behavior analysis, thanks to its simple architecture and flexibility.

Databend Cloud Architecture Features

  • 100% object storage-based with complete storage-compute separation, significantly reducing storage costs.
  • Query engine written in Rust for high performance and low cost. It automatically enters sleep mode when compute resources are idle, avoiding extra charges.
  • Fully supports ANSI SQL and semi-structured data analysis (JSON and custom UDFs). Complex JSON data can be analyzed using built-in JSON analysis capabilities or custom UDFs.
  • Built-in task scheduling for ETL, completely stateless, and automatically scalable.
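
To make the semi-structured analysis point concrete, here is a minimal sketch of the kind of ad-hoc question such a system has to answer: pulling a nested field out of JSON records and counting by it. It is written in plain Python for illustration only and says nothing about how Databend Cloud implements JSON functions. The record layout (`event`, `props.path`) and the helper names `get_path` and `count_by` are assumptions made for this example.

```python
import json
from collections import Counter


def get_path(record, path, default=None):
    """Follow a dotted path such as 'props.path' through nested dicts."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def count_by(ndjson_text, path):
    """Count records per value found at `path`, skipping records without it."""
    counts = Counter()
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        value = get_path(json.loads(line), path)
        if value is not None:
            counts[value] += 1
    return counts


# Hypothetical sample records; real tracking logs will have their own schema.
sample = "\n".join([
    '{"event": "page_view", "props": {"path": "/pricing"}}',
    '{"event": "page_view", "props": {"path": "/docs"}}',
    '{"event": "page_view", "props": {"path": "/pricing"}}',
    '{"event": "click", "props": {"target": "signup"}}',
])

page_counts = count_by(sample, "props.path")
```

In a SQL warehouse the same question would be a `GROUP BY` over a JSON path expression; the point of the sketch is only that records with different shapes (the `click` event has no `props.path`) can sit side by side and still be queried.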

Typical Architecture Implementation

Businesses can quickly set up a user behavior analysis system with the following process:

  • Log Collection and Storage: Kafka is no longer needed; users can write tracking logs directly to S3 in NDJSON format using Vector.
  • Data Ingestion and Processing: Create a copy task in Databend Cloud to pull logs from S3 automatically. Typically, the S3 location is registered as a stage in Databend Cloud; data is ingested from it for processing and can also be exported back to S3.
  • Query and Report Analysis: Run BI reports or ad-hoc queries using the warehouse, which automatically sleeps when idle, incurring no costs during downtime.
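
For readers unfamiliar with NDJSON, the format is simply one JSON object per line. The sketch below shows how tracking events might be serialized that way before a log shipper uploads them. The event fields (`ts`, `user_id`, `event`, `props`) and the helper name `to_ndjson` are illustrative assumptions, not a schema taken from the article.

```python
import json


def to_ndjson(events):
    """Serialize an iterable of dicts as NDJSON: one compact JSON object per line."""
    return "".join(
        json.dumps(event, separators=(",", ":"), ensure_ascii=False) + "\n"
        for event in events
    )


# Hypothetical tracking events; field names are assumptions for this example.
events = [
    {"ts": "2024-12-05T08:00:00Z", "user_id": 42, "event": "page_view",
     "props": {"path": "/pricing"}},
    {"ts": "2024-12-05T08:00:03Z", "user_id": 42, "event": "click",
     "props": {"target": "signup"}},
]

ndjson_text = to_ndjson(events)

# Reading it back is a line-by-line json.loads, which is what makes the
# format convenient for append-only logs and batch loading.
parsed = [json.loads(line) for line in ndjson_text.splitlines()]
```

Because `json.dumps` escapes newlines inside string values, each record stays on a single line, so files can be split or concatenated at line boundaries without re-parsing.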

Use Case

A typical internet application company had a user behavior analysis scenario and chose Databend Cloud to build its analysis system. After adopting Databend Cloud, the company dropped Kafka, wrote user behavior logs to S3, and created a stage in Databend Cloud pointing at that location. It then used a task to ingest the logs into Databend Cloud. The company completed the POC in a single afternoon, moving from a complex Hadoop architecture to Databend Cloud, which simplified maintenance and reduced operational costs.

The preparation required from the user was straightforward. First, they set up two warehouses: one for task-based data ingestion and one for BI report queries. Typically, a smaller warehouse is used for data ingestion, while a larger warehouse is used for queries. This setup helps save costs because the larger query warehouse runs only when queries are issued, not continuously.

Next, click Connect to obtain a connection string, which BI tools can use to run queries. Databend provides drivers for various programming languages.

The remaining setup involves three steps:

  1. Create a table with fields matching the NDJSON log format.
  2. Create a stage to link the S3 directory containing the user behavior logs.
  3. Create a task that runs on a fixed schedule, for example every minute or every ten seconds. This task automatically ingests new files from the stage and cleans them up afterward.

Once the setup is complete, user behavior logs will continuously be ingested.
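
The three steps above can be sketched as SQL statements. The Python helper below only builds the statement text; it does not connect to anything. The statement shapes follow Databend's documented `CREATE STAGE`, `COPY INTO ... FILE_FORMAT = (TYPE = NDJSON) PURGE = TRUE`, and `CREATE TASK ... SCHEDULE` forms as best I understand them, so verify them against the current documentation before use. The table columns, object names, bucket URL, and credential placeholders are all hypothetical.

```python
def build_setup_sql(table, stage, s3_url, task, warehouse, interval="1 MINUTE"):
    """Return the three setup statements: table, stage, and scheduled copy task."""
    # Step 1: a table whose columns match the (assumed) NDJSON log fields.
    create_table = (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "    ts TIMESTAMP,\n"
        "    user_id BIGINT,\n"
        "    event VARCHAR,\n"
        "    props VARIANT\n"
        ")"
    )
    # Step 2: a stage pointing at the S3 directory that holds the logs.
    # Credentials are placeholders and must never be hard-coded in real code.
    create_stage = (
        f"CREATE STAGE IF NOT EXISTS {stage}\n"
        f"    URL = '{s3_url}'\n"
        "    CONNECTION = (ACCESS_KEY_ID = '<key>' SECRET_ACCESS_KEY = '<secret>')"
    )
    # Step 3: a task that copies new files on a schedule.
    # PURGE = TRUE asks for loaded files to be removed from the stage afterward.
    copy_into = (
        f"COPY INTO {table} FROM @{stage} "
        "FILE_FORMAT = (TYPE = NDJSON) PURGE = TRUE"
    )
    create_task = (
        f"CREATE TASK IF NOT EXISTS {task}\n"
        f"    WAREHOUSE = '{warehouse}'\n"
        f"    SCHEDULE = {interval}\n"
        f"AS {copy_into}"
    )
    return [create_table, create_stage, create_task]


# Hypothetical names throughout.
statements = build_setup_sql(
    table="user_events",
    stage="events_stage",
    s3_url="s3://example-bucket/tracking-logs/",
    task="ingest_user_events",
    warehouse="ingest_wh",
    interval="10 SECOND",
)
```

Each statement can then be run from a SQL worksheet or through one of the drivers. Depending on the platform's defaults, a newly created task may have to be resumed explicitly before it starts running, so it is worth checking the task's state after creation.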

Comparisons

By comparing general tracking systems, traditional Hadoop architectures, and Databend Cloud, the advantages of Databend Cloud are clear:

  • Architectural Simplicity: Eliminates the need for complex big data components such as Kafka and Airflow.
  • Cost Optimization: Leverages object storage and elastic computing to achieve low-cost storage and analysis.
  • Flexibility and Performance: Supports high-performance SQL queries to meet diverse business scenarios.

Additionally, Databend Cloud provides a snapshot mechanism with time travel, ensuring data security and recoverability.

When building a user behavior tracking system, maintenance costs are as important as storage and compute costs. Databend's architecture, which separates storage and compute, simplifies traditional user behavior data analysis systems. Enterprises can easily build a high-performance, low-cost tracking and analysis architecture, optimizing the entire process from data collection to analysis. This solution helps businesses reduce costs while maximizing data value.
