Mastering Apache Kafka: A Complete Guide to the Heart of Real-Time Data Streaming

Published at
11/22/2024
Categories
kafka
producerconsumer
apachekafka
Author
renukapatil

In today’s world, where real-time data drives business decisions and consumer experiences, mastering Kafka is essential for anyone working with large-scale data systems. Whether you're building scalable data pipelines, powering analytics, or developing real-time applications, Kafka is at the core of it all. But what exactly is Kafka? How does it work? And why is it so popular for handling massive streams of data?

In this comprehensive guide, we’ll unravel the mysteries of Kafka, from setting up a Kafka cluster with multiple brokers to understanding complex concepts like partitions, consumer offsets, and replication. Whether you're just getting started or looking to sharpen your skills, this guide will take you through every critical aspect of Kafka that you need to know to handle your real-time data challenges like a pro.

Let’s dive in!


Apache Kafka is a powerful distributed event-streaming platform, widely used for real-time data processing. For beginners, Kafka’s terminologies can feel overwhelming, but they are key to understanding how Kafka works. In this blog, we’ll demystify Kafka concepts such as Cluster, Broker, Producer, Consumer, Topics, Partitions, Streams, and Connect, and walk through their functionalities in a simple, step-by-step manner.


What is Apache Kafka?

Kafka is a distributed system designed to process large streams of data efficiently. It acts as a middleman, enabling data exchange between different systems in real time. Imagine a newspaper delivery system: the producer is the printing press, the consumer is the reader, and Kafka is the delivery system ensuring newspapers reach readers on time.


Kafka Cluster

A Kafka Cluster is a group of Kafka Brokers working together.

  • Each broker is a Kafka server that handles read and write requests from clients and stores data.
  • For fault tolerance and scalability, multiple brokers collaborate in a cluster.
  • Example: Imagine multiple warehouses working together to store and distribute products. These warehouses are your brokers, and the collective system is the Kafka cluster.

Kafka Broker

A Broker is a single Kafka server.

  • Each broker has a unique ID and is responsible for storing specific portions of data.
  • Each broker hosts a subset of the topic partitions, so incoming data (messages) is spread across the brokers by topic and partition.
  • Even if one broker goes down, the cluster can recover using replicated data stored on other brokers.

Kafka Producer

A Producer is an application or system that sends new data to Kafka.

  • Think of the producer as the publisher of newspapers in our analogy.
  • It sends data to specific topics within the Kafka cluster. For instance:
    • A weather app sending live temperature data to Kafka.
    • An e-commerce website logging user activity for real-time analysis.

Kafka Consumer

A Consumer is an application that reads data from Kafka.

  • Consumers subscribe to specific topics and process the incoming data.
  • For example, a stock trading app might consume live market data to update prices on the user’s screen.

Kafka Topics

A Topic is a category or feed name to which messages are sent.

  • Topics are like tables in a database or folders in a file system.
  • Producers send data to topics, and consumers read data from them. Example: A "Weather" topic contains weather-related updates. A "Stock Prices" topic stores live market data.

Kafka Partitions

Partitions break a topic into smaller parts so that it can be spread across brokers and read in parallel.

  • Each topic is divided into one or more partitions.
  • Example: Think of partitions as pages of a book within a topic.
  • Each page holds a portion of the topic's data.
  • Benefits of partitions:
    • Parallel processing: Multiple consumers can read from different partitions simultaneously.
    • Fault tolerance: Each partition can be replicated to other brokers, so its data survives a broker failure.

Partition Data Order
Kafka ensures the order of messages is maintained within each partition but not across the topic as a whole.


Kafka Connect

Kafka Connect allows you to integrate Kafka with other systems without writing code.

  • It’s used to move data in and out of Kafka, such as importing data from a database or exporting data to a data warehouse.
  • Example: If you want to sync data from your MySQL database into Kafka for real-time processing, Kafka Connect can handle this without requiring you to write complex scripts.

Kafka Streams

Kafka Streams is a library for building stream processing applications.

  • It allows you to transform, aggregate, or filter data as it flows through Kafka.
  • Example: Imagine you have a stream of purchase data. You can use Kafka Streams to calculate real-time sales trends, such as total revenue per minute.
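
To make the revenue-per-minute example concrete, here is a minimal Python sketch of the same windowed aggregation. It is not the Kafka Streams API (that is a Java library); the function name revenue_per_minute and the sample purchases are assumptions made for illustration only.

```python
from collections import defaultdict


def revenue_per_minute(purchases):
    """Sum purchase amounts into one-minute windows.

    purchases: iterable of (timestamp_in_seconds, amount) pairs.
    Returns a dict mapping each window's start second to its total.
    """
    totals = defaultdict(float)
    for timestamp_seconds, amount in purchases:
        # Round the timestamp down to the start of its minute.
        window_start = timestamp_seconds - (timestamp_seconds % 60)
        totals[window_start] += amount
    return dict(totals)


# Hypothetical purchase events: (seconds since start, amount).
purchases = [(0, 10.0), (30, 5.0), (61, 20.0), (119, 1.5), (120, 4.0)]
print(revenue_per_minute(purchases))
# → {0: 15.0, 60: 21.5, 120: 4.0}
```

A real Kafka Streams application would express the same idea with a grouped, windowed aggregation over a topic rather than an in-memory list.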


Setting Up Kafka

Here’s a brief overview of how to set up Kafka on your system:

1.Download Apache Kafka

Visit the Apache Kafka website and download the latest version.
Extract the downloaded files to a folder on your computer, for example directly under the C: drive.

2.Run Zookeeper

  • In ZooKeeper mode, which this guide uses, Kafka relies on ZooKeeper to coordinate its brokers.
  • Start ZooKeeper using the provided batch script.
  • My Kafka files are in C:\kafka_2.13-3.9.0\bin\windows, and I run all commands from there.
zookeeper-server-start.bat ..\..\config\zookeeper.properties

3.Start Kafka Broker

  • Start Kafka using the broker configuration:
kafka-server-start.bat ..\..\config\server.properties

4.Create a Topic

  • Create a new topic to send and receive messages:
kafka-topics.bat --create --topic my-topic --bootstrap-server localhost:9092 --replication-factor 1 --partitions 3

After running the above command, you will see this message:

Image description

5.Produce and Consume Messages

  • Start a producer to send messages:
kafka-console-producer.bat --bootstrap-server localhost:9092 --topic my-topic
  • Start a consumer to read messages:
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic my-topic --from-beginning

You will now have these 4 command prompts open:

Image description

Let's play around with the producer and consumer:
In the producer prompt I type values such as mango and guava, and the consumer prints them as they arrive.

Image description

  • The consumer command has two variants. If I don't want messages from the beginning, the command becomes:
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic my-topic

Here the consumer receives only the messages produced after it started, which in this run are sitafal and apple:

Image description
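
The difference between the two commands comes down to where a brand-new consumer starts reading. Here is a minimal Python sketch, assuming a single partition modelled as a list and a hypothetical helper named start_offset:

```python
def start_offset(log, from_beginning):
    """Return the position at which a brand-new consumer starts reading.

    With from_beginning the consumer replays the whole log; without it,
    the consumer starts at the current end and only sees later messages.
    """
    return 0 if from_beginning else len(log)


# Messages already in the topic before the consumers start.
log = ["mango", "guava"]

replay_position = start_offset(log, from_beginning=True)
tail_position = start_offset(log, from_beginning=False)

# Messages produced after both consumers have started.
log += ["sitafal", "apple"]

print(log[replay_position:])  # → ['mango', 'guava', 'sitafal', 'apple']
print(log[tail_position:])    # → ['sitafal', 'apple']
```

This sketch ignores committed offsets: a consumer that rejoins an existing group resumes from that group's committed position instead.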


Sending Kafka Messages with Key via Command Line

In a Kafka setup, message ordering, partitioning, and the use of keys play critical roles in managing data flow and ensuring data integrity. Below, we discuss how Kafka messages behave with and without keys, followed by practical steps to implement these concepts using the command line.

Understanding Message Partitioning and Ordering

1.Without Keys:

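When you send messages without a key, the producer spreads them across the partitions. Older clients did this round-robin, one message per partition in turn; since Kafka 2.4 the default partitioner is "sticky" and fills a batch for one partition before moving on to the next. Either way the load is balanced, but message order is not maintained across partitions.

Example (round-robin): A topic has two partitions (P1, P2). Messages M1, M2, M3, M4 are distributed alternately: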

  • M1 -> P1
  • M2 -> P2
  • M3 -> P1
  • M4 -> P2

In such scenarios, a consumer reads from several partitions at once, so the order in which it sees messages from different partitions is not guaranteed.
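
The alternating distribution above can be reproduced with a few lines of Python. This is only a simulation of the round-robin idea (the function name round_robin_assign is made up for the example), not the producer's actual partitioner code:

```python
from itertools import cycle


def round_robin_assign(messages, num_partitions):
    """Place keyless messages on partitions one at a time, in turn."""
    partitions = [[] for _ in range(num_partitions)]
    targets = cycle(range(num_partitions))
    for message in messages:
        partitions[next(targets)].append(message)
    return partitions


partitions = round_robin_assign(["M1", "M2", "M3", "M4"], num_partitions=2)
for number, log in enumerate(partitions, start=1):
    print(f"P{number}: {log}")
# → P1: ['M1', 'M3']
# → P2: ['M2', 'M4']
```

Within P1 the order M1, M3 is preserved, and within P2 the order M2, M4 is preserved, but nothing records that M2 came before M3.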

2.With Keys:
When messages are sent with a key, Kafka determines the target partition by hashing the key (murmur2 in the default partitioner) and taking the result modulo the number of partitions. Messages with the same key always go to the same partition as long as the partition count does not change, ensuring ordered delivery for those keys.

  • When sending messages with a key, ordering is maintained because they all land in the same partition.
  • Without a key we cannot guarantee the ordering of messages, because the consumer polls messages from all the partitions at the same time.

Example:

  • Key: order123, Messages: M1, M2, M3 -> All go to the same partition (e.g., P1).
  • Key: userXYZ, Messages: M4, M5 -> These go to another partition (e.g., P2).
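
Here is a small Python sketch of key-based partitioning. It uses zlib.crc32 as a stand-in hash, which is an assumption for illustration: Kafka's default partitioner uses murmur2, so the real partition numbers will differ. The function name partition_for is hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition: hash the key bytes, then take the remainder.

    zlib.crc32 is a stand-in hash; Kafka's default partitioner uses murmur2.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


records = [
    ("order123", "M1"), ("order123", "M2"), ("order123", "M3"),
    ("userXYZ", "M4"), ("userXYZ", "M5"),
]

# Group the records by the partition their key hashes to.
partitions = {}
for key, value in records:
    partitions.setdefault(partition_for(key, 4), []).append((key, value))

for partition, log in sorted(partitions.items()):
    print(partition, log)
```

Whatever partition a key maps to, every record with that key lands there in send order. Two different keys may still share a partition, which is where uneven load can come from.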

Practical Command Line Steps

1.Starting Kafka:
Ensure Kafka brokers and ZooKeeper are running.

zookeeper-server-start.bat ..\..\config\zookeeper.properties

kafka-server-start.bat ..\..\config\server.properties


2.Create a Topic:
Create a topic named fruits with 4 partitions:

kafka-topics.bat --create --topic fruits --bootstrap-server localhost:9092 --replication-factor 1 --partitions 4

3.Start producer and consumer

producer:

kafka-console-producer.bat --bootstrap-server localhost:9092 --topic fruits --property "key.separator=-" --property "parse.key=true"

consumer:

kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic fruits --from-beginning --property "key.separator=-" --property "print.key=false"

With print.key=false the consumer prints only the values; set it to true to print each key alongside its value.

4.Send Messages With Key:
To send messages with keys, type a key and a value separated by the delimiter (here, a hyphen).

Enter key-value pairs, such as:

hello-apple
hello-banana
hello-kiwi
bye-mango
bye-guava

5.Consume Messages:
The consumer started in step 3 prints the messages as they arrive. Messages that share a key are read in the order they were sent:

Image description


Key Takeaways

  • Use keys so that related messages land in the same partition and keep their order.
  • Understand the trade-offs: Keyless sending spreads messages evenly, while keys preserve per-key ordering but may lead to uneven partition loads.
  • Leverage Kafka's consumer offset management for reliable processing.

Advanced Concepts: Consumer Groups and Offsets
Kafka uses consumer offsets to track the progress of message consumption.

1.Offsets and Reliability:

Kafka maintains an internal topic (__consumer_offsets) that stores the latest offset for each partition a consumer group has processed. If a consumer fails and restarts, it resumes from the last committed offset.
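
The resume-after-restart behaviour can be sketched in Python as follows. The OffsetStore class is a hypothetical in-memory stand-in for the __consumer_offsets topic, and treating a group with no committed offset as starting at 0 is an assumption of this sketch.

```python
class OffsetStore:
    """In-memory stand-in for __consumer_offsets: (group, partition) -> next offset."""

    def __init__(self):
        self._committed = {}

    def commit(self, group, partition, next_offset):
        self._committed[(group, partition)] = next_offset

    def fetch(self, group, partition):
        # Assumption: a group with no committed offset starts at 0.
        return self._committed.get((group, partition), 0)


def consume(log, store, group, partition, max_records):
    """Read up to max_records from the committed position, then commit."""
    start = store.fetch(group, partition)
    batch = log[start:start + max_records]
    # The committed value is the offset of the next record to read.
    store.commit(group, partition, start + len(batch))
    return batch


log = ["m0", "m1", "m2", "m3", "m4"]
store = OffsetStore()

print(consume(log, store, "group-a", 0, max_records=3))  # → ['m0', 'm1', 'm2']
# The consumer "crashes" here; a restarted one in the same group continues.
print(consume(log, store, "group-a", 0, max_records=3))  # → ['m3', 'm4']
# A different group has its own offsets and starts from its own position.
print(consume(log, store, "group-b", 0, max_records=2))  # → ['m0', 'm1']
```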

2.Consumer Groups:

Multiple consumers in the same group divide partition consumption among themselves, ensuring efficient data processing. Consumers in different groups can independently consume messages from the same topic.
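
To illustrate how one group shares the partitions of a topic, here is a simplified Python sketch. The function assign_partitions and the consumer names are hypothetical, and the round-robin dealing is a stand-in for Kafka's real assignment strategies.

```python
def assign_partitions(partitions, consumers):
    """Deal the partitions out to the consumers of one group, in turn."""
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        assignment[ordered[index % len(ordered)]].append(partition)
    return assignment


# Three consumers in one group, topic with three partitions: one each.
print(assign_partitions([0, 1, 2], ["c1", "c2", "c3"]))
# → {'c1': [0], 'c2': [1], 'c3': [2]}

# One consumer stops: the remaining two take over all partitions.
print(assign_partitions([0, 1, 2], ["c1", "c3"]))
# → {'c1': [0, 2], 'c3': [1]}

# More consumers than partitions: the extra consumer stays idle.
print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"]))
# → {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
```

Each partition always has exactly one owner within the group, which is why adding consumers beyond the partition count does not increase parallelism.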

Run ZooKeeper and the server, then list the topics:

zookeeper-server-start.bat ..\..\config\zookeeper.properties

kafka-server-start.bat ..\..\config\server.properties

kafka-topics.bat --bootstrap-server localhost:9092 --list

Image description

Now let's look at consumer groups. Start a consumer:

kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic my-topic --from-beginning

As soon as you start the consumer, it joins a consumer group with a new, automatically generated ID. List the groups to see it:

kafka-consumer-groups.bat --bootstrap-server localhost:9092 --list

Image description

Start another consumer the same way; it creates another consumer group, which also appears in the list:

Image description

Describe the topic to check its partitions:

kafka-topics.bat --describe --topic my-topic --bootstrap-server localhost:9092

Image description

Start a producer:

kafka-console-producer.bat --bootstrap-server localhost:9092 --topic my-topic

Then start 3 consumers with the same command in 3 different prompts, so that all of them join the same group:

kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic my-topic --group console-consumer-93231

Now we have 3 consumers and 1 producer.

  1. I produce the values 1, 2, 3, 4 -> the consumer in the upper-right corner of the image consumes them.
  2. Then I stop the consumer in the lower-left corner -> the group rebalances and its partition is handed to one of the remaining consumers.
  3. Then I produce the values 5, 6, 7, 8 -> the consumer in the lower-right corner consumes them.

Image description


How to Set Up a Kafka Cluster with Three Brokers: A Step-by-Step Guide

Kafka, a powerful distributed event streaming platform, works by allowing applications to publish and subscribe to streams of records in real-time. To understand how to efficiently scale Kafka for production environments, it's crucial to set up a Kafka cluster with multiple brokers. In this blog, we’ll walk you through the process of setting up a Kafka cluster with three brokers.

Step 1: Understanding Kafka Clusters
A Kafka cluster is essentially a collection of Kafka brokers that work together to provide a highly available and fault-tolerant messaging system. Each broker manages a portion of the data, with topics divided into partitions across the brokers in the cluster. Replication ensures that each partition is copied across multiple brokers for fault tolerance.

Step 2: Setting Up the Broker Configuration
Starting a Kafka broker requires a server.properties file for each broker in the cluster. Here’s how you can configure three brokers:

1.Create Three Config Files: For three brokers, you need three different server.properties files, each with its own values for:

  • Broker ID (0, 1, 2), set with broker.id
  • Port number for communication, set with listeners
  • Log directory for storing the partition data, set with log.dirs

Example:

  • Broker 0: server.properties with broker ID 0, port 9092, and a unique log directory.
  • Broker 1: server1.properties with broker ID 1, port 9093, and another log directory.
  • Broker 2: server2.properties with broker ID 2, port 9094, and a separate log directory.
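
As a rough sketch, the three settings that differ between the files could be generated like this in Python. The helper broker_properties is hypothetical, the log directory paths are assumptions based on the default /tmp/kafka-logs location, and a real server.properties contains many more settings that are left unchanged.

```python
def broker_properties(broker_id, port, log_dir):
    """Return the per-broker lines that must differ between config files."""
    return "\n".join([
        f"broker.id={broker_id}",
        f"listeners=PLAINTEXT://:{port}",
        f"log.dirs={log_dir}",
    ])


# File name, broker ID, port, log directory (paths are assumptions).
brokers = [
    ("server.properties", 0, 9092, "/tmp/kafka-logs"),
    ("server1.properties", 1, 9093, "/tmp/kafka-logs1"),
    ("server2.properties", 2, 9094, "/tmp/kafka-logs2"),
]

for file_name, broker_id, port, log_dir in brokers:
    print(f"# {file_name}")
    print(broker_properties(broker_id, port, log_dir))
    print()
```

In practice you would copy the original server.properties twice and change only these three lines in each copy.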

Image description

2.Configure the Brokers: Each file must specify a unique broker ID, port, and log directory. You can set these by opening the files in a text editor like Visual Studio Code.

3.Start the Brokers: Once the configuration files are ready, start each broker by running the Kafka server script with its own configuration file, each in a separate command prompt:

Broker 0: kafka-server-start.bat ..\..\config\server.properties
Broker 1: kafka-server-start.bat ..\..\config\server1.properties
Broker 2: kafka-server-start.bat ..\..\config\server2.properties

Image description

After starting the brokers, you will see them running on their respective ports, ready to accept connections.

Step 3: Creating Topics and Setting Replication Factor
Once the brokers are running, it’s time to create a Kafka topic and configure its replication factor. A replication factor defines how many copies of a topic’s partitions are maintained across the brokers.

ZooKeeper must already be running; if it is not, start it before the brokers:

zookeeper-server-start.bat ..\..\config\zookeeper.properties

Here’s how you can create a topic with a replication factor of three and three partitions:

kafka-topics.bat --create --topic gadgets --bootstrap-server localhost:9092,localhost:9093,localhost:9094 --replication-factor 3 --partitions 3

Image description

In this example, the replication factor of three ensures that the topic's data is replicated across all three brokers in the cluster. This is essential for fault tolerance, as it allows Kafka to maintain data availability even if one broker fails.
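
To see what a replication factor of three means for three partitions on three brokers, here is a simplified Python sketch of replica placement. The function place_replicas is hypothetical and the rotation scheme is an assumption for illustration; Kafka's real assignment also randomises the starting broker and can take racks into account.

```python
def place_replicas(num_partitions, replication_factor, broker_ids):
    """Assign each partition's replicas to consecutive brokers, rotating the start."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    placement = {}
    for partition in range(num_partitions):
        placement[partition] = [
            broker_ids[(partition + step) % len(broker_ids)]
            for step in range(replication_factor)
        ]
    return placement


for partition, replicas in place_replicas(3, 3, [0, 1, 2]).items():
    # In this sketch the first replica in each list is the preferred leader.
    print(f"partition {partition}: replicas on brokers {replicas}")
# → partition 0: replicas on brokers [0, 1, 2]
# → partition 1: replicas on brokers [1, 2, 0]
# → partition 2: replicas on brokers [2, 0, 1]
```

With a replication factor equal to the number of brokers, every broker holds a copy of every partition, and the rotation spreads the leaders evenly.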

Step 4: Producing and Consuming Messages
After creating the topic, you can test the setup by producing and consuming messages.

1.Producer: Use the Kafka producer to send messages to the topic:

kafka-console-producer.bat --bootstrap-server localhost:9092,localhost:9093,localhost:9094 --topic gadgets

You can send messages such as Hello, Laptop, Mouse, and Monitor, and they will be published to the topic’s partitions.

2.Consumer: Use the Kafka consumer to read messages from the topic:

kafka-console-consumer.bat --bootstrap-server localhost:9092,localhost:9093,localhost:9094 --topic gadgets --from-beginning

Image description

In C:\tmp\kafka-logs you will see that 3 folders have been created for the topic, and the same in C:\tmp\kafka-logs1 and C:\tmp\kafka-logs2.

Image description

As the folders show, the data is replicated across the brokers. Because the topic has a replication factor of three on a three-broker cluster, every partition has a replica on each of Brokers 0, 1, and 2. For each partition, one of those replicas is the leader and the other two are followers.

Step 5: Understanding In-Sync Replicas (ISR)
In Kafka, the In-Sync Replica (ISR) set is a critical concept. It refers to the replicas of a given partition, leader included, that are fully caught up with the leader of that partition. By default, the partition leader handles all read and write requests for that partition.

1.How ISR Works: Each partition in Kafka has one leader and zero or more followers (replicas). The leader receives the writes, and the followers copy them from the leader. A follower that falls too far behind is dropped from the ISR and rejoins once it has caught up.

2.Failover Mechanism: If the broker hosting a partition's leader fails, Kafka automatically promotes one of the in-sync followers to be the new leader, so the partition stays available. This is critical for maintaining high availability and reliability.

You can check the ISR status of a topic using the following command:

kafka-topics.bat --describe --topic gadgets --bootstrap-server localhost:9092,localhost:9093,localhost:9094

Image description

Step 6: Testing Failover
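To test the failover mechanism, you can shut down one of the brokers in the cluster and run the describe command again. The stopped broker disappears from the ISR of every partition, and any partition it was leading gets a new leader chosen from the remaining in-sync replicas. When the broker is restarted, it catches up and rejoins the ISR.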

Image description
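
The failover test can be mimicked with a small Python simulation. The fail_broker function and the partition table are hypothetical; picking the first remaining in-sync replica as the new leader is a simplification of the election Kafka actually performs.

```python
def fail_broker(partitions, failed_broker):
    """Update the partition table in place after a broker stops.

    The failed broker leaves every ISR; partitions it was leading
    get a new leader taken from the remaining in-sync replicas.
    """
    for state in partitions.values():
        state["isr"] = [b for b in state["isr"] if b != failed_broker]
        if state["leader"] == failed_broker:
            state["leader"] = state["isr"][0] if state["isr"] else None
    return partitions


# Three partitions, each replicated on brokers 0, 1, and 2.
partitions = {
    0: {"leader": 0, "isr": [0, 1, 2]},
    1: {"leader": 1, "isr": [1, 2, 0]},
    2: {"leader": 2, "isr": [2, 0, 1]},
}

fail_broker(partitions, failed_broker=1)
for partition, state in partitions.items():
    print(partition, state)
# → 0 {'leader': 0, 'isr': [0, 2]}
# → 1 {'leader': 2, 'isr': [2, 0]}
# → 2 {'leader': 2, 'isr': [2, 0]}
```

Only partition 1 changes leader, because it was the only one led by the stopped broker; the other two simply lose one in-sync replica.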

Conclusion
In this tutorial, we’ve successfully set up a Kafka cluster with three brokers, created a topic with a replication factor of three, and tested producing and consuming messages. Understanding how Kafka manages replication, partitioning, and failover is essential for building reliable and scalable event-driven systems. With the power of Kafka's fault tolerance mechanisms, you can confidently deploy Kafka clusters to handle high-throughput, real-time data streams.

You can continue learning by exploring the next video on the importance of In-Sync Replicas (ISR) in Kafka, which further explains how Kafka ensures data consistency and availability across brokers.


I have explained the key topics listed above, but here’s a brief summary of each one for clarity:

1.Kafka Cluster, Kafka Broker, Producer, Consumer:

Kafka Cluster: A group of Kafka brokers that work together to handle large streams of data. A Kafka cluster allows distributed processing and scaling.
Kafka Broker: A Kafka broker is a server in the Kafka ecosystem that stores data and serves client requests (like producers and consumers).
Producer: A producer sends messages to Kafka topics. It can write to multiple partitions of a topic.
Consumer: A consumer reads messages from Kafka topics. Consumers can join together in consumer groups to distribute the processing of messages.

2.Kafka Topic and Partition:

Topic: A category or feed name to which messages are sent by producers and from which consumers read. Topics are the main mechanism Kafka uses to organize messages.
Partition: A topic can have multiple partitions to allow parallel processing. Each partition is an ordered log that is replicated according to the topic's replication factor, and partitions help distribute data across Kafka brokers.

3.How to send Kafka message from command line (With Key):

Kafka provides command-line tools (such as kafka-console-producer and kafka-console-consumer) that allow users to send and consume messages. Producers can include a key for messages, which is used for routing to specific partitions.

4.Understanding Consumer Offset, Consumer Groups, and...

Consumer Offset: Kafka keeps track of each consumer's progress using an offset, which indicates the position in the log (the message the consumer is currently reading).
Consumer Groups: A group of consumers that work together to consume data from topics. Kafka ensures that each partition is read by only one consumer in a group at a time. If multiple consumers are in a group, the topic’s partitions are split between them.

5.Master the Art of Kafka: A Step-by-Step Consumer Offset and...

This would involve understanding how to manage consumer offsets, either by relying on Kafka’s default offset management or manually committing offsets based on business logic. Proper offset management ensures that consumers can resume reading from the correct point after a failure.

6.Kafka Fundamentals: Understanding Segments,...

Kafka stores each partition's log on disk as a sequence of segment files. New messages are appended to the active segment, and a new segment is started once the current one reaches a configured size or age. Over time, old segments are deleted based on configuration settings such as the retention policy.
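
The segment idea can be sketched in Python as follows. The helpers append and apply_retention are hypothetical, and counting records and segments is a simplification: Kafka rolls and deletes segments based on bytes and time rather than record counts.

```python
def append(segments, record, max_records_per_segment):
    """Append to the active (last) segment, rolling a new one when it is full."""
    if not segments or len(segments[-1]["records"]) >= max_records_per_segment:
        if segments:
            base = segments[-1]["base_offset"] + len(segments[-1]["records"])
        else:
            base = 0
        # Each segment is identified by the offset of its first record.
        segments.append({"base_offset": base, "records": []})
    segments[-1]["records"].append(record)


def apply_retention(segments, max_segments):
    """Delete the oldest segments, always keeping at least the active one."""
    while len(segments) > max(max_segments, 1):
        segments.pop(0)


segments = []
for number in range(7):
    append(segments, f"record-{number}", max_records_per_segment=3)

print([s["base_offset"] for s in segments])  # → [0, 3, 6]

apply_retention(segments, max_segments=2)
print([s["base_offset"] for s in segments])  # → [3, 6]
```

Deleting whole segments is cheap because it removes entire files instead of rewriting the log, and offsets of the surviving records never change.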

7.How to Make a Kafka Cluster with 3 Brokers: Understand Replication Factor

This involves setting up a Kafka cluster with multiple brokers (like 3), where each broker stores a portion of the data. The replication factor determines how many copies of each partition will exist across the brokers, which ensures high availability and fault tolerance.

8.ISR in Kafka (In Sync Replica):

ISR (In-Sync Replicas) refers to the set of replicas for a partition that are fully caught up with the leader replica (i.e., they have the same data). By default, only replicas in the ISR are eligible to become the leader of a partition.

These topics form the core knowledge required to understand and work with Apache Kafka, whether you're setting up a Kafka cluster, producing and consuming messages, or dealing with more advanced concepts like consumer groups, replication, and offsets.

If you would like to dive deeper into any specific topic or need more detailed examples, you can see this blog (Kafka Producer and Consumer Example in .NET 6 with ASP.NET Core)!

Happy Learning!
