How to Upgrade Kafka from 1.1.1 with Zero Downtime: A Practical Approach
As a data engineer or, more specifically, a data platform engineer, you may inherit a service that many other systems depend on. Upgrading such a service is a terrifying process. Suppose that service is Kafka, and it's the main component of your company's data stack. Ignoring the complexity isn't a solution, though, because every bug fix and new feature can save you from downtime and improve the performance of your services. So, what is the solution? How can we ensure that all services depending on Kafka keep working after the upgrade? In this post, I will share my experience with this process.
Main concerns
When we talk about services like Kafka, we know there are many producers and consumers in between. So, what happens to them after an upgrade? Do they continue to produce and consume? What about the schema registry and the other components that depend on Kafka? So, one of the main concerns is the health of the dependent components.
Also, we want to upgrade Kafka across two major versions; how should we check for deprecated configs? Should I read all the changelogs one by one? There is a better approach that minimizes both the time spent and the probability of downtime.
Proposed approach
Honestly, every time I think about Docker, I marvel at what a beautiful tool it is :D Amazingly, you can spin up a whole stack, on its own isolated network, with tools like docker-compose.
A better approach is to use Docker to simulate the production services in a safe environment: set up the whole stack with the same configs but fewer resources, simulate the upgrade, and check each component's behavior.
Applied approach for Kafka
To simulate the upgrade process for Kafka, I created a stack including these components:
- Zookeeper Instances -> Coordinator for the Kafka cluster
- Kafka Instances -> Main component
- Schema Registry -> Persists the schema of produced messages
- Kafka UI -> Monitors the Kafka cluster and shows incoming messages in topics
- Producer -> Python code that produces data into a Kafka topic in Avro format
- Consumer -> Python code that consumes the data produced by the Producer
- Clickhouse -> Analytical database that stores the data coming from Kafka
- Postgres -> OLTP database that stores transactional data
- Postgres Producer -> Python code that inserts one record every 0.1 seconds into the Postgres database
- Debezium -> Captures each change in Postgres and sends it to the corresponding Kafka topic in Avro format
Now, it's time to prepare the appropriate docker-compose.yaml
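To give a feel for how the pieces fit together, here is a minimal sketch of what such a compose file could look like. The service names, image tags, and file names below are assumptions for illustration, not the project's actual `docker-compose.yaml`:

```yaml
# Illustrative fragment only -- names and tags are assumptions.
services:
  zoo1:
    image: zookeeper:3.6
  kafka1:
    image: bitnami/kafka:1.1.1
    env_file: server.env          # generated from server.properties
    depends_on:
      - zoo1
  schema-registry:
    image: confluentinc/cp-schema-registry:5.3.1
    depends_on:
      - kafka1
  producer:
    build:
      context: .
      dockerfile: Dockerfile-Python
    depends_on:
      - schema-registry
```

The real file additionally carries healthchecks and `depends_on` conditions, as described in the notes below.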
Implementation details
- Zookeeper
  - Image: Official Image
  - Configs:
    - `zoo.cfg` mounted into the container
    - `myid` and the data directory created using `zookeeper_conf_creator.py`
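The post doesn't show `zookeeper_conf_creator.py` itself; a minimal sketch of what such a generator might do is below. The function name, directory layout, and node count are assumptions for illustration:

```python
# Hypothetical sketch of a zookeeper_conf_creator.py-style helper:
# for each Zookeeper instance, create its data directory and write the
# `myid` file that identifies the node within the ensemble.
import os


def create_zookeeper_dirs(base_dir: str, num_nodes: int) -> list:
    """Create data dirs with `myid` files for an ensemble of num_nodes."""
    created = []
    for node_id in range(1, num_nodes + 1):
        data_dir = os.path.join(base_dir, f"zoo{node_id}", "data")
        os.makedirs(data_dir, exist_ok=True)
        with open(os.path.join(data_dir, "myid"), "w") as f:
            f.write(str(node_id))  # each node's myid must be unique
        created.append(data_dir)
    return created
```

Each generated data directory is then mounted into the matching Zookeeper container.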
- Kafka
  - Image: Bitnami Image (https://hub.docker.com/r/bitnami/kafka), customized by `Dockerfile-Kafka`
  - Configs:
    - Set as environment variables
    - `server.properties` converted to `server.env` using `kafka_env_creator.py`
    - This image didn't support `SCRAM-SHA` for authentication, so `libkafka.sh` (Bitnami's Kafka library) was rewritten.
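The conversion from `server.properties` to `server.env` can be sketched as below. The `KAFKA_CFG_` prefix follows Bitnami's convention for mapping environment variables to `server.properties` keys; the function name is an assumption, not the actual `kafka_env_creator.py`:

```python
# Hypothetical sketch of kafka_env_creator.py: turn server.properties
# lines into Bitnami-style environment variables (dots -> underscores,
# uppercased, prefixed with KAFKA_CFG_).
def properties_to_env(properties_text: str, prefix: str = "KAFKA_CFG_") -> list:
    env_lines = []
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        env_key = prefix + key.strip().replace(".", "_").upper()
        env_lines.append(f"{env_key}={value.strip()}")
    return env_lines
```

The resulting lines are written to `server.env` and loaded via `env_file` in the compose file.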
- Schema Registry
  - Image: Official Image
  - Configs:
    - Set as environment variables
    - `schema-registry.properties` converted to `schema-registry.env` using `schema_registry_config_creator.py`
- Kafka UI
  - Image: Official Image
  - Configs: Set as environment variables directly in `docker-compose.yaml`
- Producer and Consumer
  - Image: Official Python Image, customized by `Dockerfile-Python`
  - Code: `producer.py` and `consumer.py`
  - Configs: Set as environment variables directly in `docker-compose.yaml`
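Since the client configs arrive as environment variables, the Python code presumably assembles its Kafka client config from the environment. A minimal sketch, where the variable names and defaults are assumptions rather than the actual `producer.py`:

```python
import os


# Hypothetical sketch: build a Kafka client config dict from environment
# variables, the way producer.py/consumer.py might when configured via
# docker-compose. Variable names here are illustrative.
def client_config_from_env(env=None):
    env = env if env is not None else dict(os.environ)
    config = {
        "bootstrap.servers": env.get("KAFKA_BOOTSTRAP_SERVERS", "kafka1:9092"),
        "security.protocol": env.get("KAFKA_SECURITY_PROTOCOL", "SASL_PLAINTEXT"),
        "sasl.mechanism": env.get("KAFKA_SASL_MECHANISM", "SCRAM-SHA-256"),
    }
    # Only pass credentials through when they are actually set.
    if "KAFKA_USERNAME" in env:
        config["sasl.username"] = env["KAFKA_USERNAME"]
        config["sasl.password"] = env.get("KAFKA_PASSWORD", "")
    return config
```

A dict like this would then be handed to the Kafka client library the code uses.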
- Clickhouse
  - Image: Official Image
  - Tables: table DDLs defined here and then mounted into `/docker-entrypoint-initdb.d`
    - For each table in Postgres, three tables are defined:
      - Base table -> data persists here
      - Kafka table -> reads data from Kafka
      - Materialized view -> ships data from the Kafka table into the base table
  - Configs: default configs are used; only `kafka.xml` is mounted into `/etc/clickhouse-server/config.d/kafka.xml`
  - Logs: for debugging purposes, logs are mounted into a local directory
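The base/Kafka-table/materialized-view trio is a standard ClickHouse pattern for consuming from Kafka. A hedged sketch for a hypothetical `sales` table follows; the column names, topic, and consumer group are illustrative, not the project's actual DDL:

```sql
-- Illustrative DDL, not the project's actual tables.
-- 1. Base table: where data finally persists.
CREATE TABLE sales_base (id UInt64, amount Float64)
ENGINE = MergeTree ORDER BY id;

-- 2. Kafka table: a streaming "view" over the topic.
CREATE TABLE sales_kafka (id UInt64, amount Float64)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka1:9092',
         kafka_topic_list  = 'sales',
         kafka_group_name  = 'clickhouse_sales',
         kafka_format      = 'AvroConfluent';

-- 3. Materialized view: continuously ships rows from the Kafka
--    table into the base table.
CREATE MATERIALIZED VIEW sales_mv TO sales_base
AS SELECT id, amount FROM sales_kafka;
```

With `AvroConfluent`, ClickHouse resolves schemas via the schema registry URL configured in `kafka.xml`.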
- Postgres
  - Image: Debezium Example Image
  - This Postgres contains sample sales data.
- Postgres Producer
  - Same as Producer and Consumer, but this code is used
- Debezium
  - Image: Official Image
  - Configs:
    - Set as environment variables
    - `kafka-connect.properties` converted to `kafka-connect.env` using `kafka_connect_config_generator.py`
Some Extra Containers
- kafka-setup-user
  - Uses the same image as Kafka; it runs after `kafka1` becomes healthy. Some users are created after this container runs (exits with status 0). See them here.
  - It needs one Kafka broker and also the Zookeeper cluster, because `SCRAM-SHA` credentials need to persist in Zookeeper.
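The user-creation step likely relies on `kafka-configs.sh`, which, on Zookeeper-based Kafka versions, writes SCRAM credentials directly to Zookeeper (which is why the setup container needs the Zookeeper cluster). A sketch with assumed names and password:

```shell
# Hypothetical example: create a SCRAM-SHA-256 user named "appuser".
# On Kafka versions that store SCRAM credentials in Zookeeper, the tool
# talks to Zookeeper directly.
kafka-configs.sh --zookeeper zoo1:2181 \
  --alter --add-config 'SCRAM-SHA-256=[password=app-secret]' \
  --entity-type users --entity-name appuser
```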
- kafka-setup-topic
  - Uses the same image as Kafka and creates some topics. See the list here.
- submit-connector
  - Uses the curl image to submit this connector to Debezium. The connector captures the changes in Postgres and sends events to Kafka, and then Clickhouse consumes the data into the appropriate tables.
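For reference, a Debezium Postgres connector submission typically looks like the following. The connector name, credentials, and converter settings here are assumptions for illustration, not the project's actual connector:

```json
{
  "name": "sales-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "postgres",
    "database.server.name": "pg",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```

It would be POSTed to Kafka Connect's REST API, e.g. `curl -X POST -H "Content-Type: application/json" --data @connector.json http://debezium:8083/connectors`.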
Some Extra Notes:
- The versions of all containers are defined in the `.env` file; you can change them there.
- Container dependencies are defined accurately. So, if one container depends on another to come up, appropriate `healthcheck` and `depends_on` conditions are defined for it.
- If you take a look at the `healthcheck` of the containers, for example Kafka, you see this command:
`(echo > /dev/tcp/kafka1/9092) &>/dev/null && exit 0 || exit 1`
This shell one-liner checks a TCP port from inside a container without `telnet`.
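The `/dev/tcp/<host>/<port>` form is a bash built-in, not a real file. The same check can be expressed in the stack's own language, Python, with the standard `socket` module; host and port below are placeholders:

```python
import socket


# Equivalent of `(echo > /dev/tcp/host/port)`: succeed if and only if a
# TCP connection to host:port can be established within the timeout.
def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```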
Simulation Process
To run the simulation, you can follow these steps.
Result
All tests were successful. By successful, I mean the producer could still produce messages without errors, and the consumers could consume them without errors. No other criteria were investigated; you can define your own metrics for this simulation. Only one problem was seen in this process.
Problems:
- In `Setup Kafka User`: `java.lang.ClassNotFoundException: kafka.security.auth.SimpleAclAuthorizer` occurred.
  - This class was deprecated after 2.4.0. See here.
  - The docs recommend using `kafka.security.authorizer.AclAuthorizer` instead. It's fully compatible with the deprecated class, so it was swapped in the docker-compose file and it worked.
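With the Bitnami env-var convention, swapping in the replacement authorizer is a one-line change in the compose environment. The exact variable name follows Bitnami's `KAFKA_CFG_` mapping and should be treated as a sketch:

```yaml
# docker-compose.yaml fragment (illustrative):
services:
  kafka1:
    environment:
      # kafka.security.auth.SimpleAclAuthorizer was deprecated; use the
      # compatible replacement introduced in Kafka 2.4:
      KAFKA_CFG_AUTHORIZER_CLASS_NAME: kafka.security.authorizer.AclAuthorizer
```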
Conclusion:
- Since there is official documentation for upgrading from any version to 3.6.1 (and other previous versions), there is no obstacle in this process. Also, our test shows this process works, and we can upgrade our Kafka to whatever version we want.
Conclusion
This article suggests an approach for upgrading services that many other systems depend on. We covered the details of implementing the process, and, as we saw in the Result section, one problem was found before the actual upgrade, so we could upgrade our Kafka cluster seamlessly, with zero downtime :)