Announcing WarpStream Schema Validation

by Brian Shih · Published July 29, 2024

Why do we need schemas?

Schemas in Apache Kafka® let operators ensure that their data conforms to an expected structure and help prevent data quality and compliance issues, such as rogue producers writing records to Kafka topics where they don't belong. This is especially important when downstream applications expect to receive only records that conform to a specific schema and format, e.g., ETL applications that write to a database.

Schemas have become ubiquitous in Kafka and are a key component of any data governance, compliance, and platform management regime.

How Schemas Work in Kafka

Historically, schemas in Kafka have been implemented as a client-side feature to reduce the load on the stateful Kafka brokers. Kafka uses an external server (a Schema Registry) to store schemas, and the producer client periodically polls the registry and caches the schemas and their IDs. Before writing to Kafka, the producer client serializes the data and validates that the record is compatible with the schema retrieved from the Schema Registry. If the record is incompatible with the schema, the serializer throws an error and the producer does not send the record to Kafka, which protects against that producer writing incorrect data. If the record is compatible, the producer prepends the schema ID to the serialized record and writes it to Kafka. On the other side, consumers look up the schema referenced by the record's schema ID in the Schema Registry; if the record is compatible with that schema, the consumer deserializes the record, and if not, the consumer throws an error.
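To make the client-side flow concrete, here is a minimal sketch using the confluent-kafka Python client. This example is not from the original post; the broker address, registry URL, topic name, and Payment schema are placeholder assumptions.

    from confluent_kafka import Producer
    from confluent_kafka.schema_registry import SchemaRegistryClient
    from confluent_kafka.schema_registry.avro import AvroSerializer
    from confluent_kafka.serialization import SerializationContext, MessageField

    # Placeholder Avro schema, for illustration only.
    schema_str = """
    {
      "type": "record",
      "name": "Payment",
      "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"}
      ]
    }
    """

    # The serializer registers/fetches the schema in the registry and caches its ID.
    registry = SchemaRegistryClient({"url": "http://localhost:8081"})
    serializer = AvroSerializer(registry, schema_str)
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    record = {"id": "abc-123", "amount": 19.99}

    # Serialization fails here, client-side, if the record does not match the
    # schema; otherwise the payload is prefixed with a magic byte and the
    # 4-byte schema ID before being produced to Kafka.
    payload = serializer(record, SerializationContext("payments", MessageField.VALUE))
    producer.produce("payments", value=payload)
    producer.flush()

Note that every check above happens inside the producer process; the broker itself never inspects the payload.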

While this implementation satisfies the basic requirement to add schemas to records in Kafka, it lacks broker-side validation, meaning schemas are entirely a client-side feature. That’s problematic because it relies on clients to always do the right thing. The Kafka broker will happily accept whatever it’s given by any Kafka client, so while the client can validate that its own writes and reads conform to the proper schema, there is nothing that prevents another client from writing data that does not conform to the schema defined by a well-behaved producer. Broker-side validation is necessary to implement centralized data governance policies.
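The sketch below (again illustrative, reusing the assumed broker address and topic from the previous example) shows how little it takes for a misbehaving client to sidestep the schema entirely; a vanilla Kafka broker accepts the write without complaint.

    from confluent_kafka import Producer

    # A "rogue" producer that bypasses the Schema Registry entirely. The bytes
    # carry no schema ID and do not conform to the topic's intended schema,
    # yet the broker stores them just like any other record.
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    producer.produce("payments", value=b"not avro, not json, just bytes")
    producer.flush()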

Various data governance products have been launched that enable the Kafka broker to do some schema validation; however, these features are limited in their utility because they can only verify that a record's schema ID matches a schema ID registered in the Schema Registry, not that the record itself actually matches the expected schema.

Announcing WarpStream Schema Validation

WarpStream is excited to announce that users can now connect WarpStream Agents to any Kafka-compatible Schema Registry and validate that records conform to the provided schema!


Figure: The WarpStream Agent connects to an external Schema Registry.

WarpStream Schema Validation validates not only that the schema ID encoded in a given record matches a schema ID in the Schema Registry, but also that the record actually conforms to the provided schema. In addition, WarpStream Schema Validation adds a “warning-only” configuration property which, when enabled, emits a metric identifying records that would have been rejected instead of rejecting them, making it easier to test and monitor schema migrations without implementing separate dead-letter queues or risking data loss. WarpStream Schema Validation is built into the WarpStream Agent, so the Agent performs validation in the customer’s environment and no data is exfiltrated.

To connect the WarpStream Agents to a Schema Registry, specify the optional -schemaRegistryURL flag in the Agent configuration. WarpStream supports Basic, TLS, and mTLS authentication between the Agent and the Schema Registry.

Once the Agents are connected to a compatible Schema Registry, WarpStream Schema Validation can be enabled through topic-level configurations, which are described in the WarpStream documentation.

Enabling record-level validation with an external schema registry increases the CPU load for the Agents. But unlike Kafka brokers, which cannot be auto-scaled without significant operational toil and risk of data loss, WarpStream Agents are completely stateless and can be scaled elastically based on basic parameters like CPU utilization. This means that, unlike Kafka, a WarpStream cluster can be scaled automatically on the fly, so there’s no need to permanently provision more Agents in anticipation of increased CPU utilization.

In addition, using Agent Roles, WarpStream can isolate parts of a workload to a specific set of Agents, which reduces the impact of increased load caused by Schema Validation. Schema Validation uses the proxy-produce role, so Agents handling Produce() requests can be isolated from the rest of the cluster and scaled independently.

Currently, WarpStream supports validating JSON and Avro schemas, with Protobuf support coming soon. While the current implementation of WarpStream Schema Validation relies on external Schema Registries, we are also working on a WarpStream-native schema registry.

To learn more about WarpStream Schema Validation, read the docs, or contact us.

Create a free WarpStream account and start streaming with $400 in free credits. Get Started!
