How to Deliver Large Messages in Kafka (3 Approaches)

Updated: January 31, 2024 By: Guest Contributor

Introduction

Apache Kafka is a robust distributed event streaming platform that is widely used for building real-time data pipelines and streaming applications. It is designed to handle high volumes of data efficiently. However, when dealing with exceptionally large messages, the default Kafka configuration may not be adequate. In this guide, we will explore strategies and best practices for delivering large messages in Kafka.

Before diving into the solutions, it is important to understand that Kafka caps message size by default. The cap is governed by the message.max.bytes setting in the broker configuration and max.request.size in the producer configuration (consumers have a corresponding max.partition.fetch.bytes setting). The default is roughly 1 MB, which keeps the memory and latency overhead of large payloads in check. Increasing these limits is possible, but it raises memory requirements and can add latency.
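If raising these limits is the right call for your workload, the producer side can be adjusted as in the sketch below (the 10 MB figure is an arbitrary example, not a recommendation); the broker's message.max.bytes (or the topic-level max.message.bytes) and the consumer's max.partition.fetch.bytes must be raised to match.

import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
// Allow producer requests up to ~10 MB; must not exceed the broker's message.max.bytes
props.put("max.request.size", 10 * 1024 * 1024);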

Handling Large Messages in Kafka

To deal with large messages, we can employ various methods:

Message Compression

Producers can compress messages before they are sent to Kafka. Compression saves bandwidth and storage; standard Kafka consumer clients decompress batches transparently, at the cost of extra CPU on both ends, and every consumer client must support the chosen codec (zstd, for example, requires clients on version 2.1 or newer). Kafka supports gzip, snappy, lz4, and zstd compression types out of the box.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
props.put("compression.type", "gzip"); // Options: 'none', 'gzip', 'snappy', 'lz4', 'zstd'

Producer<String, byte[]> producer = new KafkaProducer<>(props);

Segmentation and Reassembly

For messages that exceed the broker's or network's cap, you can break the message into smaller parts and reassemble them on the consumer side. This requires additional logic in both the producer and the consumer to segment, track, and reassemble messages; in practice, all segments of a message should share a key so they land on the same partition and arrive in order.

Here is a simple example of producing a segmented message:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.UUID;

byte[] largeMessage = ... // your large message
int segmentSize = ... // segment size less than 'message.max.bytes'
int totalSegments = (int) Math.ceil(largeMessage.length / (double) segmentSize);
String messageId = UUID.randomUUID().toString(); // shared key keeps all segments on one partition, in order

for (int i = 0; i < totalSegments; i++) {
    int start = i * segmentSize;
    int end = Math.min(start + segmentSize, largeMessage.length);
    byte[] segment = Arrays.copyOfRange(largeMessage, start, end);
    ProducerRecord<String, byte[]> record = new ProducerRecord<>("your_topic", messageId, segment);
    // Headers tell the consumer which segment this is and how many to expect
    record.headers().add("segmentIndex", Integer.toString(i).getBytes(StandardCharsets.UTF_8));
    record.headers().add("totalSegments", Integer.toString(totalSegments).getBytes(StandardCharsets.UTF_8));
    producer.send(record);
}

The code above splits a large message into segments and sends each one as an individual Kafka record, with headers identifying the segment's position and the total count. The consumer collects these segments and reconstructs the original message once all of them have arrived.
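For completeness, here is a minimal reassembly sketch for the consumer side. It assumes the segmentIndex/totalSegments headers and per-message key from the producer example above, and that a KafkaConsumer<String, byte[]> named consumer is already subscribed to the topic; error handling and timeouts for missing segments are omitted.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import org.apache.kafka.clients.consumer.ConsumerRecord;

Map<String, byte[][]> pending = new HashMap<>();

for (ConsumerRecord<String, byte[]> rec : consumer.poll(Duration.ofMillis(500))) {
    int index = Integer.parseInt(new String(rec.headers().lastHeader("segmentIndex").value(), StandardCharsets.UTF_8));
    int total = Integer.parseInt(new String(rec.headers().lastHeader("totalSegments").value(), StandardCharsets.UTF_8));

    // Buffer segments per message key until all of them have arrived
    byte[][] parts = pending.computeIfAbsent(rec.key(), k -> new byte[total][]);
    parts[index] = rec.value();

    if (Arrays.stream(parts).allMatch(Objects::nonNull)) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : parts) out.write(part, 0, part.length);
        byte[] original = out.toByteArray(); // the reassembled large message
        pending.remove(rec.key());
        // hand 'original' off to your processing logic here
    }
}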

External Storage Reference

Another solution is to store the message in an external storage system (e.g., Amazon S3 or HDFS) and pass a reference to it (e.g., a URI) in the Kafka message, allowing consumers to fetch the actual payload when they read the record.

// Note: unlike the byte[] producer above, this producer must be configured
// with a StringSerializer for values.
String messageUri = uploadToExternalStorage(largeMessage); // Implement this method for your storage backend
ProducerRecord<String, String> record = new ProducerRecord<>("your_topic", null, messageUri);
producer.send(record);

This approach keeps large payloads out of Kafka itself, and the storage backend used for large objects can be scaled and managed separately.
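On the consuming side, a minimal sketch might look like the following, where downloadFromExternalStorage is a hypothetical helper wrapping whatever storage client you use (for example, an S3 GetObject call) and the consumer is assumed to use String deserializers:

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;

for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
    String uri = rec.value();                           // the reference stored in Kafka
    byte[] payload = downloadFromExternalStorage(uri);  // hypothetical helper around your storage SDK
    // process 'payload' here
}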

Best Practices for Large Messages

While it’s possible to adjust Kafka to support larger messages, it’s crucial to consider the trade-offs and complexities:

  • Monitor performance, as increasing message size can have an adverse impact on throughput and latency.
  • Be vigilant about broker and consumer memory usage, as larger messages lead to increased memory pressure.
  • Keep the consumer group lag in check because processing larger messages typically takes longer.
  • When storing large messages in external storage, manage the lifecycle of these objects to prevent ever-growing storage costs.

Final Words

In conclusion, while Kafka enforces a default maximum message size, it provides mechanisms to work with larger messages. Whether by compressing or segmenting messages, storing payloads externally and passing references, or raising broker and producer limits, Kafka can be adapted to a wide range of use cases. However, these techniques come with their own set of challenges and should be implemented carefully, keeping the best practices above in mind.

It's always best to conduct performance testing with your specific use case to ensure that Kafka's configuration aligns with your application's needs and delivers the best possible performance and reliability.