How to Manage Offsets in Kafka (with Examples)

Updated: January 30, 2024 By: Guest Contributor

Introduction

Apache Kafka has become the backbone of data processing for modern enterprises, providing a high-throughput, fault-tolerant distributed messaging system that enables real-time data pipelines and streaming applications. Kafka is built around distributed, partitioned logs: each message in a partition is assigned a unique, sequential ID called an offset. Managing offsets correctly is crucial, as they determine what has been consumed and what remains to be processed.

In this tutorial, we’re going to look at how to work with Kafka offsets, covering everything from the basics to more advanced techniques, with practical code examples using Kafka’s Consumer API.

Understanding Kafka Offsets

Every Kafka message within a partition has an associated offset, a long value that indicates its position in the partition. Consumer groups track the offsets of the messages they have processed, which lets a consumer resume exactly where it left off so that messages are neither missed nor needlessly reprocessed.
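
To see offsets in action, here is a minimal sketch (the topic name, group id, and broker address are placeholders) that simply prints the partition and offset of each record it receives:

from kafka import KafkaConsumer

# Create a consumer and print where each record sits in its partition
consumer = KafkaConsumer(
    'topic-name',
    group_id='offset-demo',
    bootstrap_servers=['localhost:9092']
)

for message in consumer:
    # Each record carries its partition and its position (offset) within it
    print(f"partition={message.partition} offset={message.offset}")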

Offsets are not removed when a message is consumed. Instead, Kafka maintains a committed offset for each consumer group. When a consumer in a group has processed messages up to a certain point, it should commit those offsets to Kafka. This helps during system failures, as the consumer can resume from the last committed offset instead of reprocessing the whole partition.

Here’s how you can read messages and commit offsets manually:

from kafka import KafkaConsumer

# Placeholder handler; replace with your application-specific logic
def process_message(message):
    print(message.value)

# Create a Kafka consumer with auto-commit disabled
consumer = KafkaConsumer(
    'topic-name',
    group_id='my-group',
    bootstrap_servers=['localhost:9092'],
    enable_auto_commit=False  # Disable auto-commit
)

# Read and process messages
for message in consumer:
    process_message(message)
    # Commit the offsets of the records returned by the last poll
    consumer.commit()

Here, process_message stands in for whatever function handles your messages (a trivial placeholder is defined in the snippet above). The consumer.commit() call ensures the offsets are committed only after the message has been processed.

Auto-Committing Offsets

By default, Kafka clients commit offsets automatically at a configurable interval. This behavior is controlled by the enable_auto_commit flag, which is True by default, and by the auto_commit_interval_ms setting (5000 ms by default in kafka-python).

Here’s an example:

from kafka import KafkaConsumer

# Create a Kafka consumer with auto commit
consumer = KafkaConsumer(
    'topic-name',
    group_id='my-auto-group',
    bootstrap_servers=['localhost:9092'],
    enable_auto_commit=True,  # Enable auto-commit
    auto_commit_interval_ms=1000  # Commit every second
)

# Read and process messages with auto-commit enabled
for message in consumer:
    process_message(message)

While this is convenient, it’s important to understand that auto-commit can lead to duplicates or lost messages if your application crashes between polls: offsets may be committed for records that were never fully processed (lost messages), or processing may complete before the commit happens (duplicates on restart).
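
A common mitigation, sketched below with placeholder names, is to disable auto-commit and commit synchronously only once a whole batch returned by poll() has been processed; a crash then replays at most one batch instead of silently skipping records:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'topic-name',
    group_id='my-batch-group',
    bootstrap_servers=['localhost:9092'],
    enable_auto_commit=False
)

while True:
    # poll() returns a dict mapping TopicPartition -> list of records
    batch = consumer.poll(timeout_ms=1000)
    for tp, messages in batch.items():
        for message in messages:
            process_message(message)
    if batch:
        # Offsets are committed only after the whole batch is processed
        consumer.commit()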

Advanced Offset Management

Advanced users may want finer control over offsets, for example in synchronization or exactly-once processing scenarios.

To seek to a specific offset:

from kafka import KafkaConsumer, TopicPartition

# Create a Kafka consumer
consumer = KafkaConsumer(
    bootstrap_servers=['localhost:9092'],
    group_id='my-advanced-group',
    auto_offset_reset='earliest'  # Start from the beginning if no offset is stored
)

# Manually assign a specific topic partition (bypasses consumer-group rebalancing)
consumer.assign([TopicPartition('topic-name', 0)])

# Seek to a specific offset (e.g., offset 10)
consumer.seek(TopicPartition('topic-name', 0), 10)

# Now consume from offset 10
for message in consumer:
    process_message(message)

You might also use seek_to_beginning or seek_to_end to quickly go to the beginning or end of the partition.
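
For instance, a brief sketch using the same placeholder topic and partition:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers=['localhost:9092'],
    group_id='my-advanced-group'
)

tp = TopicPartition('topic-name', 0)
consumer.assign([tp])

# Jump to the very first available message...
consumer.seek_to_beginning(tp)

# ...or skip ahead so that only newly arriving messages are consumed
consumer.seek_to_end(tp)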

If you want to commit offsets with additional metadata:

import json
from datetime import datetime

from kafka import TopicPartition, OffsetAndMetadata

...  # Assume consumer instance already created and subscribed

# Process messages and commit offsets with metadata
for message in consumer:
    process_message(message)
    # Commit the next offset to consume (current offset + 1),
    # attaching a metadata string to the commit
    metadata = json.dumps({'processed_date': datetime.now().isoformat()})
    tp = TopicPartition(message.topic, message.partition)
    consumer.commit({tp: OffsetAndMetadata(message.offset + 1, metadata)})

Storing metadata can be helpful for debugging or understanding your consumers’ processing state.
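
To read a commit back, you can use the consumer’s committed() method; in recent kafka-python releases (2.0+) it accepts a metadata flag that returns the stored OffsetAndMetadata rather than a bare offset integer. A sketch, assuming that version:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers=['localhost:9092'],
    group_id='my-advanced-group'
)

tp = TopicPartition('topic-name', 0)

# metadata=True (kafka-python 2.0+) returns OffsetAndMetadata, not an int
committed = consumer.committed(tp, metadata=True)
if committed is not None:
    print(f"offset={committed.offset} metadata={committed.metadata}")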

Conclusion

In this tutorial, we’ve explored the fundamentals of managing offsets in Kafka. We looked at how message consumption is tracked and at the difference between manual and automatic committing. We also delved into advanced offset controls, which allow nuanced control over consumer behavior and improved robustness for your Kafka applications. Understanding and efficiently managing offsets is key to maintaining data integrity and consistency in distributed systems. By mastering these offset patterns, you set a strong foundation for reliable message processing with Kafka.