Understanding Topics and Partitions in Apache Kafka (with Examples)

Updated: January 30, 2024 By: Guest Contributor

Introduction

When it comes to building real-time streaming data pipelines, Apache Kafka stands out as the central service for high-throughput, low-latency messaging. At the heart of Kafka’s design lie topics and partitions, which are pivotal to understanding how Kafka maintains, distributes, and scales data. Here’s a two-part guide, starting from the basics and working up to more advanced uses with illustrative examples.

The Fundamentals of Kafka Topics

In Kafka, a topic is a named channel where producers publish data and from which consumers read. A Kafka cluster maintains logs for each topic’s messages. Topics are designed to be multi-subscriber; that is, they can be consumed by many consumers in real-time, making them suitable for broadcast-like scenarios.

Example: If we have an application that needs to send user-action events, each type of action (click, view, purchase) could be a separate topic.

Creating a Topic

Here’s how you can create a topic named ‘user-actions’ using the Kafka command line (recent Kafka releases talk to the brokers directly via --bootstrap-server; the legacy --zookeeper flag was removed in Kafka 3.0):

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 3 --topic user-actions

This command creates a topic with three partitions and a replication factor of one.
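
If you prefer to manage topics from application code, the same operation is available through Kafka’s Admin API. A minimal Java sketch, assuming a broker at localhost:9092:

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
    // Topic name, partition count, replication factor
    NewTopic topic = new NewTopic("user-actions", 3, (short) 1);
    admin.createTopics(Collections.singleton(topic)).all().get(); // block until the broker confirms
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
}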

Understanding Partitions

Partitions are the subdivisions of a topic’s log. They allow a topic to be parallelized by splitting its data across multiple brokers. Partitions are the unit of parallelism in Kafka; within a partition, messages are strictly ordered and immutable.

Example: The ‘user-actions’ topic can be spread across several partitions, with records keyed by action type, so that multiple consumers can process the stream in parallel.

Inspecting Partitions

You can use the following command to describe the partitions of a topic:

bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic user-actions

This command will output details of the ‘user-actions’ topic, including the number of partitions and their current leader brokers.
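
The output looks roughly like the following (illustrative; the leader, replica, and ISR broker ids depend on your cluster):

Topic: user-actions   PartitionCount: 3   ReplicationFactor: 1   Configs:
    Topic: user-actions   Partition: 0   Leader: 0   Replicas: 0   Isr: 0
    Topic: user-actions   Partition: 1   Leader: 0   Replicas: 0   Isr: 0
    Topic: user-actions   Partition: 2   Leader: 0   Replicas: 0   Isr: 0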

Configuring Partitions

The number of partitions in a topic affects both latency and throughput. More partitions allow greater parallelism and higher throughput, but they also add end-to-end latency and operational complexity (more files, more leader elections, more metadata). Hence, it’s crucial to strike the right balance for your production environment.

Adding Partitions

Suppose you need to scale the ‘user-actions’ topic as your application grows. Here is how to add more partitions:

bin/kafka-topics.sh --alter --bootstrap-server localhost:9092 --topic user-actions --partitions 6

This command increases the partition count for the ‘user-actions’ topic to six. Note that the partition count can only be increased, never reduced, and adding partitions changes which partition each key hashes to, which can break per-key ordering for existing data.
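
The same increase can be performed programmatically via the Admin API; a minimal sketch, reusing the AdminClient setup shown earlier:

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
    // Grow user-actions to 6 partitions; existing records do not move
    admin.createPartitions(
            Collections.singletonMap("user-actions", NewPartitions.increaseTo(6))
    ).all().get();
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
}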

Replication Factor

Kafka also allows you to specify a replication factor to ensure data availability and fault tolerance. A replication factor of N means each partition is replicated to N brokers in the cluster, so the topic can tolerate up to N - 1 broker failures without losing data.

Note that kafka-topics.sh cannot change the replication factor of an existing topic; --alter only adjusts partition counts. To increase the replication factor, you write a partition reassignment plan (a JSON file mapping each partition to its new replica set) and hand it to the reassignment tool:

bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file increase-replication.json --execute
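
An illustrative increase-replication.json for a three-broker cluster (broker ids 0, 1, and 2 are placeholders; adjust them to your own cluster) might look like this:

{
  "version": 1,
  "partitions": [
    {"topic": "user-actions", "partition": 0, "replicas": [0, 1, 2]},
    {"topic": "user-actions", "partition": 1, "replicas": [1, 2, 0]},
    {"topic": "user-actions", "partition": 2, "replicas": [2, 0, 1]}
  ]
}

Re-running the same command with --verify instead of --execute reports whether the reassignment has completed.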

Data Writing and Reading Concepts

When writing to a partition, producers usually rely on a key to ensure that all messages for a particular key go to the same partition. Consumers, on the other hand, read from topics by subscribing to them. They can either read from the beginning, the end, or resume from where they last stopped.
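
For records with a non-null key, Kafka’s default partitioner hashes the serialized key with murmur2 and takes the result modulo the partition count. A minimal sketch of that mapping, using Kafka’s own hashing helper (this mirrors the built-in behavior for illustration; the producer does it for you):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

// Records with the same key always hash to the same partition
byte[] keyBytes = "actionKey".getBytes(StandardCharsets.UTF_8);
int numPartitions = 3;
int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
System.out.println("Key 'actionKey' maps to partition " + partition);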

Writing to Partitions

Here’s a simple Java snippet for a producer to send data to the ‘user-actions’ topic:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Producer configuration: broker address and key/value serializers
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// All records share the key "actionKey", so they land in the same partition
for (int i = 0; i < 100; i++) {
    producer.send(new ProducerRecord<>("user-actions", "actionKey", "actionValue" + i));
}

// Flush any buffered records and release resources
producer.close();
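
Note that send() is asynchronous: it returns immediately and the record is batched in the background. To observe delivery results you can pass a callback as the second argument to send(); a minimal sketch reusing the producer above:

producer.send(
    new ProducerRecord<>("user-actions", "actionKey", "actionValue"),
    (metadata, exception) -> {
        if (exception != null) {
            exception.printStackTrace(); // delivery failed after all retries
        } else {
            System.out.printf("stored in partition %d at offset %d%n",
                    metadata.partition(), metadata.offset());
        }
    });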

Reading from Partitions

A Java example to consume data from the ‘user-actions’ topic is presented below:

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Consumer configuration: broker address, consumer group, and deserializers
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("user-actions"));

// Poll in a loop; each record reports the partition offset it was read from
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
    }
}

This code uses a consumer to subscribe to the ‘user-actions’ topic and output the messages it reads, showing the offset (unique position) of each message in the partition.
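
One detail the loop above glosses over is shutdown: the while (true) loop never exits on its own. A common pattern (a sketch, not the only option) is to call wakeup() from a JVM shutdown hook and catch the resulting WakeupException:

import org.apache.kafka.common.errors.WakeupException;

// wakeup() is thread-safe and makes a blocked poll() throw WakeupException
Runtime.getRuntime().addShutdownHook(new Thread(consumer::wakeup));
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        // process records here ...
    }
} catch (WakeupException e) {
    // expected during shutdown; nothing to do
} finally {
    consumer.close(); // leaves the group cleanly and commits offsets if auto-commit is on
}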

Conclusion

Kafka’s design around topics and partitions is central to its ability to provide a high-performance, distributed streaming platform. By understanding these concepts and putting them into practice with examples like the ones above, you’re well equipped to tailor Kafka to your stream processing needs.