How to Achieve High Availability in Kafka (with Examples)

Updated: January 31, 2024 By: Guest Contributor

Introduction

Apache Kafka is a distributed streaming platform that enables you to build real-time data pipelines and applications. At the heart of its design is the ability to handle high volumes of data and provide high availability and resilience to node and network failures. This feature is critical for systems that require constant uptime and can’t afford to lose data. In this article, we’ll explore how to achieve high availability in Kafka with practical examples and best practices.

Understanding Kafka’s Architecture

To understand high availability in Kafka, you need to know its basic components:

  • Broker: A server in a Kafka cluster responsible for storing data and serving client (producer and consumer) requests.
  • ZooKeeper: A centralized service for maintaining configuration information, naming, and synchronization for distributed applications, including Kafka.
  • Topic: A category or feed name to which messages are published.
  • Partition: A division within a topic. Each partition can be replicated across a set of brokers.
  • Replica: A copy of a partition. Kafka replicates partitions to enable high availability.

Replication Factor

The replication factor defines the number of copies (replicas) that Kafka makes of each partition within a topic. A higher replication factor increases the availability and fault tolerance of your Kafka system. Below is an example of how to set the replication factor when creating a topic:

bin/kafka-topics.sh --create \
    --zookeeper zookeeper1:2181,zookeeper2:2181,zookeeper3:2181 \
    --replication-factor 3 \
    --partitions 6 \
    --topic my-high-availability-topic

This command creates a topic named my-high-availability-topic with a replication factor of 3, making its data resilient to the failure of up to two brokers. Note that the --zookeeper flag applies to older Kafka versions; from Kafka 2.2 onward you can pass --bootstrap-server broker1:9092 instead, and the --zookeeper flag was removed entirely in Kafka 3.0.

Brokers and High Availability

For Kafka to be highly available, it must tolerate individual broker failures. By provisioning enough brokers and designing your cluster with failure in mind, you can achieve this resilience. Broker properties are configured in the server.properties file, usually located in Kafka's config directory. Here’s an example setting that affects availability:

min.insync.replicas=2

This setting requires that at least two replicas be in sync for a write to succeed when the producer uses acks=all; if fewer replicas are in sync, the broker rejects the write rather than risk data loss. Combined with a sufficient replication factor, this ensures that even if a broker fails, another replica can serve the acknowledged data without loss.
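Alongside min.insync.replicas, a few related broker settings shape availability. The fragment below is a sketch of a server.properties excerpt with illustrative values: default.replication.factor applies to automatically created topics, and disabling unclean leader election prevents an out-of-sync replica from becoming leader (preferring temporary unavailability over data loss):

```properties
# Illustrative server.properties fragment – tune values for your cluster.
# Replication factor used for automatically created topics.
default.replication.factor=3
# Minimum in-sync replicas required for acks=all writes to succeed.
min.insync.replicas=2
# Never elect an out-of-sync replica as leader; prefer unavailability
# over silent data loss during severe failures.
unclean.leader.election.enable=false
```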

ZooKeeper and Quorum

Your Kafka cluster relies on ZooKeeper for configuration management and coordination. For high availability, you need to set up a ZooKeeper ensemble – a group of ZooKeeper servers that communicate with each other. A quorum (a majority of ensemble members) must be operational for the ensemble to function. Therefore, the size of the ensemble impacts its availability. The general advice is to run an odd number of servers (usually three or five), which lets the ensemble tolerate the failure of a minority of servers while still maintaining a quorum.
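The quorum arithmetic behind this advice can be made concrete. A minimal sketch: an ensemble of n servers needs a majority (n/2 + 1) alive, so it tolerates floor((n - 1) / 2) failures, which is why an even-sized ensemble buys no extra fault tolerance over the next smaller odd size:

```java
// Quorum math for a ZooKeeper ensemble.
public class QuorumMath {
    // A majority of the ensemble must be alive for it to function.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Number of server failures the ensemble can survive.
    static int toleratedFailures(int ensembleSize) {
        return (ensembleSize - 1) / 2; // integer division = floor
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 4, 5}) {
            System.out.printf("ensemble=%d quorum=%d tolerates=%d%n",
                    n, quorumSize(n), toleratedFailures(n));
        }
    }
}
```

Note that a 4-server ensemble tolerates only one failure, the same as a 3-server ensemble, while a 5-server ensemble tolerates two.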

Producer and Consumer Configuration

Producers are clients that write data to Kafka. Producers can be configured for reliability using the acks parameter. Here’s an example:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;

Properties props = new Properties();
// List more than one broker so the client can bootstrap even if one is down
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
// Wait for all in-sync replicas to acknowledge each write
props.put("acks", "all");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);

Setting acks to all ensures that the producer waits for acknowledgments from all in-sync replicas. This can reduce throughput but increases data durability.

Similarly, consumers – clients that read data from Kafka – can benefit from configuration settings geared toward high availability. While there’s less consumer configuration to manage in terms of availability, setting up your consumer to handle rebalances and to commit offsets correctly is important.
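As a sketch of what such availability-minded consumer settings might look like (broker addresses and group name are illustrative; constructing the actual KafkaConsumer is omitted since it requires a running cluster):

```java
import java.util.Properties;

// Sketch of availability-minded consumer configuration.
public class ConsumerConfigSketch {
    static Properties haConsumerProps() {
        Properties props = new Properties();
        // Multiple brokers so the client can bootstrap despite one failure
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("group.id", "my-ha-consumer-group");
        // Commit offsets manually after processing, so a rebalance or
        // crash does not silently skip unprocessed records.
        props.put("enable.auto.commit", "false");
        // Session/heartbeat tuning: detect a dead consumer and rebalance
        // its partitions quickly, without flapping on brief pauses.
        props.put("session.timeout.ms", "10000");
        props.put("heartbeat.interval.ms", "3000");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(haConsumerProps());
    }
}
```

With enable.auto.commit disabled, the application calls commitSync() (or commitAsync()) only after records are fully processed, trading a little throughput for at-least-once delivery across rebalances.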

Maintenance and Monitoring

A highly available system also requires diligent maintenance and monitoring. Kafka tools such as kafka-reassign-partitions.sh can be used to manually alter the topic partition layout and replica assignment in the cluster, which is useful during maintenance work such as taking a broker down for upgrades.
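As a sketch, reassignment is driven by a JSON file describing the desired replica placement; the topic name and broker ids below are illustrative:

```json
{
  "version": 1,
  "partitions": [
    {"topic": "my-high-availability-topic", "partition": 0, "replicas": [1, 2, 3]}
  ]
}
```

You would pass this file to kafka-reassign-partitions.sh via --reassignment-json-file together with --execute, then re-run the tool with --verify to confirm the reassignment has completed.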

For monitoring, Kafka exposes operational metrics via JMX, which can be collected and visualized using tools like Prometheus and Grafana. You’ll want to monitor key metrics such as under-replicated partitions, active controller count, and request latencies, which are indicative of the health and performance of your Kafka cluster.

Conclusion

In this tutorial, we explored different strategies and configurations to achieve high availability with Apache Kafka. By carefully planning your cluster setup, configuring producers and consumers correctly, diligently monitoring, and performing routine maintenance, you can create a robust Kafka system that serves your needs for real-time data with minimal downtime. It’s essential to understand that achieving high availability is a continual process of monitoring, tuning, and adjusting your system’s architecture as your needs evolve.