Apache Kafka: How to Purge Topics and Messages

Updated: January 31, 2024 By: Guest Contributor

Introduction

Apache Kafka is a powerful distributed event streaming platform that can handle high volumes of data. It’s widely used for building real-time streaming data pipelines and applications. Managing the data—specifically, purging topics and messages—can be an essential part of maintaining performance and ensuring compliance with data retention policies. Throughout this tutorial, we’ll explore the different methods for purging topics and messages in Apache Kafka, complete with code examples to guide you through the process.

Understanding Kafka Data Retention

Before purging data in Kafka, it’s important to understand how data retention works. Kafka stores records in topics which are divided into partitions. Each partition is an ordered, immutable sequence of records that is continually appended to. Records within a partition are assigned a unique offset. Kafka clusters store topics across multiple servers, ensuring fault tolerance and high availability.

Data is retained in Kafka based on two primary factors: time and size. You can configure retention policy per topic:

  • retention.ms: How long a message is retained before its log segment becomes eligible for deletion (time-based retention).
  • retention.bytes: The maximum size a partition’s log may grow to before the oldest segments are discarded (size-based retention).
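Before changing anything, it can help to check which overrides are already in effect on a topic. A minimal sketch using the standard kafka-configs.sh tool, assuming a broker at localhost:9092 and a hypothetical topic named your-topic-name:

```shell
# Show the per-topic configuration overrides (retention.ms, retention.bytes, etc.)
$ bin/kafka-configs.sh --bootstrap-server localhost:9092 \
    --entity-type topics --entity-name your-topic-name --describe
```

Topics with no overrides inherit the broker-wide defaults, so an empty result simply means the cluster defaults apply.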

Modifying Topic Configurations

You can alter existing topic configurations to set retention policies:

$ bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name your-topic-name --alter --add-config 'retention.ms=1000'

Keep in mind that lowering these values does not immediately trigger data deletion. Kafka periodically runs a process that checks for data eligible for deletion.
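How often that cleanup process runs is governed by a broker-side setting. A sketch of the relevant server.properties entry, shown here with its documented default of five minutes:

```properties
# server.properties (broker-side): how frequently the log cleaner checks
# for segments eligible for deletion under retention.ms / retention.bytes
log.retention.check.interval.ms=300000
```

This is why a lowered retention.ms may take a few minutes to visibly delete data.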

Force Deletion of a Topic

In some cases, you may need to completely remove a topic and its data. This can be done using the Kafka command line tools:

$ bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic your-topic-name

Note that topic deletion must be enabled in the broker settings (delete.topic.enable=true, which is the default in modern Kafka versions). After issuing the delete command, Kafka marks the topic for deletion, and it is eventually purged from all brokers in the cluster.
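To confirm the deletion has completed, you can list the remaining topics; a quick check assuming the same localhost broker:

```shell
# List all topics; the deleted topic should no longer appear once purging completes
$ bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
```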

Purging Records Within a Topic

If you want to delete records within a topic without deleting the topic itself, you can use the log compaction feature in Kafka or alter retention configurations:

  • Log Compaction: Enable log compaction (cleanup.policy=compact) on a topic so that Kafka retains only the latest record for each key; older records with the same key are eventually removed.
  • Retention Policies: Temporarily lower the retention window, wait for the cleaner to delete the old segments, and then remove the override so the topic falls back to the broker default:

$ bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name your-topic-name --alter --add-config 'retention.ms=1000'

# Wait for the retention check to run, then remove the override to restore the default retention period
$ bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name your-topic-name --alter --delete-config 'retention.ms'
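Beyond compaction and retention tweaks, Kafka also ships a dedicated tool, kafka-delete-records.sh, which truncates partitions up to a given offset without touching any retention settings. A sketch, assuming the hypothetical your-topic-name has a partition 0 and you want to remove every record with an offset below 100 (an offset of -1 would delete up to the current end of the partition):

```shell
# delete-records.json describes which partitions to truncate and up to which offset
$ cat > delete-records.json <<'EOF'
{
  "partitions": [
    { "topic": "your-topic-name", "partition": 0, "offset": 100 }
  ],
  "version": 1
}
EOF

# Delete all records in partition 0 with an offset lower than 100
$ bin/kafka-delete-records.sh --bootstrap-server localhost:9092 \
    --offset-json-file delete-records.json
```

Unlike the retention trick above, this takes effect immediately and moves the partition's log start offset forward.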

Using Kafka Streams for Purging

Another way to purge data from Kafka is by using Kafka Streams, the stream processing library of the Kafka ecosystem. Kafka Streams allows you to create applications that continually process records and can produce a new, “purged” version of a topic.

A simple Kafka Streams application that filters out records and writes to a new topic could look like this:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-purge-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("source-topic");
// Drop null-valued records; everything else is forwarded to the new topic
source.filter((key, value) -> value != null)
      .to("purged-topic");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

// Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));

Note: All code examples assume Kafka’s default installation directory; adjust paths accordingly for your setup. Always test these commands in a development environment before running them in production, and be cautious with purging operations to avoid accidental data loss.

Conclusion

In this tutorial, you’ve learned how to manage Kafka data retention effectively, including modifying topic configurations and using Kafka Streams to purge data. Remember to carefully manage topic deletion and record purging, as these are irreversible operations. With the proper use of Kafka’s data management features, you can ensure that your streaming data platform remains efficient, organized, and in compliance with data retention requirements.