
Apache Kafka: How to Purge Topics and Messages

Last updated: January 31, 2024

Introduction

Apache Kafka is a powerful distributed event streaming platform that can handle high volumes of data. It’s widely used for building real-time streaming data pipelines and applications. Managing the data—specifically, purging topics and messages—can be an essential part of maintaining performance and ensuring compliance with data retention policies. Throughout this tutorial, we’ll explore the different methods for purging topics and messages in Apache Kafka, complete with code examples to guide you through the process.

Understanding Kafka Data Retention

Before purging data in Kafka, it’s important to understand how data retention works. Kafka stores records in topics, which are divided into partitions. Each partition is an ordered, immutable sequence of records that is continually appended to, and each record in a partition is assigned a unique, sequential offset. Kafka clusters store topics across multiple servers for fault tolerance and high availability.

Data is retained in Kafka based on two primary factors: time and size. Both can be configured per topic:

  • retention.ms: how long a record is kept, in milliseconds, before it becomes eligible for deletion.
  • retention.bytes: the maximum size a partition’s log may grow to before its oldest segments become eligible for deletion.
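
To check a topic’s current retention settings programmatically, you can query them with Kafka’s AdminClient. Below is a minimal sketch, assuming a broker at localhost:9092 and a placeholder topic named your-topic-name:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ShowRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Fetch the topic's configuration and print the two retention settings
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "your-topic-name");
            Config config = admin.describeConfigs(List.of(topic)).all().get().get(topic);
            System.out.println("retention.ms    = " + config.get("retention.ms").value());
            System.out.println("retention.bytes = " + config.get("retention.bytes").value());
        }
    }
}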

Modifying Topic Configurations

You can alter existing topic configurations to set retention policies:

$ bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name your-topic-name --alter --add-config 'retention.ms=1000'

Keep in mind that lowering these values does not immediately trigger data deletion. Kafka runs a periodic retention check (controlled by the broker setting log.retention.check.interval.ms) and only then removes data that has become eligible for deletion.
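
If you manage topics from application code rather than the shell, the AdminClient’s incrementalAlterConfigs call achieves the same effect. A minimal sketch, using the same placeholder broker address and topic name:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // SET overrides retention.ms on the topic, like --add-config 'retention.ms=1000'
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "your-topic-name");
            AlterConfigOp op = new AlterConfigOp(new ConfigEntry("retention.ms", "1000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}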

Force Deletion of a Topic

In some cases, you may need to completely remove a topic and its data. This can be done using the Kafka command line tools:

$ bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic your-topic-name

Note that topic deletion must be enabled in the Kafka server settings (delete.topic.enable=true, which is the default in modern Kafka releases). After issuing the delete command, Kafka marks the topic for deletion and eventually purges it from all nodes in the cluster.
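
The same deletion can be performed from code with the AdminClient’s deleteTopics call. A minimal sketch under the same placeholder assumptions:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class DeleteTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Marks the topic for deletion; the brokers purge its data asynchronously
            admin.deleteTopics(List.of("your-topic-name")).all().get();
        }
    }
}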

Purging Records Within a Topic

If you want to delete records within a topic without deleting the topic itself, you can use the log compaction feature in Kafka or alter retention configurations:

  • Log Compaction: Enable log compaction (cleanup.policy=compact) so that Kafka retains only the latest record for each key; producing a null-valued “tombstone” record for a key eventually removes that key entirely.
  • Retention Policies: Temporarily lower retention.ms so existing records become eligible for deletion on the next retention check, then restore the original setting, as shown below.

$ bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name your-topic-name --alter --add-config 'retention.ms=1000'

# Wait for the retention check to remove the data, then delete the override so the broker default applies again
$ bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name your-topic-name --alter --delete-config 'retention.ms'
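
For more surgical purging, the Kafka Admin API also offers deleteRecords, which truncates a partition by deleting every record below a given offset. Below is a minimal sketch; the partition number (0) and offset (100) are illustrative values you would replace with your own:

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

public class PurgeRecords {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Delete all records in partition 0 whose offsets are below 100
            TopicPartition partition = new TopicPartition("your-topic-name", 0);
            admin.deleteRecords(Map.of(partition, RecordsToDelete.beforeOffset(100L))).all().get();
        }
    }
}

The command-line counterpart is bin/kafka-delete-records.sh, which reads the partitions and target offsets from a JSON file passed via --offset-json-file.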

Using Kafka Streams for Purging

Another way to purge data from Kafka is by using Kafka Streams, the stream processing library of the Kafka ecosystem. Kafka Streams allows you to create applications that continually process records and can produce a new, “purged” version of a topic.

A simple Kafka Streams application that filters out null-valued records and writes the remainder to a new topic could look like this (assuming String keys and values):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// Basic configuration; adjust the bootstrap server for your cluster
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-purging-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

// Drop null-valued records and write the remainder to a new topic
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("source-topic");
source.filter((key, value) -> value != null)
      .to("purged-topic");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

// Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));

Note: All code examples assume Kafka’s default installation directory; adjust paths for your setup. Always test these commands in a development environment before running them in production, and be cautious with purging operations to avoid accidental data loss.

Conclusion

In this tutorial, you’ve learned how to manage Kafka data retention effectively, including modifying topic configurations and using Kafka Streams to purge data. Remember to carefully manage topic deletion and record purging, as these are irreversible operations. With the proper use of Kafka’s data management features, you can ensure that your streaming data platform remains efficient, organized, and in compliance with data retention requirements.
