An Introduction to Apache Kafka & Event Streaming

Updated: January 29, 2024 By: Guest Contributor

What is Apache Kafka?

Apache Kafka is an open-source stream-processing platform originally developed at LinkedIn and open-sourced in early 2011. It is written in Scala and Java. Kafka implements a distributed, publish-subscribe message bus built around stream processing, designed for high-throughput, low-latency handling of real-time data feeds. Its design is based largely on the transaction log found in databases: records are appended to a durable log, and a publish-subscribe model lets you send messages between processes, applications, and servers.

Kafka is widely used for building real-time streaming data pipelines that reliably move data between systems or applications, and for building real-time streaming applications that transform or react to streams of data. The platform provides a high level of durability, scalability, and fault tolerance by maintaining streams of records across a distributed cluster.

Core Components of Apache Kafka

  • Producer: Responsible for publishing records to Kafka topics.
  • Consumer: Reads records from topics.
  • Broker: Kafka is run as a cluster on one or more servers that can span multiple datacenters. The Kafka cluster stores streams of records in categories called topics.
  • Topic: A log of messages. Topics are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
  • Partition: Each topic is divided into a number of partitions, which hold records in an immutable, ordered sequence.

Kafka’s ability to retain and replay recorded events makes it stand apart from other messaging systems. It can serve as a kind of external commit log for a distributed system. It is an interesting blend of a messaging system and a log aggregation tool, and its durable, replayable log even lets it take on some database-like roles.
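
Topics and partitions are usually created from the command line, but they can also be managed programmatically. Below is a minimal sketch using Kafka’s Java AdminClient; the class name, the topic name "my-first-topic", the partition count, and the replication factor are illustrative values, not part of any original example.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A hypothetical topic with 3 partitions and replication factor 1 (single-broker setup)
            NewTopic topic = new NewTopic("my-first-topic", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
            System.out.println("Topic created.");
        }
    }
}

Each partition is an append-only log, and spreading a topic across several partitions is what allows Kafka to parallelize consumption across a consumer group.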

Setting Up Apache Kafka

The first step in using Apache Kafka is setting it up. The Kafka 2.x releases used below require a ZooKeeper service, so you’ll need to start by setting up ZooKeeper if you don’t have it running already.

# Download the latest Kafka release and extract it:
wget http://apache.mirrors.hoobly.com/kafka/2.6.0/kafka_2.13-2.6.0.tgz
tar -xzf kafka_2.13-2.6.0.tgz
cd kafka_2.13-2.6.0

# Start the ZooKeeper service
bin/zookeeper-server-start.sh config/zookeeper.properties

# Now start the Kafka server:
bin/kafka-server-start.sh config/server.properties

Kafka comes with built-in examples for producers and consumers. However, for a better understanding of how to implement these components, it’s useful to see them in action through comprehensive examples.

Basic Producer Example

A producer sends messages to Kafka topics. The following code snippet shows a very basic example of a producer using Kafka’s Java API.

import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) throws Exception{
        String topicName = args[0];
        String key = "Key1";
        String value = "Value-1";
        
        // Producer configuration: cluster address and key/value serializers
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);

        // Publish a single record to the topic, then flush and release resources
        ProducerRecord<String, String> record = new ProducerRecord<>(topicName, key, value);
        producer.send(record);
        producer.close();
        
        System.out.println("SimpleProducer Completed.");
    }
}

This code connects to a Kafka cluster specified by “bootstrap.servers”, and sends a single message to the specified topic. The message has a key “Key1” and a value “Value-1”.
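
Note that send() is asynchronous: it returns a Future, and the example above simply relies on close() to flush the record. As a rough sketch, assuming the same producer and record as above (plus an additional import for org.apache.kafka.clients.producer.RecordMetadata), you could either block on the result or register a callback:

// Synchronous variant: block until the broker acknowledges the write (throws on failure)
RecordMetadata metadata = producer.send(record).get();
System.out.printf("Written to partition %d at offset %d%n",
    metadata.partition(), metadata.offset());

// Asynchronous variant: register a callback that fires when the send completes
producer.send(record, (meta, exception) -> {
    if (exception != null) {
        exception.printStackTrace();
    } else {
        System.out.printf("Acked at partition %d, offset %d%n", meta.partition(), meta.offset());
    }
});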

Basic Consumer Example

A consumer reads messages from Kafka topics. The following example shows a simple consumer:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        String topicName = args[0];

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "test");
        props.setProperty("key.deserializer", 
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", 
            "org.apache.kafka.common.serialization.StringDeserializer");

        Consumer<String, String> consumer = new KafkaConsumer<>(props);

        consumer.subscribe(Collections.singletonList(topicName));

        try {
            while (true) {
                // Poll for new records; print each record's key, value, partition, and offset
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                records.forEach(record ->
                    System.out.printf("Consumer Record:(%s, %s, %d, %d)%n",
                        record.key(), record.value(),
                        record.partition(), record.offset()));
            }
        } finally {
            consumer.close();
        }
    }
}

This simple Kafka consumer subscribes to the provided topic and continuously polls for new records, printing them out to the console.
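
By default the consumer commits offsets automatically in the background (enable.auto.commit defaults to true). If you prefer to commit only after records have actually been processed, a variation like the following sketch can be used; the class name and group id here are illustrative.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        String topicName = args[0];

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "test-manual");
        props.setProperty("enable.auto.commit", "false"); // take over offset management
        props.setProperty("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(topicName));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                records.forEach(record ->
                    System.out.printf("Processed: (%s, %s)%n", record.key(), record.value()));
                if (!records.isEmpty()) {
                    consumer.commitSync(); // commit only after the batch has been handled
                }
            }
        }
    }
}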

Kafka Streams

Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in Kafka clusters. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka’s server-side cluster technology. Here is a basic word count example using Kafka Streams:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Arrays;
import java.util.Properties;
import java.util.regex.Pattern;

public class WordCountApplication {

    public static void main(final String[] args) {
        Properties config = new Properties();
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> textLines = builder.stream("TextLinesTopic");
        // "\\W+" splits each line on runs of non-word characters
        Pattern pattern = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

        // Split lines into words, group by word, count occurrences, and stream the counts out
        KStream<String, Long> counts = textLines
            .flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
            .groupBy((key, word) -> word)
            .count()
            .toStream();
        counts.to("WordsWithCountsTopic", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), config);
        streams.start();

        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

This simple processor reads from a source topic, “TextLinesTopic”, breaks each text line into words, computes how often each word occurs, and writes the counts to a sink topic, “WordsWithCountsTopic”. Because the counts are Long values, they are written with an explicit String/Long serde pair via Produced.with rather than the default String serdes.

Advanced Concepts

As you dive deeper into Apache Kafka, you may encounter concepts like exactly-once semantics, Kafka Connect, and the KSQL query language. These advanced features take Kafka’s capabilities to new levels, enabling robust, scalable, and fault-tolerant stream processing.
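
For example, exactly-once delivery on the producer side is built on idempotence and transactions. The sketch below shows the general shape of a transactional producer; the transactional.id and the topic "demo-topic" are illustrative values, and production code would handle fencing and retriable errors more carefully.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("enable.idempotence", "true");           // required for exactly-once delivery
        props.put("transactional.id", "demo-tx-producer"); // illustrative id; must be unique per producer

        Producer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("demo-topic", "Key1", "Value-1"));
            producer.send(new ProducerRecord<>("demo-topic", "Key2", "Value-2"));
            producer.commitTransaction();                   // both records become visible atomically
        } catch (Exception e) {
            producer.abortTransaction();                    // roll back on failure
        } finally {
            producer.close();
        }
    }
}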

Conclusion

Apache Kafka stands out as a unified platform that integrates well with big data and stream-processing technologies. It meets the demands of high-throughput, low-latency messaging while providing durability and fault tolerance. With this primer, you should now have a solid foundation to embark on your Kafka journey and dig deeper into its capabilities.