Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks which allows Kafka to turn a bursty stream of random message writes into linear writes.
Kafka was developed by LinkedIn in 2010, and it has been a top-level Apache project since 2012. It is a highly scalable, durable, robust, and fault-tolerant publish-subscribe event streaming platform.
- Real-time processing of application activity tracking, like searches.
- Stream processing
- Log aggregation, where Kafka consolidates logs from multiple services (producers) and standardises the format for consumers.
- An interesting use case that has emerged is the microservices architecture. Kafka can be a suitable choice for event sourcing microservices where a lot of events are generated and we want to keep track of the sequence of events (i.e. what has happened).
Let’s talk a little about the basic components that Kafka uses for its publish-subscribe messaging system. A producer is an entity/application that publishes data to a Kafka cluster, which is made up of brokers. A broker is responsible for receiving and storing the data when a producer publishes. A consumer then consumes data from a broker at a specified offset, i.e. position.
That is, it’s a multi-producer, multi-consumer structure, and it looks something like this:
What does a basic unit of data look like in Kafka? This is generally called a message or a record (interchangeably). A message contains the data and also the metadata. The metadata contains information such as the offset, a timestamp, compression type, and etc.
These messages are organised into logical groupings or categories which are called a topic, to which producers publish data. Typically, messages in a topic are spread across different partitions in different brokers. A broker manages many partitions.
A producer can publish to multiple topics. You can define what your topics are and which topics a producer publishes to. In a similar vein, consumers can choose which topics they want to subscribe to as well. In some ways, this is similar to reading and writing to database tables.
A topic is then divided into partitions, where each contains a subset of a topic’s messages. A broker can have multiple partitions. Why are there multiple partitions for a topic? Primarily it is to increase throughput; parallel access to the topic can occur.
Further, the Kafka brokers also give us reliability and data protection using replication. If a broker fails, then all the partitions assigned to that broker would become unavailable.
To resolve this issue, there is the concept of a replica, i.e. a duplicate of each partition. You can specify the number of replicas a partition has. At a given point in time, all replicas are identical to the original partition — i.e. “leader” — unless it hasn’t caught up to the most recent data in the leader.
What is unique about Kafka is that it keeps all the messages for a set amount of time (this can be indefinitely). Each message has an offset, or position, in this message log. Instead of Kafka managing which message a consumer is up to, Kafka delegates this responsibility entirely to the consumer itself. By doing this, Kafka is able to support many more consumers.
-
Install the Java Development Kit (JDK): http://www.oracle.com/technetwork/java/javase/downloads/index.html
-
Download latest version from https://kafka.apache.org/downloads
-
Unpack latest version
tar -xzf cd
- Start Zookeeper
bin/zookeeper-server-start.sh config/zookeeper.properties
- Start Kafka Server
bin/kafka-server-start.sh config/server.properties
- For Windows
Open Kafka version folder you downloaded and extracted and create a new folder called - data. Add Zookeeper and Kafka folder.
Copy the path for zookeeper (C:\kafka_version\data\zookeeper) and put it in temp variable inside zookeeper.properties inside config folder (use notepad ++ to open it) and change forward slash to backward slash.
Then in terminal type: C:\Kafka_version id > zookeeper-server-start.bat config\zookeeper.properties
If all goes well then you see binding port 0.0.0.0:2181
Repeat it for kafka now by opening the xfile server.properties from config and change the path variable log.dirs to (C:\kafka_version\data\kafka)
Then in terminal type: C:\Kafka_version id > kafka-server-start.bat config\server.properties
This code is made for Unix-based systems such as Linux and Mac OS. For Windows use bin\windows\ instead of bin/, and change the script extension to .bat
- Run built-in script to create new topic named "test" with 1 partition on 1 node
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
- See the topic
bin/kafka-topics.sh --list --zookeeper localhost:2181
- Run the producer in terminal and enter some messages
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
message 1
message 2
message 3
- In a new terminal window read the messages
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
message 1
message 2
message 3
If you go side-by-side with the terminal windows you can type in the producer window and see the results appear in the consumer
- Work in progress. To be updated soon.
- https://www.youtube.com/watch?v=PzPXRmVHMxI
- https://medium.com/better-programming/an-introduction-to-apache-kafka-95a82260c1c3
- https://towardsdatascience.com/kafka-python-explained-in-10-lines-of-code-800e3e07dad1
- https://medium.com/swlh/apache-kafka-in-a-nutshell-5782b01d9ffb (Detailed Read)
- https://medium.com/hacking-talent/kafka-all-you-need-to-know-8c7251b49ad0
- http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#supported-queries