What is Apache Kafka?

Data is everywhere. Enterprises generate it constantly, and each byte produced has a story to tell. To get that story, we need to move the data from where it is created to where it can be analyzed. The less effort we spend moving data around, the more we can focus on the core business at hand.

Kafka was originally developed at LinkedIn by the engineers who later founded Confluent. Today, Apache Kafka is used by more than 80% of Fortune 100 companies to power real-time, data-driven backend operations and rich front-end experiences.

Any time scientists disagree, it’s because we have insufficient data. Then we can agree on what kind of data to get; we get the data; and data solves the problem. Either I’m right, or you are right, or we both are wrong. And, we move on.

Neil deGrasse Tyson

Publish/Subscribe Messaging

Publish/subscribe messaging is a pattern in which the sender, the publisher, does not direct a message to a specific receiver. Instead, the publisher classifies messages on a central broker, and the receiver, the subscriber, subscribes to receive certain classes of messages.
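The essence of the pattern can be sketched in a few lines of Python. The Broker class below is a toy in-memory stand-in for a real message broker; none of these names are Kafka APIs:

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker: publishers classify messages by topic,
    and subscribers receive every message published to the topics
    they have subscribed to."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher never addresses a receiver directly; the broker
        # fans the message out to whoever subscribed to this topic.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("metrics", received.append)
broker.publish("metrics", {"cpu": 0.42})
broker.publish("logs", "ignored by the metrics subscriber")
print(received)  # the metrics subscriber saw only the metrics message
```

Note that the publisher's call to `publish` never names a receiver; the decoupling lives entirely in the broker's subscription table.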

Have you ever wondered how the different layers of a metrics-publishing system interact with each other? In a typical setup, they interact through direct connections between the layers. Have a look.

These connections become hard to trace when every publisher holds a direct connection to every receiver.

What if we introduce a publish/subscribe messaging system into the diagram above? A broker takes custody of the published messages, and receivers subscribe to pick them up whenever there is a need. Have a look at the diagram below.

Kafka Producers and Kafka Consumers

You can use Kafka as a message bus, a queue, or a data storage platform, but you will always use it the same way: by writing data to Kafka (producers), reading data from Kafka (consumers), or building an application that serves both roles.

Consider the following example to make this concrete.

Let us imagine a credit card processing system. A client application, say an online store, is responsible for sending each transaction to Kafka as soon as a payment is made. Another application reads each transaction and immediately validates whether it should be approved or denied according to the rules engine. The approve/deny response can be written back to Kafka and relayed to the online store where the transaction originated. A third application can read both the transactions and their approval status from Kafka and store them in a database, where analysts can later review them and perhaps improve the rules engine.
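As a rough sketch of that flow, with in-memory queues standing in for Kafka topics and a toy amount threshold in place of a real rules engine (all names here are illustrative, not Kafka APIs):

```python
from queue import Queue

# In-memory stand-ins for two Kafka topics.
transactions_topic = Queue()
responses_topic = Queue()

def online_store_producer(txn):
    # App 1: the online store writes each payment to Kafka.
    transactions_topic.put(txn)

def rules_engine_consumer():
    # App 2: read a transaction, validate it, write the verdict back.
    txn = transactions_topic.get()
    status = "approved" if txn["amount"] <= 1000 else "denied"
    responses_topic.put({**txn, "status": status})

analytics_db = []
def analytics_consumer():
    # App 3: store transaction + verdict for analysts to review later.
    analytics_db.append(responses_topic.get())

online_store_producer({"card": "4111", "amount": 250})
rules_engine_consumer()
analytics_consumer()
print(analytics_db)
```

The point of the sketch is that the three applications never call each other directly; each one only reads from or writes to a topic.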

Kafka Producers – Deep Dive

Below are a few use cases for which an application might need to write data to Kafka:

  1. Recording user activities for audit analysis
  2. Recording metrics
  3. Storing log messages
  4. Recording information from smart appliances
  5. Communicating asynchronously with other applications
  6. Buffering information before writing to a database

Let us look at a high-level overview of the producer components with the help of a diagram.

Producing Kafka messages starts with a ProducerRecord, which includes the topic we want to send the record to and a value, and optionally a key and a partition. Once the ProducerRecord is sent, the first thing the producer does is serialize the key and value objects to byte arrays so they can be sent over the network.

After serialization, the data is passed to the partitioner. If we specified a partition in the ProducerRecord, the partitioner does nothing and simply returns that partition. If we didn't, the partitioner chooses a partition for us, usually based on the ProducerRecord key. Once a partition is selected, the producer knows which topic and partition the record is headed to. A separate thread is responsible for sending batches of records to the appropriate Kafka brokers.
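The partitioning decision can be illustrated with a small sketch. Kafka's default partitioner hashes the serialized key with murmur2; the md5 hash below is just a stand-in to keep the example self-contained:

```python
import hashlib

def choose_partition(key, num_partitions, explicit_partition=None):
    """Sketch of the partitioner's decision: honor an explicitly chosen
    partition, otherwise derive one from the key. (Kafka's default
    partitioner uses murmur2; md5 here is only an illustration.)"""
    if explicit_partition is not None:
        return explicit_partition
    key_bytes = key.encode("utf-8")  # keys are serialized to bytes first
    digest = hashlib.md5(key_bytes).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition, which is what
# preserves per-key ordering in Kafka.
p1 = choose_partition("customer-42", num_partitions=4)
p2 = choose_partition("customer-42", num_partitions=4)
assert p1 == p2
print(choose_partition("customer-42", 4, explicit_partition=2))  # -> 2
```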

When the broker receives the messages, it sends back a response. If the messages were successfully written to Kafka, it returns a RecordMetadata object with the topic, partition, and offset of the record within the partition. If the broker failed to write the messages, it returns an error. When the producer receives an error, it may retry sending the messages a few more times before giving up and surfacing the error.
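The retry behavior can be sketched as a simple loop. Real producers expose this through configuration such as `retries` and `retry.backoff.ms`; the `send_with_retries` helper and the flaky fake broker below are purely illustrative:

```python
import time

def send_with_retries(send_fn, record, retries=3, backoff_s=0.1):
    """Sketch of the producer's retry loop: retry transient failures a
    few times with a growing backoff, then give up and raise."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return send_fn(record)      # success: a RecordMetadata-like value
        except ConnectionError as err:  # treated as a retriable error
            last_error = err
            time.sleep(backoff_s * (attempt + 1))
    raise last_error                    # exhausted retries: surface the error

# A fake broker that fails twice before accepting the record.
attempts = {"n": 0}
def flaky_send(record):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("broker unavailable")
    return {"topic": "Topic1", "partition": 0, "offset": 17}

print(send_with_retries(flaky_send, b"payload"))
```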

Kafka Consumers – Deep Dive

Applications that need to read data from Kafka use a KafkaConsumer to subscribe to topics and receive messages from them.

In order to understand how to read data from Kafka, we must first be aware of the two most important concepts i.e., Consumers and Consumer Groups.

Consumers and Consumer Groups

Just imagine this for a while and address the question thereafter. We have an application that needs to read messages from Kafka. To achieve this, our application creates a consumer object, subscribes to the topic, starts reading messages, and concludes by writing out the results. The question here is: what happens if the rate at which producers write messages to the topic exceeds the rate at which our application can validate them? Think.

If we are limited to a single consumer reading and validating messages, our application will fall further and further behind, unable to keep up with the incoming messages. If you are thinking of suggesting that we need to scale consumption from the topic, then I would say you are on the right track. Remember that in Kafka, multiple producers can write to the same topic. In a similar fashion, we need to allow multiple consumers to read from the same topic, splitting the data between them. This is why Kafka consumers are typically part of a consumer group.

Now, have a look at the diagram below. Let us take topic T1 with four partitions. Suppose we create a new consumer, C1, which is the only consumer in group G1, and use it to subscribe to topic T1. C1 will receive all messages from all four partitions of T1.

If we add another consumer, C2, to group G1, each consumer will receive messages from two partitions: perhaps P0 and P2 go to C1, and P1 and P3 go to C2.

If we add two more consumers, C3 and C4, to group G1, each consumer will receive messages from exactly one partition: P0 to C1, P1 to C2, P2 to C3, and P3 to C4.
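The assignments described above can be reproduced with a small round-robin sketch. In real Kafka, the group coordinator performs the assignment using a pluggable strategy (`partition.assignment.strategy`); this helper is only an illustration:

```python
def assign_partitions(partitions, consumers):
    """Round-robin sketch of spreading a topic's partitions across the
    consumers in a group. (Kafka's real assignors include range,
    round-robin, and sticky strategies.)"""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

partitions = ["P0", "P1", "P2", "P3"]
print(assign_partitions(partitions, ["C1"]))                    # C1 reads all four
print(assign_partitions(partitions, ["C1", "C2"]))              # two partitions each
print(assign_partitions(partitions, ["C1", "C2", "C3", "C4"]))  # one partition each
```

Note that adding a fifth consumer would leave it idle, since a partition is never shared by two consumers in the same group.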

The main way we scale data consumption from a Kafka topic is by adding more consumers to a consumer group. This matters because it is common for Kafka consumers to perform high-latency operations, such as writing to a database or running time-consuming computations on the data.

Setting up Kafka and its practical understanding

To set up Kafka on your system and try out the concepts above, follow the roadmap below.

1. Download the Apache Kafka binaries from the official Apache Kafka downloads page.

2. Extract the binaries to a folder of your choice. The C:\ drive is preferred, to keep paths short on Windows.

3. Open a command prompt and run the following commands.

– The first thing we need is to start the ZooKeeper service.

.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties

The ZooKeeper service will start on localhost:2181.

– Then, we need to start the Kafka broker.

.\bin\windows\kafka-server-start.bat .\config\server.properties

The Kafka broker will connect to the ZooKeeper service and come up on localhost:9092.

– Our services are up and running. We can now create a topic to which messages will be published and from which they will be consumed. Here, we are creating one topic with a single partition and a replication factor of 1.

.\bin\windows\kafka-topics.bat --create --replication-factor 1 --partitions 1 --topic Topic1 --bootstrap-server localhost:9092

This will create the topic.

– All our prerequisites are in place to write messages to the topic. We will start a console producer to write messages.

.\bin\windows\kafka-console-producer.bat --bootstrap-server localhost:9092 --topic Topic1

Any lines typed at the prompt are written to Topic1 as messages.

– The final step is to start a console consumer to read messages from Topic1.

.\bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic Topic1 --from-beginning

The messages we published to Topic1 will be consumed by the console consumer and displayed to us.

Why Kafka?

Kafka is the de facto technology developers and architects use to build a new generation of scalable streaming applications that power real-time data. Below are the main reasons why Kafka is so popular.

1. High Throughput

2. High Scalability

3. Low Latency

4. Permanent Storage

5. High Availability

In the next episode, I will discuss how performance engineering practices can ramp up an organization's digital transformation efforts. Till then, if you enjoyed reading this piece, do share it with your network.
