Kafka: a Distributed Messaging System for Log Processing

Notation Legend

#something : number of something
: excerpt from paper
() : an opinion or some information that’s not present in paper

Pre-Read Thoughts

Kafka is a pretty hyped-up tool for its throughput, at least for me (since way before I knew distributed systems and all)

I know most of the high-level APIs (as I’ve worked with them) and some internals because of its popularity, so I’m focusing more on message delivery semantics, if the paper covers any

Note that in current versions, ZooKeeper’s role (as coordinator) is deprecated and replaced with KRaft, which is based on Raft (Raft is taking over everywhere there’s consensus!)

Introduction

Kafka is a distributed messaging system for collecting and delivering high volumes of log data with low latency

Log data ⇒ user activity (clicks, likes), operational & system metrics (latency, h/w)

Related Work

Kafka is different from traditional messaging systems

They offer features that degrade throughput, like atomic delivery of msgs and per-msg delivery acks, and they’re hard to distribute

Kafka is pull-based, not push-based: consumers fetch msgs from brokers at their own pace

Kafka Architecture and Design Principles

A stream of messages of a particular type is defined by a topic. A producer can publish messages to a topic. The published messages are then stored at a set of servers called brokers. A consumer can subscribe to one or more topics from the brokers, and consume the subscribed messages by pulling data from the brokers.

a consumer simply creates a msg stream iterator on a topic; the iterator never ends, it just blocks until new msgs are published

a topic can be split into multiple partitions spread across multiple brokers

image 6.png

Efficiency on a Single Partition

Simple Storage

image 1 2.png

each partition corresponds to a logical log, which is physically a set of segment files of roughly the same size (e.g. 1 GB)

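for better perf, the broker flushes a segment file to disk only after a configurable number of msgs has been published or a certain amount of time has elapsed; a msg is exposed to consumers only after it is flushed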
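(A minimal Python sketch of that flush policy, not from the paper. SegmentBuffer, its thresholds and the injectable clock are made-up names for illustration; a real broker writes to files, while this toy only moves msgs between two lists.)

```python
import time


class SegmentBuffer:
    """Toy broker-side log: msgs become visible to consumers only after a flush."""

    def __init__(self, flush_every_msgs=3, flush_every_secs=1.0, clock=time.monotonic):
        self.flush_every_msgs = flush_every_msgs  # hypothetical count threshold
        self.flush_every_secs = flush_every_secs  # hypothetical time threshold
        self.clock = clock                        # injectable so tests can fake time
        self.pending = []                         # appended, not yet flushed
        self.flushed = []                         # "on disk", visible to consumers
        self.last_flush = clock()

    def append(self, msg):
        self.pending.append(msg)
        too_many = len(self.pending) >= self.flush_every_msgs
        too_old = self.clock() - self.last_flush >= self.flush_every_secs
        # Simplification: the time threshold is only checked on append.
        if too_many or too_old:
            self.flush()

    def flush(self):
        self.flushed.extend(self.pending)
        self.pending.clear()
        self.last_flush = self.clock()

    def visible(self):
        """What a consumer is allowed to read right now."""
        return list(self.flushed)
```

With flush_every_msgs=3, the first two appended msgs stay invisible and all three show up together once the third arrives.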

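each msg is ID’ed by its logical offset in the log (there’s no explicit msg ID); next msg’s offset = current offset + current msg’s length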

consumer consumes sequentially

the broker keeps an in-mem idx (shown on the left of the figure) of sorted offsets, one entry per segment file: the offset/ID of its first msg
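(A small Python sketch of how such an index can be searched, not from the paper. The base offsets below are made-up values and find_segment / next_offset are hypothetical helper names; the only ideas taken from the paper are that offsets are positions in the log and that the index holds each segment’s first offset in sorted order.)

```python
import bisect

# Sorted first-msg offsets, one per segment file (illustrative values only).
segment_base_offsets = [0, 1000, 2500, 4000]


def next_offset(offset, msg_len):
    """Offsets are byte positions, so the next msg starts msg_len bytes later."""
    return offset + msg_len


def find_segment(offset):
    """Return (base offset of the segment holding `offset`, position inside it)."""
    # bisect_right gives the insertion point; the entry before it is the last
    # segment whose first offset is <= the requested offset.
    i = bisect.bisect_right(segment_base_offsets, offset) - 1
    if i < 0:
        raise ValueError("offset is before the first segment")
    base = segment_base_offsets[i]
    return base, offset - base
```

A binary search over a handful of integers is why the broker can serve "give me data from offset X" without any per-msg index.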

Efficient Transfer

the Kafka broker does no explicit caching of its own; it just relies on the OS page cache

this also helps when the broker process restarts, since the cache lives at the OS level and not the app level, so it stays warm

to send msgs to consumers, it uses the sendfile sys call, which moves bytes from the file straight to the socket without copying them through application buffers
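(A hedged Python illustration of the same idea, not Kafka’s code. socket.sendfile() is the standard-library call, which uses os.sendfile() where the platform supports it and falls back to plain send() otherwise. send_log_chunk, the temp file and the socket pair standing in for a broker/consumer connection are assumptions for the demo.)

```python
import os
import socket
import tempfile


def send_log_chunk(sock, path, offset, count):
    """Send `count` bytes of the file at `path`, starting at `offset`, over `sock`."""
    with open(path, "rb") as f:
        # Where os.sendfile() is available, the kernel copies file data to the
        # socket directly, with no intermediate buffer in this process.
        return sock.sendfile(f, offset=offset, count=count)


# Demo: a fake "segment file" holding three 6-byte msgs.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"msg-0|msg-1|msg-2|")
    segment_path = tmp.name

broker_side, consumer_side = socket.socketpair()
sent = send_log_chunk(broker_side, segment_path, offset=6, count=12)

received = b""
while len(received) < sent:
    received += consumer_side.recv(1024)

broker_side.close()
consumer_side.close()
os.remove(segment_path)
```

Starting at offset 6 skips the first msg, so the consumer side receives only the second and third msgs.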

Stateless Broker

the broker doesn’t keep any record of which msgs each consumer has consumed; the consumer tracks that itself

msgs are retained for a configurable period (typically 7 days) and then deleted; within that window a consumer can rewind to an older offset and replay msgs, e.g. after an app crash
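(A toy Python model of the stateless-broker idea, not from the paper. Broker, Consumer, fetch, poll and rewind are made-up names, and a list index stands in for the offset here, whereas the paper’s offsets are byte positions in the log.)

```python
class Broker:
    """Toy stateless broker: stores msgs, keeps no per-consumer state."""

    def __init__(self):
        self.log = []  # list index stands in for the offset in this sketch

    def append(self, msg):
        self.log.append(msg)

    def fetch(self, offset, max_msgs=10):
        # The consumer says where to start; the broker just reads from there.
        return self.log[offset:offset + max_msgs]


class Consumer:
    def __init__(self, broker):
        self.broker = broker
        self.offset = 0  # consumption state lives on the consumer side

    def poll(self):
        msgs = self.broker.fetch(self.offset)
        self.offset += len(msgs)
        return msgs

    def rewind(self, offset):
        """Go back to an older offset to re-consume (replay) msgs."""
        self.offset = offset
```

Because the broker never deletes a msg just because it was read, replaying is only a matter of the consumer asking for an older offset again.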

Distributed Coordination

a producer can publish to a randomly selected partition, or to one chosen by a partition key and a partitioning function
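(A short Python sketch of both options, not from the paper. pick_partition is a hypothetical helper, and hashing the key with CRC32 is my assumption; the paper doesn’t say which partitioning function is used.)

```python
import random
import zlib


def pick_partition(num_partitions, key=None):
    """Pick a partition at random, or deterministically from a key."""
    if key is None:
        return random.randrange(num_partitions)
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for bytes.
    return zlib.crc32(key) % num_partitions
```

Keyed partitioning means all msgs with the same key land in the same partition, which matters because ordering is only preserved within a partition.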

consumers are grouped into consumer groups; each msg of a subscribed topic is delivered to only one consumer within a group

Kafka uses Zookeeper for the following tasks

  1. detecting the addition and the removal of brokers and consumers
  2. triggering a rebalance process in each consumer when the above events happen
  3. maintaining the consumption relationship and keeping track of the consumed offset of each partition.

The broker registry contains the broker’s host name and port, and the set of topics and partitions stored on it.

The consumer registry includes the consumer group to which a consumer belongs and the set of topics that it subscribes to.
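(A rough Python sketch of what a rebalance could compute, under stated assumptions. The idea of sorting partitions and consumers and giving each consumer a contiguous range follows the paper’s rebalance description as I understand it; handing the leftover partitions to the first consumers is my own choice, and assign_partitions plus the partition/consumer names are made up.)

```python
def assign_partitions(partitions, consumers, me):
    """Return the partitions that consumer `me` should own after a rebalance."""
    parts = sorted(partitions)
    group = sorted(consumers)
    i = group.index(me)
    # Each consumer gets n partitions; the first `extra` consumers get one more
    # (remainder handling is an assumption of this sketch).
    n, extra = divmod(len(parts), len(group))
    start = i * n + min(i, extra)
    end = start + n + (1 if i < extra else 0)
    return parts[start:end]
```

Since every consumer sorts the same registry contents, each one can run this independently and still arrive at a consistent split, with every partition owned by exactly one consumer in the group.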

Delivery Guarantees

only at-least-once delivery is guaranteed; exactly-once typically requires two-phase commit (2PC) and isn’t necessary for their use case

if duplicates matter, the application has to add its own de-duplication logic, using either the offsets returned to the consumer or some unique key within the msg
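(A minimal Python sketch of offset-based de-duplication on the application side, not from the paper. DedupingHandler is a made-up name, and it assumes msgs of one partition arrive in order. A real app would also have to persist last_seen together with its results, which this toy skips.)

```python
class DedupingHandler:
    """Drops msgs whose offset was already processed for that partition."""

    def __init__(self):
        self.last_seen = {}   # partition -> highest offset already processed
        self.processed = []   # stands in for the app's real side effects

    def handle(self, partition, offset, payload):
        if offset <= self.last_seen.get(partition, -1):
            return False      # redelivered msg, skip it
        self.processed.append(payload)
        self.last_seen[partition] = offset
        return True
```

If a consumer crashes and re-reads from its last saved offset, the redelivered msgs have offsets at or below last_seen and are skipped.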

msgs from a single partition are always delivered to a consumer in order; there’s no ordering guarantee across partitions

to detect log corruption, Kafka stores a CRC for each msg in the log
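(A small Python sketch of a per-msg CRC check, not Kafka’s actual on-disk format. The record layout here, a 4-byte big-endian CRC32 followed by the payload, and the encode / decode names are assumptions for illustration.)

```python
import struct
import zlib


def encode(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 (4 bytes, big-endian)."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def decode(record: bytes) -> bytes:
    """Return the payload, or raise if the stored CRC doesn't match."""
    (stored,) = struct.unpack(">I", record[:4])
    payload = record[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("CRC mismatch: record is corrupted")
    return payload
```

Flipping even a single bit of the payload changes its CRC32, so decode() rejects the damaged record instead of handing it to the consumer.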

Kafka Usage at LinkedIn & Experimental Results

image 2 2.png

from reference pdf

Post-Read Thoughts

The paper is very light; its main content is Kafka’s architecture & design principles, along with distributed coordination and delivery guarantees

I just realized that this paper, Kafka: a Distributed Messaging System for Log Processing, is exactly what I heard about 2 years back in ByteByteGo’s "Why is Kafka Fast?"

Some interesting current features were still future work at the time of writing

  • Message replication
  • Dead letter queue

Further Reading

Kafka Replication