Batch Processing for Efficiency¶
Apache Kafka® has been optimized for efficiency because efficiency is key component of effective multi-tenant operations.
For example, a primary use case for Kafka is the handling web activity data, which is very high volume; each page view may produce dozens of message writes. Each of these published messages is read by one or or more consumers, so consumption should be as as efficient as possible.
In a multi-tenant system, a small bump in application usage can result in a bottleneck in the downstream infrastructure service.
To help with this issue, Kafka is designed to be very fast. It is more likely that an application will tip-over under load before the Kafka infrastructure does.
There are two common causes of inefficiency in a multi-tenant system:
- Excessive byte copying
- Too many small I/O operations
The next sections describe how Kafka avoids the inefficiencies.
Avoids excessive byte copying¶
Excessive byte-copying results in inefficiency. At low message rates, excessive byte copying is not an issue, but under load, the impact is significant. To avoid this, Kafka employs a standardized binary message format that is shared by the producer, broker, and the consumer so that data chunks are transferred without modification.
In contrast to a typical system where the operating system and application are doing multiple reads and writes, before data is sent over the network, Kafka expects multiple consumers for a topic. Using the zero-copy optimization offered by modern Unix and Linux operating sytems with the sendfile system call, data is copied into the pagecache exactly once and reused on each consumption instead of being stored in memory and copied out to user-space every time it is read. This enables messages to be consumed at a rate that approaches the limit of the network connection.
Using this combination of pagecache and
sendfile means that on a Kafka cluster where the
consumers are mostly caught up you will see no read activity on the disks as they are serving
data entirely from cache.
Batches I/O operations¶
Too many small I/O operations can occur both between the client and the server and in the server’s own persistent operations. To avoid this, the Kafka protocol is built around a “message set” abstraction that naturally groups messages together. This allows network requests to group messages together and amortize the overhead of the network roundtrip rather than sending a single message at a time. The server in turn appends chunks of messages to its log in one go, and the consumer fetches large linear chunks at a time.
This simple optimization increases speed by orders of magnitude. Batching leads to larger network packets, larger sequential disk operations, contiguous memory blocks, and more, all of which enables Kafka to turn a bursty stream of random message writes into linear writes that flow to consumers.
End-to-end batch compression¶
In some cases the bottleneck is actually not CPU or disk, but network bandwidth. This is particularly true for a data pipeline that needs to send messages between data centers over a wide-area network. Of course, the user can always compress their messages, one at a time, without support from Kafka. However, this can lead to poor compression ratios because of repetition between messages of the same type. For example, field names in JSON or user agents in web logs. Efficient compression requires compressing batches of messages together rather than compressing each message individually.
Kafka supports the compression of batches of messages with an efficient batching format. A batch of messages can be compressed and sent to the server, and this batch of messagess is written in compressed form and will remain compressed in the log. The batch will only be decompressed by the consumer.
Kafka supports GZIP, Snappy, LZ4 and ZStandard compression protocols. More details can be found on the Compression topic on the Kafka wiki.