Kafka and the File System

Apache Kafka® relies heavily on the file system for storing and caching messages.

A key reason for this is that hard drive throughput has been diverging from the disk seek latency for the last decade. For example, a JBOD configuration with six 7200rpm SATA RAID-5 array:

  • Performs linear writes at about 600 MBps.
  • Performs random writes at about 100 Kbps, over 6000 times slower than linear writes

Linear reads and writes provide a predictable usage pattern, and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes for these linear operations. To learn more, read the ACM Queue article . Of note, the article authors find that sequential disk access can in some cases be faster than random memory access.

Because of the performance divergence between throughput and disk seeks, modern operating systems have become increasingly aggressive in their use of main memory for disk caching. A modern OS will divert all free memory to disk caching with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache. This feature is hard to turn off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in the OS’s page cache, effectively storing everything twice.

Kafka and the JVM

Kafka is built on top of the JVM, and Java memory usage has the following characteristics:

  1. The memory overhead of objects is very high, often doubling the size of the data stored (or worse).
  2. Java garbage collection becomes increasingly slow as the in-heap data increases.

As a result, using the file system and relying on the kernel page cache is better than an in-memory cache or other structure. The page cache provides at least double the available cache by providing automatic access to all free memory, and likely double again by storing a compact byte structure rather than objects.

For example, by using the file system, a 32 GB machine can have a cache of up to 28-30GB, without GC penalties. Furthermore, this cache will stay warm even if the service is restarted, whereas the in-process cache will need to be rebuilt in memory, which takes a long time.

With Kafka, instead of maintaining as much as possible and flushing it to the file system when memory space runs out, all data is immediately written to a persistent log on the file system without flushing to disk. This means the data is essentially transferred into the kernel’s page cache. All logic for maintaining coherency between the cache and file system is in the OS, which tends to do so more efficiently and more correctly than one-off in-process attempts.

Persistent queue and constant time

A typical messaging system might use per-consumer queues with associated BTrees or similar data structures to maintain message metadata. BTrees are a versatile data structure, but they come with a fairly high cost; Btree operations are O(log N), and O(log N), which is typically considered equivalent to constant time.

In contrast, disk operations are slower. Disk seeks are typically 10 MS each, and each disk can do only one seek at a time, so parallelism is limited. This means that even 5-6 disk seeks leads to a lot of overhead. Since storage systems mix fast cached operations with slow physical disk operations, as data increases with a fixed cache, the performance of tree structures is often superlinear, meaning doubling your data makes things worse than twice as slow.

Instead, a persistent queue could be built on simple reads and appends to files like you typically see with logging solutions. Then all operations are O(1) and reads do not block writes or each other. This offers performance advantages because performance is decoupled from the data size. One server can use a number of cheap, low-rotational speed 1+TB SATA drives. They have poor seek performance, but these drives perform acceptably for large reads and writes and are inexpensive.

Kafka has access to virtually unlimited disk space without any performance penalty, which means that Kafka can provide some features not usually found in a messaging system. For example, in Kafka, instead of attempting to delete messages as soon as they are consumed, Kafka can retain messages for a relatively long period (a week, for example), which enables flexibility for consumers.


This website includes content developed at the Apache Software Foundation under the terms of the Apache License v2.