Kafka Streams – The Power without the Weight¶
Kafka Streams, a component of open source Apache Kafka, is a powerful, easy-to-use library for building highly scalable, fault-tolerant, distributed stream processing applications on top of Apache Kafka. It builds upon important concepts for stream processing such as properly distinguishing between event-time and processing-time, handling of late-arriving data, and efficient management of application state.
The following list highlights several key capabilities and aspects of Kafka Streams that make it a compelling choice for building use cases such as stream processing applications, continuous queries and transformations, and microservices.
- Highly scalable, elastic, fault-tolerant
- Stateful and stateless processing
- Event-time processing with windowing, joins, aggregations
- No dedicated cluster required
- No external dependencies
- “It’s a library, not a framework.”
- Fully integrated
- 100% compatible with Kafka 0.10.0.0
- Easy to integrate into existing applications
- No artificial rules for deploying applications
- Millisecond processing latency
- Does not micro-batch messages
- Windowing with out-of-order data
- Allows for arrival of late data
A closer look¶
Before we dive into the details such as the concepts and architecture of Kafka Streams or getting our feet wet by following the Kafka Streams quickstart guide, let us provide more context to the previous list of capabilities.
- Stream Processing Made Simple: Designed as a lightweight library in Apache Kafka, much like the Kafka producer and consumer client libraries. You can easily embed and integrate Kafka Streams into your own applications, which is a significant departure from framework-based stream processing tools that dictate many requirements upon you such as how you must package and “submit” processing jobs to their cluster.
- Has no external dependencies on systems other than Apache Kafka and can be used in any Java application. Read: You do not need to deploy and operate a separate cluster for your stream processing needs. Your Operations and Info Sec teams, among others, will surely be happy to hear this.
- Leverages Kafka as its internal messaging layer instead of (re)implementing a custom messaging layer like many other stream processing tools. Notably, Kafka Streams uses Kafka’s partitioning model to horizontally scale processing while maintaining strong ordering guarantees. This ensures high performance, scalability, and operational simplicity for production environments. A key benefit of this design decision is that you do not have to understand and tune two different messaging layers – one for moving data streams at scale (Kafka) plus a separate one for your stream processing tool. Similarly, any performance and reliability improvements of Kafka will automatically be available to Kafka Streams, too, thus tapping into the momentum of Kafka’s strong developer community.
- Is agnostic to resource management and configuration tools, so it integrates much more seamlessly into the existing development, packaging, deployment, and operational practices of your organization. You are free to use your favorite tools such as Java application servers, Puppet, Ansible, Mesos, YARN, Docker – or even to run your application manually on a single machine for proof-of-concept scenarios.
- Supports fault-tolerant local state, which enables very fast and efficient stateful operations like joins and windowed aggregations. Local state is replicated to Kafka so that, in case of a machine failure, another machine can automatically restore the local state and resume the processing from the point of failure.
- Employs one-record-at-a-time processing to achieve low processing latency, which is crucial for a variety of use cases such as fraud detection. This makes Kafka Streams different from micro-batch based stream processing tools.
Furthermore, Kafka Streams has a strong focus on usability and a great developer experience. It offers all the necessary stream processing primitives to allow applications to read data from Kafka as streams, process the data, and then either write the resulting data back to Kafka or send the final output to an external system. Developers can choose between a high-level DSL with commonly used operations like
join, as well as a low-level API for developers who need maximum control and flexibility.
Finally, Kafka Streams helps with scaling developers, too – yes, the human side – because it has a low barrier to entry and a smooth path to scale from development to production: You can quickly write and run a small-scale proof-of-concept on a single machine because you don’t need to install or understand a distributed stream processing cluster; and you only need to run additional instances of your application on multiple machines to scale up to high-volume production workloads. Kafka Streams transparently handles the load balancing of multiple instances of the same application by leveraging Kafka’s parallelism model.
In summary, Kafka Streams is a compelling choice for building stream processing applications. Give it a try and run your first Hello World Streams application! The next sections in this documentation will get you started.
Reading tip: If you are interested in learning about our original motivation to create Kafka Streams, you may want to read the Confluent blog post Introducing Kafka Streams: Stream Processing Made Simple.
- Kafka 0.10.0.0-cp1 (earlier 0.9, 0.8, 0.7 versions of Kafka are not supported)
- [Optional] For additional Avro schema support: Confluent Schema Registry 3.0.0