Stream Processing Concepts in ksqlDB for Confluent Platform¶
ksqlDB enables stream processing, which is a way to compute over events as they arrive, rather than in batches at a later time. These events come from Apache Kafka® topics. In ksqlDB, events are stored in a stream, which is a Kafka topic with a defined schema.
When you create a stream in ksqlDB, if the backing Kafka topic doesn’t exist, ksqlDB creates it with the specified number of partitions. The stream’s metadata (schema, serialization scheme, etc.) is stored in ksqlDB’s command topic, which is an internal communication channel. Each ksqlDB server keeps a local copy of this metadata.
Events are added to a stream as rows, which are essentially Kafka records with extra metadata. ksqlDB uses a Kafka producer to insert these records into the backing Kafka topic. The data itself is persisted in Kafka, not on the ksqlDB servers.
ksqlDB offers a SQL-like interface for transforming streams. You can create new streams derived from existing ones by selecting and manipulating columns. This is done with persistent queries. For example, you can filter a stream or convert the case of a string field.
ksqlDB also supports stateful operations. This means that the processing of an event can depend on the accumulated effects of previous events. State can be used for simple aggregations, like counting events, or more complex operations, like feature engineering for machine learning. Each parallel instance of a ksqlDB application handles events for a specific group of keys, and the state for those keys is kept locally. This allows for high throughput and low latency.
For fault tolerance, ksqlDB uses state snapshots and stream replay. Snapshots capture the entire state of the pipeline, including offsets in the input queues and the state derived from processed data. In case of failure, the pipeline can be restored from the snapshot, and it can replay the stream from the saved offsets. Sources tables are not kept entirely in state. Stateless operations, like filtering and projections, don’t require state.
- Apache Kafka and ksqlDB: A quick overview of Kafka.
- Connectors in ksqlDB: Connectors source and sink data from external systems.
- Events in ksqlDB: An event is the fundamental unit of data in stream processing.
- Joins: Joins are how to combine data from many streams and tables into one.
- User-defined Functions: Extend ksqlDB to invoke custom code written in Java.
- Lambda Functions: Lambda functions enable you to apply in-line functions without creating a full UDF.
- Materialized Views: Materialized views precompute the results of queries at write-time so reads become predictably fast.
- Queries: Queries are how you process events and retrieve computed results.
- Stream Processing: Stream processing is a way to write programs computing over unbounded streams of events.
- Streams: A stream is an immutable, append-only collection of events that represents a series of historical facts.
- Tables: A table is a mutable collection of events that models change over time.
- Time and Windows: Windows help you bound a continuous stream of events into distinct time intervals.