Apache Kafka Glossary¶
New to Apache Kafka® and Confluent or looking for definitions? The terms below provide brief explanations and links to related content for important terms you’ll encounter when working with the Confluent event streaming platform.
- Admin API¶
- The Apache Kafka® REST API that enables administrators to manage and monitor Kafka clusters, topics, brokers, and other Kafka components.
- Ansible Playbooks for Confluent Platform¶
- A set of Ansible playbooks and roles that are designed to automate the deployment and management of Confluent Platform.
- Apache Kafka¶
An open source event streaming platform that provides a unified, high-throughput, low-latency, fault-tolerant, scalable, distributed, and secure data streaming platform.
Kafka is a publish-and-subscribe messaging system that enables distributed applications to ingest, process, and share data in real-time.
- audit log¶
A historical record of actions and operations that are triggered when an auditable event occurs.
Audit log records can be used to troubleshoot system issues, manage security, and monitor compliance by tracking administrative activity and data access and modification, monitoring sign-in attempts, and reconstructing security breaches and fraudulent activity.
- auditable event¶
An event that represents an action or operation that can be tracked and monitored for security purposes and compliance.
When an auditable event occurs, an auditable event method is triggered and an event message is sent to the audit log cluster and stored as an audit log record.
- authentication¶
The process of verifying the identity of a principal that interacts with a system or application. Authentication is often used in conjunction with authorization to determine whether a principal is allowed to access a resource and perform a specific action or operation on that resource.
Digital authentication requires one or more of the following: something a principal knows (a password or security question), something a principal has (a security token or key), or something a principal is (a biometric characteristic, such as a fingerprint or voiceprint).
Multi-factor authentication (MFA) requires two or more forms of authentication.
Related terms: authorization, identity, identity provider, identity pool, principal, role
- authorization¶
The process of evaluating and then granting or denying a principal a set of permissions required to access and perform operations on resources.
Related terms: authentication, identity, identity provider, identity pool, principal, role
- Avro¶
A data serialization and exchange framework that provides data structures, remote procedure call (RPC), a compact binary data format, and a container file, and that uses JSON to represent schemas.
Avro schemas ensure that every field is properly described and documented for use with serializers and deserializers. You can either send a schema with every message or use Schema Registry to store and retrieve schemas for use by consumers and producers to save bandwidth and storage space.
- batch processing¶
The method of collecting a large volume of data over a specific time interval, after which the data is processed all at once and loaded into a destination system.
Batch processing is often used when processing data can occur independently of the source and timing of the data. It is efficient for non-real-time data processing, such as data warehousing, reporting, and analytics.
- CIDR block¶
A group of IP addresses that are contiguous and can be represented as a single block. CIDR blocks are expressed using Classless Inter-Domain Routing (CIDR) notation, which includes an IP address and the number of bits in the network mask.
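As a brief illustration of CIDR notation, the following sketch uses Python's standard ipaddress module to inspect a block. The 10.0.0.0/24 range is only an example value, not one taken from Confluent documentation.

```python
import ipaddress

# 10.0.0.0/24: the first 24 bits form the network mask, leaving 8 bits for hosts.
block = ipaddress.ip_network("10.0.0.0/24")

print(block.num_addresses)      # how many contiguous addresses the block covers
print(block.network_address)    # first address in the block
print(block.broadcast_address)  # last address in the block

# Membership test: is a given address inside the block?
inside = ipaddress.ip_address("10.0.0.42") in block
outside = ipaddress.ip_address("10.0.1.1") in block
print(inside, outside)
```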
- Cluster Linking¶
A highly performant data replication feature that enables links between Kafka clusters to mirror data from one cluster to another. Cluster Linking creates perfect copies of Kafka topics, which keep data in sync across clusters. Use cases include geo-replication of data, data sharing, migration, disaster recovery, and tiered separation of critical applications.
- commit log¶
A log of all event messages about commits (changes or operations made) sent to a Kafka topic.
A commit log ensures that all event messages are processed at least once and provides a mechanism for recovery in the event of a failure.
The commit log is also referred to as a write-ahead log (WAL) or a transaction log.
- Confluent Cloud¶
A fully managed, cloud-native event streaming service powered by Apache Kafka® technology that removes the complexity of running and managing Kafka clusters and provides a unified, high-throughput, low-latency, fault-tolerant, scalable, distributed, and secure platform.
- Confluent Cloud network¶
An abstraction for a single tenant network environment that hosts Dedicated Kafka clusters in Confluent Cloud along with their single tenant services, like ksqlDB clusters and managed connectors.
- Confluent for Kubernetes (CFK)¶
- A cloud-native control plane for deploying and managing Confluent in private cloud environments through a declarative API.
- Confluent Platform¶
- A specialized distribution of Apache Kafka® at its core, with additional components for data integration, streaming data pipelines, and stream processing.
- Confluent REST Proxy¶
The Confluent REST Proxy provides a RESTful interface to an Apache Kafka® cluster, making it easy to produce and consume messages, view the state of the cluster, and perform administrative actions without using the native Kafka protocol or clients.
- Confluent Platform: REST Proxy
- Confluent Server¶
- A component of Confluent Platform that includes Kafka and additional commercial features. Confluent Server is fully compatible with Kafka, and you can migrate in place between Kafka and Confluent Server. For more information, see Confluent Platform Packages.
- Confluent Unit for Kafka (CKU)¶
A unit of horizontal scaling for Dedicated Kafka clusters in Confluent Cloud that provides preallocated resources.
CKUs determine the capacity of a Dedicated Kafka cluster in Confluent Cloud.
- Connect API¶
- The Kafka API that enables a connector to read event streams from a source system and write to a target system.
- Connect worker¶
A server process that runs a connector and performs the actual work of moving data in and out of Kafka topics.
A worker is a server process that runs on hardware independent of the Kafka brokers themselves. It is scalable and fault-tolerant, meaning you can run a cluster of workers that share the load of moving data in and out of Kafka from and to external systems.
- connector¶
- An abstract mechanism that enables communication, coordination, or cooperation among components by transferring data elements from one interface to another without changing the data.
- consumer¶
A client application that subscribes to (reads and processes) event messages from a Kafka topic.
The Streams API and the Consumer API are the two APIs that enable consumers to read event streams from Kafka topics.
- Consumer API¶
The Kafka API used for consuming (reading) event messages or records from Kafka topics. It enables a Kafka consumer to subscribe to a topic and read event messages as they arrive.
Batch processing is a common use case for the Consumer API.
- consumer group¶
A single logical consumer implemented with multiple physical consumers for reasons of throughput and resilience.
By dividing the partitions of a topic among the consumers in the group, consumers in the group can process messages in parallel, increasing message throughput and enabling load balancing.
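The idea of sharing partitions among group members can be sketched in a few lines. This is a simplified round-robin assignment for illustration only. The function and consumer names are hypothetical, and Kafka's actual assignment strategies are more involved.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin (illustrative only)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Six partitions of one topic shared by three consumers in the same group.
assignment = assign_partitions(
    [0, 1, 2, 3, 4, 5],
    ["consumer-a", "consumer-b", "consumer-c"],
)
print(assignment)
```

Each partition has exactly one owner within the group, which is what allows the members to process messages in parallel without duplicating work.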
- consumer lag¶
The number of consumer offsets between the latest message produced in a partition and the last message consumed by a consumer, that is, the number of messages pending to be consumed from a particular partition.
A large consumer lag, or a quickly growing lag, indicates that the consumer is unable to read from a partition as fast as the messages are available. This can be caused by a slow consumer, slow network, or slow broker.
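Consumer lag reduces to simple arithmetic per partition. The sketch below assumes you already have the latest produced offset and the last committed offset for each partition. The dictionaries and the function name are hypothetical sample data, not the output of any Kafka tool.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag: latest produced offset minus committed offset."""
    return {
        partition: end_offset - committed_offsets.get(partition, 0)
        for partition, end_offset in log_end_offsets.items()
    }


# Hypothetical offsets for three partitions of one topic.
log_end_offsets = {0: 120, 1: 95, 2: 300}
committed_offsets = {0: 120, 1: 90, 2: 150}

lag = consumer_lag(log_end_offsets, committed_offsets)
total_lag = sum(lag.values())
print(lag, total_lag)
```

Partition 0 is fully caught up, while partition 2 accounts for most of the pending messages, which is the kind of imbalance that points to a slow consumer for that partition.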
- consumer offset¶
A monotonically increasing integer value that uniquely identifies the position of an event record in a partition.
When a consumer acknowledges receiving and processing a message, it commits an offset value that is stored in a special internal topic.
- CRUD¶
- An acronym for the four basic operations that can be performed on data: Create, Read, Update, and Delete.
- custom connector¶
- A connector created using Connect plugins uploaded to Confluent Cloud by users. This includes connector plugins that are built from scratch, modified open-source connector plugins, or third-party connector plugins.
- data at rest¶
- Data that is physically stored on non-volatile media (such as hard drives, solid-state drives, or other storage devices) and is not actively being transmitted or processed by a system.
- data encryption key (DEK)¶
A symmetric key that is used to encrypt and decrypt data. The DEK is used in client-side field level encryption (CSFLE) to encrypt sensitive data. The DEK is itself encrypted using a key encryption key (KEK) that is only accessible to authorized users. The encrypted DEK and encrypted data are stored together. Only users with access to the KEK can decrypt the DEK and access the sensitive data.
Related terms: envelope encryption, key encryption key (KEK)
- data in motion¶
Data that is actively being transferred between source and destination, typically systems, devices, or networks.
Data in motion is also referred to as data in transit or data in flight.
- data in use¶
- Data that is actively being processed or manipulated in memory (RAM, CPU caches, or CPU registers).
- data ingestion¶
- The process of collecting, importing, and integrating data from various sources into a system for further processing, analysis, or storage.
- data mapping¶
The process of defining relationships or associations between source data elements and target data elements.
Data mapping is an important process in data integration, data migration, and data transformation, ensuring that data is accurately and consistently represented when it is moved or combined.
- data pipeline¶
A series of processes and systems that enable the flow of data from sources to destinations, automating the movement and transformation of data for various purposes, such as analytics, reporting, or machine learning.
A data pipeline is typically composed of a source system, a data ingestion tool, a data transformation tool, and a target system. A data pipeline covers the following stages: data extraction, data transformation, data loading, and data validation.
- data serialization¶
The process of converting data structures or objects into a format that can be stored or transmitted, and reconstructed later in the same or another computer environment.
Data serialization is a common technique for implementing data persistence, interprocess communication, and object communication. Confluent Schema Registry (in Confluent Platform) and Confluent Cloud Schema Registry support data serialization using serializers and deserializers for the following formats: Avro, JSON Schema, and Protobuf.
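A minimal round trip shows what serialization and deserialization mean in practice. This sketch uses plain JSON from the Python standard library. It does not use the Schema Registry serializers, and the order fields are made-up sample data.

```python
import json

# An in-memory object (hypothetical order event).
order = {"order_id": 1001, "item": "widget", "quantity": 3}

# Serialize: convert the object into bytes that can be stored or transmitted.
payload = json.dumps(order).encode("utf-8")

# Deserialize: reconstruct an equivalent object from the bytes.
restored = json.loads(payload.decode("utf-8"))

print(type(payload).__name__, restored == order)
```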
- data steward¶
- A person with data-related responsibilities, such as data governance, data quality, and data security.
- data stream¶
- A continuous flow of data records that are produced and consumed by applications.
- dead letter queue (DLQ)¶
- A queue where messages that could not be processed successfully by a sink connector are placed. Instead of stopping, the sink connector sends messages that could not be written successfully as event records to the DLQ topic while the sink connector continues processing messages.
- deserializer¶
A tool that converts a serial byte stream back into objects and parallel data. Deserializers work with serializers (known together as Serdes) to support efficient storage and high-speed data transmission over the wire. Confluent provides Serdes for schemas in Avro, Protobuf, and JSON Schema formats.
- ELT¶
An acronym for Extract-Load-Transform, where data is extracted from a source system and loaded into a target system before processing or transformation.
Compared to ETL, ELT is a more flexible approach to data ingestion because the data is loaded into the target system before transformation.
- envelope encryption¶
A cryptographic technique that uses two keys to encrypt data. The symmetric data encryption key (DEK) is used to encrypt sensitive data. The separate asymmetric key encryption key (KEK) is the master key used to encrypt the DEK. The DEK and encrypted data are stored together. Only users with access to the KEK can decrypt the DEK and access the sensitive data.
In Confluent Cloud, envelope encryption is used to enable client-side field level encryption (CSFLE). CSFLE encrypts sensitive data in a message before it is sent to Confluent Cloud and allows for temporary decryption of sensitive data when required to perform operations on the data.
Related terms: data encryption key (DEK), key encryption key (KEK)
- ETL¶
An acronym for Extract-Transform-Load, where data is extracted from a source system, transformed into a target format, and loaded into a target system.
Compared to ELT, ETL is a more rigid approach to data ingestion because the data is transformed before loading into the target system.
- event¶
A meaningful action or occurrence of something that happened.
Events that can be recognized by a program, either human-generated or triggered by software, can be recorded in a log file or other data store.
- event message¶
A record of an event that is sent to a Kafka topic, represented as a key-value pair.
Each event message consists of a key-value pair, a timestamp, the compression type, headers for metadata (optional), and a partition and offset ID (once the message is written). The key is optional and can be used to identify the event. The value is required and contains details about the event that happened.
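The parts of an event message listed above can be pictured as a small data structure. The class below is a hypothetical Python model for illustration. It is not a type from any Kafka client library, and the field names are assumptions.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EventMessage:
    value: bytes                                  # required: details of the event
    key: Optional[bytes] = None                   # optional: identifies the event
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))
    compression_type: str = "none"
    headers: Dict[str, bytes] = field(default_factory=dict)  # optional metadata
    partition: Optional[int] = None               # known only once written
    offset: Optional[int] = None                  # known only once written


message = EventMessage(value=b'{"order_id": 1001}', key=b"customer-42")
print(message.key, message.partition, message.offset)
```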
- event record¶
The record of the event that is stored in a Kafka topic.
Event records are organized and durably stored in topics. Examples of events include orders, payments, activities, or measurements. An event typically contains one or more data fields that describe the fact, as well as a timestamp that denotes when the event was created by its event source. The event may also contain various metadata, such as its source of origin (for example, the application or cloud service that created the event) and storage-level information (for example, its position in the event stream).
- event sink¶
- A consumer of events, which can include applications, cloud services, databases, IoT sensors, and more.
- event source¶
- A producer of events, which can include cloud services, databases, IoT sensors, mainframes, and more.
- event stream¶
- A continuous flow of event messages produced by an event source and consumed by one or more consumers.
- event streaming¶
The practice of capturing event data in real-time from data sources.
Event streaming is a form of data streaming that is used to capture, store, process, and react to data in real-time or retrospectively.
- event streaming platform¶
- A platform that events can be written to once, allowing distributed functions within an organization to react in real-time.
- exactly-once semantics¶
A guarantee that a message is delivered exactly once and in the order that it was sent.
Even if a producer retries sending a message, or a consumer retries processing a message, the message is delivered exactly once. This guarantee is achieved by the broker assigning a unique ID to each message and storing the ID in the consumer offset. The consumer offset is committed to the broker only after the message is processed. If the consumer fails to process the message, the message is redelivered and processed again.
- granularity¶
The degree or level of detail to which an entity (a system, service, or resource) is broken down into subcomponents, parts, or elements.
Entities that are fine-grained have a higher level of detail, while coarse-grained entities have a reduced level of detail, often combining finer parts into a larger whole.
In the context of access control, granular permissions provide precise control over resource access. They allow administrators to grant specific operations on distinct resources. This ensures users only have permissions tailored to their needs, minimizing unnecessary or potentially risky access.
- identity¶
A unique identifier that is used to authenticate and authorize users and applications to access resources.
Identity is often used in conjunction with access control to determine whether a user or application is allowed to access a resource and perform a specific action or operation on that resource.
Related terms: identity provider, identity pool, principal, role
- identity pool¶
A collection of identities that can be used to authenticate and authorize users and applications to access resources.
Identity pools are used to manage permissions for users and applications that access resources in Confluent Cloud. They are also used to manage permissions for Confluent Cloud service accounts that are used to access resources in Confluent Cloud.
- identity provider¶
A trusted provider that authenticates users and issues security tokens that are used to verify the identity of a user.
Identity providers are often used in single sign-on (SSO) scenarios, where a user can log in to multiple applications or services with a single set of credentials.
- internal topic¶
A topic, prefixed with double underscores (“__”), that is automatically created by a Kafka component to store metadata about the broker, partition assignment, consumer offsets, and other information.
Internal topic examples:
- JSON Schema¶
A declarative language used for data serialization and exchange to define data structures, specify formats, and validate JSON documents. It is a way to encode expected data types, properties, and constraints to ensure that all fields are properly described for use with serializers and deserializers.
- Kafka bootstrap server¶
The Kafka broker that a Kafka client first connects to when initiating a connection to a Kafka cluster. The bootstrap server returns metadata, which includes the addresses of all of the brokers in the Kafka cluster.
Although only one bootstrap server is required to connect to a Kafka cluster, multiple brokers can be specified in a bootstrap server list to provide high availability and fault tolerance in case a broker is unavailable. In Confluent Cloud, the bootstrap server is the general cluster endpoint.
- Kafka broker¶
A server in the Kafka storage layer that stores event streams from one or more sources.
A Kafka cluster is typically comprised of several brokers. Every broker in a cluster is also a bootstrap server, meaning if you can connect to one broker in a cluster, you can connect to every broker.
- Kafka client¶
Kafka clients allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner, even in the case of network problems or machine failures.
The Kafka client library provides functions, classes, and utilities that allow developers to create Kafka producer clients (Producers) and consumer clients (Consumers) using various programming languages. The primary way to build production-ready Producers and Consumers is by using your preferred programming language and a Kafka client library.
- Kafka cluster¶
A group of interconnected Kafka brokers that manage and distribute real-time data streaming, processing, and storage as if they are a single system.
By distributing tasks and services across multiple Kafka brokers, the Kafka cluster improves availability, reliability, and performance.
- Kafka Connect¶
The component of Apache Kafka® that provides data integration between databases, key-value stores, search indexes, file systems, and Kafka brokers.
Kafka Connect is an ecosystem of a client application and pluggable connectors. As a client application, Connect is a server process that runs on hardware independent of the Kafka brokers themselves. It is scalable and fault-tolerant, meaning you can run a cluster of Connect workers that share the load of moving data in and out of Kafka from and to external systems. Connect also abstracts the business of coding away from the user and instead requires only JSON configuration to run.
- Kafka listener¶
An endpoint that Kafka brokers bind to and use to communicate with clients.
For Kafka clusters, Kafka listeners are configured in the listeners property of the server.properties file. Advertised listeners are publicly accessible endpoints that are used by clients to connect to the Kafka cluster.
- Kafka Streams¶
A stream processing library for building streaming applications and microservices that transform (filter, group, aggregate, join, and more) incoming event streams in real-time to Kafka topics stored in a Kafka cluster.
The Streams API can be used to build applications that process data in real-time, analyze data continuously, and build data pipelines.
- Kafka topic¶
A user-defined category or feed name where event messages are stored and published by producers and subscribed to by consumers.
Each topic is a log of event messages. Topics are stored in one or more partitions, which distribute topic records across brokers in a Kafka cluster. Each partition is an ordered, immutable sequence of records that are continually appended to a topic.
- key encryption key (KEK)¶
A master key that is used to encrypt and decrypt other keys, specifically the data encryption key (DEK). Only users with access to the KEK can decrypt the DEK and access the sensitive data.
Related terms: data encryption key (DEK), envelope encryption
- Kora¶
The cloud-native streaming data service based on Apache Kafka® technology that powers the fully managed, scalable, and elastic Confluent Cloud event streaming platform for building real-time data pipelines and streaming applications.
- ksqlDB¶
- A streaming SQL database engine purpose-built for creating stream processing applications on top of Apache Kafka®.
- logical Kafka cluster (LKC)¶
A logical entity whose topic partitions are mapped to the brokers of a physical Kafka cluster (PKC) in Confluent Cloud.
One or more LKCs might be mapped to the same PKC. If the mapping is one-to-one, the LKC is said to be mapped to a single PKC (Dedicated). If the mapping is many-to-one, the LKC is said to be mapped to a multitenant Kafka cluster (Basic or Standard).
- multi-region cluster (MRC)¶
- A single Kafka cluster that replicates data between datacenters across regional availability zones.
- offset¶
An integer assigned to each message that uniquely represents its position within the partition of a Kafka topic, guaranteeing the ordering of records and allowing consumers to replay messages from any point in time.
Offsets are stored on the Kafka broker, and consumers are responsible for committing their own offsets. Kafka does not track which records have been read by a consumer and which have not. It is up to the consumer to track this information.
Related terms: consumer offset, producer offset, commit offset
To commit an offset is to acknowledge that a record has been consumed, and, should your consumer group fail, to continue from that offset.
Related terms: consumer offset, offset commit, replayability
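How offsets give records a stable position, and why that allows replay, can be shown with a toy append-only log. The PartitionLog class is a hypothetical in-memory stand-in for a single partition, not part of any Kafka library.

```python
class PartitionLog:
    """Toy append-only log for one partition (illustrative only)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return (offset, record) pairs starting at the given offset."""
        return [(i, r) for i, r in enumerate(self._records) if i >= offset]


log = PartitionLog()
offsets = [log.append(record) for record in ["created", "paid", "shipped"]]

# Replay: reading again from an earlier offset returns the same records in order.
replayed = log.read_from(1)
print(offsets, replayed)
```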
- offset commit¶
The process of a consumer acknowledging that an event message has been consumed and storing its current offset position for a specific partition within a consumer group.
When a consumer commits its offset, it is committing the offset for the next message it will consume. For example, if a consumer has an offset of 5, it has consumed messages 0 through 4 and will next consume message 5.
If the consumer crashes or is shut down, its partitions are reassigned to another consumer which initiates consuming from the last committed offset of each partition.
The offset commit is stored on a Kafka broker. When a consumer commits an offset, it sends a commit request to the Kafka cluster, specifying the partition and offset it wants to commit for a particular consumer group. The Kafka broker receiving the commit request then stores this offset in an internal topic.
Related terms: consumer offset, offset
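The "offset of 5 means messages 0 through 4 were consumed" rule can be traced with a small simulation. Everything here is a hypothetical in-memory model: the committed dictionary stands in for broker-side storage, and consume is not a Kafka client call.

```python
# One partition holding eight messages at offsets 0..7.
records = [f"message-{i}" for i in range(8)]

# Committed offsets keyed by (consumer group, partition).
# The stored value is the offset of the NEXT message to consume.
committed = {}


def consume(group, partition, max_records):
    """Read a batch starting at the committed offset, then commit the new position."""
    start = committed.get((group, partition), 0)
    batch = records[start:start + max_records]
    committed[(group, partition)] = start + len(batch)
    return batch


first_batch = consume("group-1", 0, 5)        # consumes messages 0 through 4
offset_after_first = committed[("group-1", 0)]

# A replacement consumer in the same group resumes from the committed offset.
second_batch = consume("group-1", 0, 5)
print(offset_after_first, second_batch)
```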
- partition¶
A unit of data storage that divides a topic into multiple, parallel event streams, each of which is stored on separate Kafka brokers and can be consumed independently.
Partitioning is a key concept in Kafka because it allows Kafka to scale horizontally by adding more brokers to the cluster. Partitions are also the unit of parallelism in Kafka. A topic can have one or more partitions, and each partition is an ordered, immutable sequence of event records that is continually appended to a partition log.
- physical Kafka cluster (PKC)¶
A Kafka cluster comprised of multiple brokers and ZooKeeper.
Each physical Kafka cluster is created on a Kubernetes cluster by the control plane. A PKC is not directly accessible by clients.
- principal¶
An entity that can be authenticated and granted permissions based on roles to access resources and perform operations. An entity can be a user (user account), group, service account, or identity pool.
Related terms: identity, identity pool, role, service account, user account
- private internet¶
- A closed, restricted computer network typically used by organizations to provide secure environments for managing sensitive data and resources.
- producer¶
A client application that publishes (writes) data to a topic in a Kafka cluster.
Producers write data to a topic and are the only clients that can write data to a topic. Each record written to a topic is appended to the partition of the topic that is selected by the producer.
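One common way for a producer to select a partition is to hash the record key, so that records with the same key always land in the same partition. The sketch below uses CRC32 from the Python standard library purely as a stand-in hash. Kafka client libraries use their own partitioning logic, and select_partition is a hypothetical name.

```python
import zlib


def select_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index (CRC32 is a stand-in hash)."""
    return zlib.crc32(key) % num_partitions


num_partitions = 6
first = select_partition(b"customer-42", num_partitions)
second = select_partition(b"customer-42", num_partitions)

# The same key always maps to the same partition, preserving per-key ordering.
print(first == second)
```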
- Producer API¶
The Kafka API that allows you to write data to a topic in a Kafka cluster.
The Producer API is used by producer clients to publish data to a topic in a Kafka cluster.
- Protobuf¶
Protocol Buffers (Protobuf) is an open-source data format used to serialize structured data for storage.
- public internet¶
- The global system of interconnected computers and networks that use TCP/IP to communicate with each other.
- rebalancing¶
The process of redistributing the partitions of a topic among the consumers of a consumer group for improved performance and scalability.
A rebalance can occur if a consumer has failed the heartbeat and has been excluded from the group, a consumer has voluntarily left the group, metadata has been updated for a consumer, or a consumer has joined the group.
- replayability¶
The ability to replay messages from any point in time.
Related terms: consumer offset, offset, offset commit
- replication¶
- The process of creating and maintaining multiple copies (or replicas) of data across different nodes in a distributed system to increase availability, reliability, redundancy, and accessibility.
- replication factor¶
- The number of copies of a partition that are distributed across the brokers in a cluster.
- role¶
A Confluent-defined job function assigned a set of permissions required to perform specific actions or operations on Confluent resources. A role is bound to a principal and to Confluent resources, and can be assigned to a user account, group, service account, or identity pool.
Related terms: identity, identity pool, principal, service account
- rolling restart¶
The process of restarting the brokers in a Kafka cluster with zero downtime by restarting one broker at a time and verifying that there are no under-replicated partitions on the broker before proceeding to the next broker.
Restarting the brokers one at a time allows for software upgrades, broker configuration updates, or cluster maintenance while maintaining high availability by avoiding downtime.
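The control flow of a rolling restart can be sketched as a loop that never restarts a broker while partitions are under-replicated. This is a simulation under stated assumptions: the callbacks and the fake cluster state below are hypothetical, and a real procedure would query broker metrics and use your own deployment tooling.

```python
def rolling_restart(broker_ids, restart_broker, count_under_replicated, wait):
    """Restart brokers one at a time, only while no partition is under-replicated."""
    restarted = []
    for broker_id in broker_ids:
        # Do not take a broker down while replicas are still catching up.
        while count_under_replicated() > 0:
            wait()
        restart_broker(broker_id)
        restarted.append(broker_id)
    # Final check so the procedure ends with a fully replicated cluster.
    while count_under_replicated() > 0:
        wait()
    return restarted


# Simulated cluster state; every name below is hypothetical.
state = {"under_replicated": 0, "log": []}


def fake_restart(broker_id):
    state["log"].append(f"restart {broker_id}")
    state["under_replicated"] = 2  # replicas on the restarted broker fall behind


def fake_count():
    return state["under_replicated"]


def fake_wait():
    state["log"].append("wait")
    state["under_replicated"] -= 1  # replicas catch up over time


order = rolling_restart([1, 2, 3], fake_restart, fake_count, fake_wait)
print(order, state["log"])
```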
- schema¶
The structured definition or blueprint used to describe the format and structure of event messages sent through the Kafka event streaming platform.
Schemas are used to validate the structure of data in event messages and ensure that producers and consumers are sending and receiving data in the same format. Schemas are defined in the Schema Registry.
- Schema Registry¶
A centralized repository that stores, manages, and validates schemas for Kafka topic message data. Schema Registry is built into Confluent Cloud as a managed service, available with the “Advanced” Stream Governance package, and offered as part of Confluent Enterprise for self-managed deployments.
The Schema Registry is a RESTful service that is integrated with Kafka and Connect to provide a central location for managing schemas and validating data. Producers and consumers of Kafka topics use schemas to ensure data consistency and compatibility as schemas evolve. Schema Registry is a key component of Stream Governance.
- Serdes¶
Serializers and deserializers that convert objects and parallel data into a serial byte stream for efficient storage and high-speed data transmission over the wire. Confluent provides Serdes for schemas in Avro, Protobuf, and JSON Schema formats.
- serializer¶
A tool that converts objects and parallel data into a serial byte stream. Serializers work with deserializers (known together as Serdes) to support efficient storage and high-speed data transmission over the wire. Confluent provides serializers for schemas in Avro, Protobuf, and JSON Schema formats.
- service account¶
A non-person entity used by an application or service to access resources and perform operations.
Because a service account is an identity independent of the user who created it, it can be used programmatically to authenticate to resources and perform operations without the need for a user to be signed in.
- service quota¶
The limit, or maximum value, for a specific Confluent Cloud resource or operation that might vary by the resource scope it applies to.
- single message transform (SMT)¶
- A transformation or operation applied in real-time on an individual message that changes the values, keys, or headers of a message before being sent to a sink connector or after being read from a source connector. SMTs are convenient for inserting fields, masking information, event routing, and other minor data adjustments.
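The kind of per-message change an SMT makes, such as masking a field, can be illustrated with a plain function. This is a Python sketch of the concept only, not the Kafka Connect transformation API. The record layout and the mask_field name are assumptions.

```python
def mask_field(record, field_name, replacement="****"):
    """Return a copy of the record with one field of its value masked."""
    masked_value = dict(record["value"])
    if field_name in masked_value:
        masked_value[field_name] = replacement
    return {**record, "value": masked_value}


# Hypothetical record with a sensitive field in its value.
record = {
    "key": "user-42",
    "value": {"name": "Ada", "card_number": "4111111111111111"},
    "headers": {},
}

masked = mask_field(record, "card_number")
print(masked["value"])
```

The transform touches one message at a time and leaves the original record unchanged, which matches the "minor data adjustments" scope described above.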
- sink connector¶
- A Kafka Connect connector that publishes (writes) data from a Kafka topic to an external system.
- source connector¶
- A Kafka Connect connector that subscribes to (reads) data from a source (external system), extracts the payload and schema of the data, and publishes (writes) the data to Kafka topics.
- static egress IP address¶
An IP address used by a Confluent Cloud managed connector to establish outbound connections to endpoints of external data sources and sinks over the public internet.
- Stream Designer¶
A graphical tool that lets you visually build streaming data pipelines powered by Apache Kafka.
- Stream Governance¶
A collection of tools and features that provide data governance for data in motion. These include data quality tools such as Schema Registry, schema validation, and schema linking; built-in data catalog capabilities to classify, organize, and find event streams across systems; and stream lineage to visualize complex data relationships and uncover insights with interactive, end-to-end maps of event streams.
Taken together, these and other governance tools enable teams to manage the availability, integrity, and security of data used across organizations, and help with standardization, monitoring, collaboration, reporting, and more.
- stream lineage¶
The life cycle, or history, of data, including its origins, transformations, and consumption, as it moves through various stages in data pipelines, applications, and systems.
Stream lineage provides a record of data’s journey from its source to its destination, and is used to track data quality, data governance, and data security.
- stream processing¶
The method of collecting event stream data in real-time as it arrives, transforming the data in real-time using operations (such as filters, joins, and aggregations), and publishing the results to one or more target systems.
Stream processing can be used to analyze data continuously, build data pipelines, and process time-sensitive data in real-time. Using the Confluent event streaming platform, event streams can be processed in real-time using Kafka Streams, Kafka Connect, or ksqlDB.
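The filter-and-aggregate pattern described above can be sketched with Python generators, which handle one event at a time as it arrives. This is a conceptual illustration only. It does not use Kafka Streams or ksqlDB, and the payment events are made-up sample data.

```python
from collections import defaultdict


def large_payments(events, threshold):
    """Filter: pass through only events at or above the threshold."""
    for event in events:
        if event["amount"] >= threshold:
            yield event


def running_totals(events):
    """Aggregate: emit an updated per-customer total after each event."""
    totals = defaultdict(int)
    for event in events:
        totals[event["customer"]] += event["amount"]
        yield event["customer"], totals[event["customer"]]


payments = [
    {"customer": "alice", "amount": 120},
    {"customer": "bob", "amount": 15},
    {"customer": "alice", "amount": 80},
    {"customer": "bob", "amount": 200},
]

# Each event flows through the filter and the aggregation as soon as it is read.
results = list(running_totals(large_payments(iter(payments), threshold=50)))
print(results)
```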
- Streams API¶
The Kafka API that allows you to build streaming applications and microservices that transform (for example, filter, group, aggregate, join) incoming event streams in real-time to Kafka topics stored in a Kafka cluster.
The Streams API is used by stream processing clients to process data in real-time, analyze data continuously, and build data pipelines.
- under replication¶
The condition in which the number of in-sync replicas is below the total number of replicas.
Under-replicated partitions can occur when a broker is down or cannot replicate fast enough from the leader (replica fetcher lag).
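The definition translates directly into a comparison between the in-sync replica set and the full replica set. The partition metadata below is hypothetical sample data shaped for this sketch, not the output format of any Kafka tool.

```python
def under_replicated(partitions):
    """Return the partitions whose in-sync replica set is smaller than the replica set."""
    return [
        name
        for name, info in partitions.items()
        if len(info["isr"]) < len(info["replicas"])
    ]


# Broker IDs per partition: assigned replicas and currently in-sync replicas (ISR).
partitions = {
    "orders-0": {"replicas": [1, 2, 3], "isr": [1, 2, 3]},
    "orders-1": {"replicas": [2, 3, 1], "isr": [2]},
    "orders-2": {"replicas": [3, 1, 2], "isr": [3, 1]},
}

flagged = under_replicated(partitions)
print(flagged)
```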
- user account¶
An account representing the identity of a person who can be authenticated and granted access to Confluent Cloud resources.