New to Confluent or looking for definitions? The terms below provide brief explanations and links to related content for important terms you’ll encounter when working with the Confluent event streaming platform.
- Admin API¶
- The Apache Kafka® REST API that enables administrators to manage and monitor Kafka clusters, topics, brokers, and other Kafka components.
- Ansible Playbooks for Confluent Platform¶
- A set of Ansible playbooks and roles that are designed to automate the deployment and management of Confluent Platform.
- Apache Kafka¶
An open source event streaming platform that provides unified, high-throughput, low-latency, fault-tolerant, scalable, distributed, and secure data streaming.
Kafka is a publish-and-subscribe messaging system that enables distributed applications to ingest, process, and share data in real-time.
- audit log¶
A historical record of actions and operations that are triggered when an auditable event occurs.
Audit log records can be used to troubleshoot system issues, manage security, and monitor compliance by tracking administrative activity and data access and modification, monitoring sign-in attempts, and reconstructing security breaches and fraudulent activity.
- auditable event¶
An event that represents an action or operation that can be tracked and monitored for security purposes and compliance.
When an auditable event occurs, an auditable event method is triggered and an event message is sent to the audit log cluster and stored as an audit log record.
- authentication¶
- The process of verifying the identity of a principal (user or service account) that interacts with a system or application.
- authorization¶
- The process of evaluating and then granting or denying a principal (user or service account) permissions required to access and perform specific actions or operations on specific resources.
- Avro¶
A data serialization and exchange framework that provides data structures, remote procedure call (RPC), a compact binary data format, and a container file, and that uses JSON to represent schemas.
Avro schemas ensure that every field is properly described and documented for use with serializers and deserializers. You can either send a schema with every message or use Schema Registry to store and retrieve schemas for use by consumers and producers to save bandwidth and storage space.
- batch processing¶
The method of collecting a large volume of data over a specific time interval, after which the data is processed all at once and loaded into a destination system.
Batch processing is often used when processing data can occur independently of the source and timing of the data. It is efficient for non-real-time data processing, such as data warehousing, reporting, and analytics.
- CIDR block¶
A group of IP addresses that are contiguous and can be represented as a single block. CIDR blocks are expressed using Classless Inter-domain Routing (CIDR) notation that includes an IP address and a number of bits in the network mask.
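As a small illustration of CIDR notation, the sketch below uses Python's standard `ipaddress` module to inspect a block. The block `10.0.0.0/24` and the test address are arbitrary example values, not addresses used by Confluent.

```python
import ipaddress

# 10.0.0.0/24 is an example block: a 24-bit network mask leaves 8 host bits.
block = ipaddress.ip_network("10.0.0.0/24")

print(block.network_address)    # first address in the block
print(block.broadcast_address)  # last address in the block
print(block.num_addresses)      # how many contiguous addresses the block covers

# Membership test: does an address fall inside the block?
print(ipaddress.ip_address("10.0.0.42") in block)
```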
- Cluster Linking¶
A highly performant data replication feature that enables links between Kafka clusters to mirror data from one cluster to another. Cluster Linking creates perfect copies of Kafka topics, which keep data in sync across clusters. Use cases include geo-replication of data, data sharing, migration, disaster recovery, and tiered separation of critical applications.
- commit log¶
A log of all event messages about commits (changes or operations made) sent to a Kafka topic.
A commit log ensures that all event messages are processed at least once and provides a mechanism for recovery in the event of a failure.
The commit log is also referred to as a write-ahead log (WAL) or a transaction log.
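The append-only idea behind a commit log can be sketched in a few lines. This is a toy in-memory model, not how Kafka stores data; the class name `AppendOnlyLog` and its methods are hypothetical.

```python
class AppendOnlyLog:
    """Toy append-only log: records are only added at the end, never changed."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # The position of the new record in the log acts as its offset.
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        # After a failure, a reader can resume from the last offset it handled.
        return self._records[offset:]


log = AppendOnlyLog()
first = log.append("order created")
second = log.append("order paid")
third = log.append("order shipped")
replayed = log.read_from(second)
```

Because existing entries are never modified, replaying the log from a known offset reproduces the same sequence of records, which is what makes recovery possible.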
- Confluent Cloud¶
A fully managed, cloud-native event streaming service powered by Apache Kafka® technology that removes the complexity of running and managing Kafka clusters and provides a unified, high-throughput, low-latency, fault-tolerant, scalable, distributed, and secure platform.
- Confluent Cloud Overview
- Confluent Cloud <https://www.confluent.io/confluent-cloud/>
- Confluent Cloud network¶
An abstraction for a single tenant network environment that hosts Dedicated Kafka clusters in Confluent Cloud along with their single tenant services, like ksqlDB clusters and managed connectors.
- Confluent for Kubernetes (CFK)¶
- A cloud-native control plane for deploying and managing Confluent in private cloud environments through a declarative API.
- Confluent Platform¶
- A specialized distribution of Apache Kafka® at its core, with additional components for data integration, streaming data pipelines, and stream processing.
- Confluent Unit for Kafka (CKU)¶
A unit of horizontal scaling for Dedicated Kafka clusters in Confluent Cloud that provides preallocated resources.
CKUs determine the capacity of a Dedicated Kafka cluster in Confluent Cloud.
- Connect API¶
- The Kafka API that enables a connector to read event streams from a source system and write to a target system.
- Connect worker¶
A server process that runs a connector and performs the actual work of moving data in and out of Kafka topics.
A worker is a server process that runs on hardware independent of the Kafka brokers themselves. It is scalable and fault-tolerant, meaning you can run a cluster of workers that share the load of moving data in and out of Kafka from and to external systems.
- connector¶
- An abstract mechanism that enables communication, coordination, or cooperation among components by transferring data elements from one interface to another without changing the data.
- consumer¶
A client application that subscribes to (reads and processes) event data from a Kafka topic.
The Streams API and the Consumer API are the two APIs that enable consumers to read event streams from Kafka topics.
- Consumer API¶
The Kafka API that is used for consuming (reading) event messages or records from Kafka topics and that enables a Kafka consumer to subscribe to a topic and read event messages as they arrive.
Batch processing is a common use case for the Consumer API.
- consumer group¶
A collection of one or more consumers that work together to consume event messages from a topic.
By dividing the partitions of a topic among the consumers in the group, the consumers can process messages in parallel, increasing message throughput and enabling load balancing.
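How partitions might be shared out within a consumer group can be sketched as a simple round-robin assignment. This is only an illustration: the function `assign_round_robin` and the consumer names are hypothetical, and Kafka's built-in assignment strategies are more involved than this.

```python
def assign_round_robin(partitions, consumers):
    """Spread partition ids over consumer names in round-robin order."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Six partitions shared by three consumers: each consumer owns two partitions.
assignment = assign_round_robin(range(6), ["consumer-a", "consumer-b", "consumer-c"])
print(assignment)
```

Each partition has exactly one owner in the group, so the consumers can read in parallel without processing the same partition twice.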
- consumer lag¶
The number of offsets between the latest message produced in a partition and the last message consumed by a consumer, that is, the number of messages waiting to be consumed from a particular partition.
A large consumer lag, or a quickly growing lag, indicates that the consumer is unable to read from a partition as fast as the messages are available. This can be caused by a slow consumer, slow network, or slow broker.
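The lag arithmetic can be sketched as follows. The sketch assumes that the log end offset is the offset the next produced message will receive and that the committed offset is the next offset the consumer will read; the function names are hypothetical.

```python
def consumer_lag(log_end_offset, committed_offset):
    """Messages in one partition that the consumer has not yet consumed."""
    return max(log_end_offset - committed_offset, 0)


def total_lag(end_offsets, committed_offsets):
    """Sum the lag over partitions; a partition with no commit counts from offset 0."""
    return sum(
        consumer_lag(end, committed_offsets.get(partition, 0))
        for partition, end in end_offsets.items()
    )


# Partition 0 has 120 messages written and 100 consumed; partition 1 has no commit yet.
print(consumer_lag(120, 100))
print(total_lag({0: 120, 1: 50}, {0: 100}))
```

Tracking this value over time is usually more informative than a single reading, because a steadily growing lag is the sign of a consumer that cannot keep up.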
- consumer offset¶
The monotonically increasing integer value that uniquely identifies the position of an event record in a partition.
When a consumer acknowledges receiving and processing a message, it commits an offset value that is stored in a special internal topic.
- CRUD¶
- An acronym for the four basic operations that can be performed on data: Create, Read, Update, and Delete.
- data at rest¶
- Data that is physically stored on non-volatile media (such as hard drives, solid-state drives, or other storage devices) and is not actively being transmitted or processed by a system.
- data in motion¶
Data that is actively being transferred between source and destination, typically systems, devices, or networks.
Data in motion is also referred to as data in transit or data in flight.
- data in use¶
- Data that is actively being processed or manipulated in memory (RAM, CPU caches, or CPU registers).
- data ingestion¶
- The process of collecting, importing, and integrating data from various sources into a system for further processing, analysis, or storage.
- data mapping¶
The process of defining relationships or associations between source data elements and target data elements.
Data mapping is an important process in data integration, data migration, and data transformation, ensuring that data is accurately and consistently represented when it is moved or combined.
- data pipeline¶
A series of processes and systems that enable the flow of data from sources to destinations, automating the movement and transformation of data for various purposes, such as analytics, reporting, or machine learning.
A data pipeline is typically composed of a source system, a data ingestion tool, a data transformation tool, and a target system. A data pipeline covers the following stages: data extraction, data transformation, data loading, and data validation.
- data serialization¶
The process of converting data structures or objects into a format that can be stored or transmitted, and reconstructed later in the same or another computer environment.
Data serialization is a common technique for implementing data persistence, interprocess communication, and object communication. Confluent Schema Registry (in Confluent Platform) and Confluent Cloud Schema Registry support data serialization using serializers and deserializers for the following formats: Avro, JSON Schema, and Protobuf.
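A minimal round trip shows what serialization and deserialization mean in practice. The sketch uses plain JSON from the Python standard library rather than the Schema Registry serializers, and the `order` record is made-up example data.

```python
import json


def serialize(record):
    """Convert a Python dict into bytes that can be stored or transmitted."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload):
    """Reconstruct the original dict from the transmitted bytes."""
    return json.loads(payload.decode("utf-8"))


order = {"order_id": 1001, "amount": 25.5, "currency": "USD"}
payload = serialize(order)
restored = deserialize(payload)
print(payload)
print(restored == order)
```

Plain JSON carries no schema, so nothing in this sketch stops a producer from changing a field name. Schema-based formats such as Avro, JSON Schema, and Protobuf exist to catch that kind of mismatch.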
- data steward¶
- A person with data-related responsibilities, such as data governance, data quality, and data security.
- data stream¶
- A continuous flow of data records that are produced and consumed by applications.
- egress static IP address¶
An IP address used by a Confluent Cloud managed connector to establish outbound connections to endpoints of external data sources and sinks over the public internet.
- ELT¶
An acronym for Extract-Load-Transform, where data is extracted from a source system and loaded into a target system before processing or transformation.
Compared to ETL, ELT is a more flexible approach to data ingestion because the data is loaded into the target system before transformation.
- ETL¶
An acronym for Extract-Transform-Load, where data is extracted from a source system, transformed into a target format, and loaded into a target system.
Compared to ELT, ETL is a more rigid approach to data ingestion because the data is transformed before loading into the target system.
- event¶
A meaningful action or occurrence of something that happened.
Events that can be recognized by a program, either human-generated or triggered by software, can be recorded in a log file or other data store.
- event message¶
A record of an event that is sent to a Kafka topic, represented as a key-value pair.
Each event message consists of a key-value pair, a timestamp, the compression type, headers for metadata (optional), and a partition and offset ID (once the message is written). The key is optional and can be used to identify the event. The value is required and contains details about the event that happened.
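The parts of an event message listed above can be pictured as a simple data structure. The `EventMessage` class below is a hypothetical illustration, not a type from any Kafka client library, and the field names are assumptions.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass
class EventMessage:
    value: bytes                            # required: details about the event
    key: Optional[bytes] = None             # optional: identifies the event
    timestamp_ms: int = 0                   # when the event was created
    compression_type: Optional[str] = None  # for example "gzip", if compressed
    headers: dict = field(default_factory=dict)  # optional metadata
    partition: Optional[int] = None         # known only after the message is written
    offset: Optional[int] = None            # known only after the message is written


# Before the write, the partition and offset are still unknown.
message = EventMessage(
    value=b'{"order_id": 1001}',
    key=b"customer-42",
    timestamp_ms=1700000000000,
)

# After the write, the message has a position (example values).
stored = replace(message, partition=0, offset=17)
print(stored)
```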
- event record¶
The record of the event that is stored in a Kafka topic.
Event records are organized and durably stored in topics. Examples of events include orders, payments, activities, or measurements. An event typically contains one or more data fields that describe the fact, as well as a timestamp that denotes when the event was created by its event source. The event may also contain various metadata, such as its source of origin (for example, the application or cloud service that created the event) and storage-level information (for example, its position in the event stream).
- event sink¶
- A consumer of events, which can include applications, cloud services, databases, IoT sensors, and more.
- event source¶
- A producer of events, which can include cloud services, databases, IoT sensors, mainframes, and more.
- event stream¶
- A continuous flow of event messages produced by an event source and consumed by one or more consumers.
- event streaming¶
The practice of capturing event data in real-time from data sources.
Event streaming is a form of data streaming that is used to capture, store, process, and react to data in real-time or retrospectively.
- event streaming platform¶
- A platform that events can be written to once, allowing distributed functions within an organization to react in real time.
- internal topic¶
A topic, prefixed with double underscores (“__”), that is automatically created by a Kafka component to store metadata about the broker, partition assignment, consumer offsets, and other information.
Internal topic examples:
- JSON Schema¶
A declarative language used for data serialization and exchange to define data structures, specify formats, and validate JSON documents. It is a way to encode expected data types, properties, and constraints to ensure that all fields are properly described for use with serializers and deserializers.
- Kafka bootstrap server¶
A Kafka broker that a Kafka client connects to first when initiating a connection to a Kafka cluster and that returns metadata, including the addresses of all the brokers in the cluster.
Although only one bootstrap server is required to connect to a Kafka cluster, multiple brokers can be specified in a bootstrap server list to provide high availability and fault tolerance in case a broker is unavailable. In Confluent Cloud, the bootstrap server is the general cluster endpoint.
- Kafka broker¶
A server in the Kafka storage layer that stores event streams from one or more sources.
A Kafka cluster is typically comprised of several brokers. Every broker in a cluster is also a bootstrap server, meaning if you can connect to one broker in a cluster, you can connect to every broker.
- Kafka cluster¶
A group of interconnected Kafka brokers that manage and distribute real-time data streaming, processing, and storage as if they are a single system.
By distributing tasks and services across multiple Kafka brokers, the Kafka cluster improves availability, reliability, and performance.
- Kafka Connect¶
The component of Apache Kafka® that provides data integration between databases, key-value stores, search indexes, file systems, and Kafka brokers.
Kafka Connect is an ecosystem of a client application and pluggable connectors. As a client application, Connect is a server process that runs on hardware independent of the Kafka brokers themselves. It is scalable and fault-tolerant, meaning you can run a cluster of Connect workers that share the load of moving data in and out of Kafka from and to external systems. Connect also abstracts the business of coding away from the user and instead requires only JSON configuration to run.
- Confluent Cloud: Kafka Connect
- Confluent Platform: Kafka Connect
- Kafka listener¶
An endpoint that Kafka brokers bind to and use to communicate with clients.
For Kafka clusters, Kafka listeners are configured in the listeners property of the server.properties file. Advertised listeners are publicly accessible endpoints that are used by clients to connect to the Kafka cluster.
- Kafka Streams¶
A stream processing library for building streaming applications and microservices that transform (filter, group, aggregate, join, and more) incoming event streams in real-time to Kafka topics stored in a Kafka cluster.
The Streams API can be used to build applications that process data in real-time, analyze data continuously, and build data pipelines.
- Kafka topic¶
A user-defined category or feed name where event messages are stored and published by producers and subscribed to by consumers.
Each topic is a log of event messages. Topics are stored in one or more partitions, which distribute topic records across brokers in a Kafka cluster. Each partition is an ordered, immutable sequence of records that are continually appended to a topic.
- Kora¶
The cloud-native streaming data service based on Apache Kafka® technology that powers the fully managed, scalable, and elastic Confluent Cloud event streaming platform for building real-time data pipelines and streaming applications.
- ksqlDB¶
- A streaming SQL database engine purpose-built for creating stream processing applications on top of Apache Kafka®.
- logical Kafka cluster (LKC)¶
A logical entity whose topic partitions are mapped to the brokers of a physical Kafka cluster (PKC) in Confluent Cloud.
One or more LKCs might be mapped to the same PKC. If the mapping is one-to-one, the LKC is said to be mapped to a single PKC (Dedicated). If the mapping is many-to-one, the LKC is said to be mapped to a multitenant Kafka cluster (Basic or Standard).
- multi-region cluster (MRC)¶
- A single Kafka cluster that replicates data between datacenters across regional availability zones.
- partition¶
A unit of data storage that divides a topic into multiple, parallel event streams, each of which is stored on separate Kafka brokers and can be consumed independently.
Partitioning is a key concept in Kafka because it allows Kafka to scale horizontally by adding more brokers to the cluster. Partitions are also the unit of parallelism in Kafka. A topic can have one or more partitions, and each partition is an ordered, immutable sequence of event records that is continually appended to a partition log.
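One common way to spread records over partitions is to hash the record key, so that records with the same key always land in the same partition and keep their relative order. The sketch below uses CRC32 from the Python standard library purely for illustration; it is not the hash that Kafka's default partitioner uses, and `choose_partition` is a hypothetical helper.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index in the range [0, num_partitions)."""
    if num_partitions < 1:
        raise ValueError("a topic needs at least one partition")
    return zlib.crc32(key) % num_partitions


# The same key always maps to the same partition of a 6-partition topic.
for key in (b"customer-1", b"customer-2", b"customer-1"):
    print(key, choose_partition(key, 6))
```

Note that the mapping depends on the partition count, so adding partitions to a topic can change which partition a given key maps to.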
- physical Kafka cluster (PKC)¶
A Kafka cluster comprised of multiple brokers and ZooKeeper.
Each physical Kafka cluster is created on a Kubernetes cluster by the control plane. A PKC is not directly accessible by clients.
- principal¶
An entity (user, program, or application) that can be authenticated and granted permissions based on roles to access resources and perform operations.
In Confluent Cloud, principals include user and service accounts.
- private internet¶
- A closed, restricted computer network typically used by organizations to provide secure environments for managing sensitive data and resources.
- producer¶
A client application that publishes (writes) data to a topic in a Kafka cluster.
Producers write data to a topic and are the only clients that can write data to a topic. Each record written to a topic is appended to the partition of the topic that is selected by the producer.
- Producer API¶
The Kafka API that allows you to write data to a topic in a Kafka cluster.
The Producer API is used by producer clients to publish data to a topic in a Kafka cluster.
- Protobuf¶
Protocol Buffers (Protobuf) is an open-source data format used to serialize structured data for storage.
- public internet¶
- The global system of interconnected computers and networks that use TCP/IP to communicate with each other.
- rebalancing¶
The process of redistributing the partitions of a topic among the consumers of a consumer group for improved performance and scalability.
A rebalance can occur when a consumer fails to send heartbeats and is excluded from the group, voluntarily leaves the group, has its metadata updated, or joins the group.
- replication¶
- The process of creating and maintaining multiple copies (or replicas) of data across different nodes in a distributed system to increase availability, reliability, redundancy, and accessibility.
- replication factor¶
- The number of copies of a partition that are distributed across the brokers in a cluster.
- role¶
A Confluent-defined job function that is assigned a set of permissions required to perform specific actions or operations on Confluent resources. A role is bound to a principal (user or service account) and to Confluent resources.
- Confluent Cloud: Confluent Cloud RBAC roles
- Confluent Platform: Role-Based Access Control Predefined Roles
- rolling restart¶
The process of restarting the brokers in a Kafka cluster with zero downtime by restarting one broker at a time and verifying that there are no under-replicated partitions before proceeding to the next broker.
Restarting the brokers one at a time allows for software upgrades, broker configuration updates, or cluster maintenance while maintaining high availability by avoiding downtime.
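The control flow of a rolling restart can be sketched as a loop. The callables `restart_broker`, `under_replicated_count`, and `wait` are hypothetical hooks that a real deployment would back with its own tooling; nothing here calls an actual Kafka or Confluent API.

```python
def rolling_restart(broker_ids, restart_broker, under_replicated_count, wait):
    """Restart brokers one at a time, only while the cluster is fully replicated."""
    restarted = []
    for broker_id in broker_ids:
        # Do not touch the next broker until replication has caught up.
        while under_replicated_count() > 0:
            wait()
        restart_broker(broker_id)
        restarted.append(broker_id)
    # Confirm that the cluster is healthy again after the last restart.
    while under_replicated_count() > 0:
        wait()
    return restarted
```

A production version would also need a timeout, so that a broker that never recovers does not block the loop forever.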
- schema¶
The structured definition or blueprint used to describe the format and structure of event messages sent through the Kafka event streaming platform.
Schemas are used to validate the structure of data in event messages and ensure that producers and consumers are sending and receiving data in the same format. Schemas are defined in the Schema Registry.
- Schema Registry¶
A centralized repository that stores, manages, and validates schemas for Kafka topic message data. Schema Registry is built into Confluent Cloud as a managed service, available with the “Advanced” Stream Governance package, and offered as part of Confluent Enterprise for self-managed deployments.
Schema Registry is a RESTful service that is integrated with Kafka and Connect to provide a central location for managing schemas and validating data. Producers and consumers of Kafka topics use schemas to ensure data consistency and compatibility as schemas evolve. Schema Registry is a key component of Stream Governance.
- Confluent Cloud: Manage Schemas in Confluent Cloud
- Confluent Platform: Schema Registry Overview
- service account¶
A Confluent Cloud non-human identity or principal used by an application or service to access resources or perform operations.
Because a service account is an identity independent of the user who created it, it can be used programmatically to authenticate to resources and perform operations without the need for a user to be logged in.
- service quota¶
The limit, or maximum value, for a specific Confluent Cloud resource or operation that might vary by the resource scope it applies to.
- sink connector¶
- A Kafka Connect connector that publishes (writes) data from a Kafka topic to an external system.
- source connector¶
- A Kafka Connect connector that subscribes (reads) data from a source (external system), extracts the payload and schema of the data, and publishes (writes) the data to Kafka topics.
- Stream Designer¶
A graphical tool that lets you visually build streaming data pipelines powered by Apache Kafka.
- Stream Governance¶
A collection of tools and features that provide data governance for data in motion. These include data quality tools such as Schema Registry, schema validation, and schema linking; built-in data catalog capabilities to classify, organize, and find event streams across systems; and stream lineage to visualize complex data relationships and uncover insights with interactive, end-to-end maps of event streams.
Taken together, these and other governance tools enable teams to manage the availability, integrity, and security of data used across organizations, and help with standardization, monitoring, collaboration, reporting, and more.
- stream lineage¶
The life cycle, or history, of data, including its origins, transformations, and consumption, as it moves through various stages in data pipelines, applications, and systems.
Stream lineage provides a record of data’s journey from its source to its destination, and is used to track data quality, data governance, and data security.
- stream processing¶
The method of collecting event stream data in real-time as it arrives, transforming the data in real-time using operations (such as filters, joins, and aggregations), and publishing the results to one or more target systems.
Stream processing can be used to analyze data continuously, build data pipelines, and process time-sensitive data in real-time. Using the Confluent event streaming platform, event streams can be processed in real-time using Kafka Streams, Kafka Connect, or ksqlDB.
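The filter-and-aggregate pattern described above can be sketched with a plain Python generator that handles each event as it arrives. This is a conceptual illustration only: it does not use Kafka Streams or ksqlDB, and the event fields (`customer`, `amount`) are made-up example data.

```python
def high_value_totals(events, threshold):
    """Filter events as they arrive and yield a running total per customer."""
    totals = {}
    for event in events:
        if event["amount"] < threshold:
            continue  # filter: drop small orders
        customer = event["customer"]
        totals[customer] = totals.get(customer, 0) + event["amount"]  # aggregate
        yield customer, totals[customer]  # publish the updated result


events = [
    {"customer": "a", "amount": 50},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 70},
]
results = list(high_value_totals(events, threshold=10))
print(results)
```

Unlike a batch job, the generator emits an updated result after every qualifying event instead of waiting for the whole input to be collected.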
- Streams API¶
The Kafka API that allows you to build streaming applications and microservices that transform (for example, filter, group, aggregate, join) incoming event streams in real-time to Kafka topics stored in a Kafka cluster.
The Streams API is used by stream processing clients to process data in real-time, analyze data continuously, and build data pipelines.
- under replication¶
The condition in which the number of in-sync replicas is below the total number of replicas.
Under-replicated partitions can occur when a broker is down or cannot replicate fast enough from the leader (replica fetcher lag).
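The comparison in this definition can be sketched as a small check. The dictionary layout below (partition name mapped to `replicas` and `isr` lists of broker ids) is an assumption made for the example, not a format returned by any Kafka tool.

```python
def under_replicated(partitions):
    """Return the names of partitions whose in-sync replica set is incomplete."""
    return [
        name
        for name, state in partitions.items()
        if len(state["isr"]) < len(state["replicas"])
    ]


partitions = {
    "orders-0": {"replicas": [1, 2, 3], "isr": [1, 2, 3]},
    "orders-1": {"replicas": [1, 2, 3], "isr": [1, 3]},  # broker 2 has fallen behind
}
lagging = under_replicated(partitions)
print(lagging)
```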
- user account¶
An account representing the identity of a person who can be authenticated and granted access to Confluent Cloud resources.