Kafka Connect on Confluent Cloud
Kafka Connect, an open source component of Apache Kafka, is a framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems.
With Kafka Connect, you can use existing connector implementations for common data sources and sinks to move data into and out of Kafka.
- Source Connector: A source connector ingests entire databases and streams table updates to Kafka topics. It can also collect metrics from all of your application servers into Kafka topics, making the data available for stream processing with low latency.
- Sink Connector: A sink connector delivers data from Kafka topics into secondary indexes such as Elasticsearch or batch systems such as Hadoop for offline analysis.
Kafka Connect is focused on streaming data to and from Kafka, which makes it simpler for you to write high-quality, reliable, and high-performance connector plugins. This narrow focus also enables the framework to make guarantees that are difficult to achieve using other frameworks. Kafka Connect is an integral component of an ETL pipeline when combined with Kafka and a stream processing framework.
Kafka Connect can run either as a standalone process for running jobs on a single machine (for example, log collection), or as a distributed, scalable, fault-tolerant service supporting an entire organization. This allows it to scale down to development, testing, and small production deployments with a low barrier to entry and low operational overhead, and to scale up to support a large organization's data pipeline.
The main benefits of using Kafka Connect are:
- Data Centric Pipeline -- use meaningful data abstractions to pull or push data to Kafka.
- Flexibility and Scalability -- run with streaming and batch-oriented systems on a single node or scaled to an organization-wide service.
- Reusability and Extensibility -- leverage existing connectors or extend them to tailor to your needs and lower time to production.
Confluent Cloud S3 Connector (Sink)
This connector is a preview connector for Confluent Cloud. Preview connectors are not currently supported and are not recommended for production use. For specific connector limitations, see Confluent Cloud Connect Preview.
You can use the S3 connector, currently available as a sink, to export data from Kafka topics to S3 objects in either Avro or JSON formats. Depending on your environment, the S3 connector can export data by guaranteeing exactly-once delivery semantics to consumers of the S3 objects it produces.
The S3 sink connector periodically polls data from Kafka and uploads it to S3. A partitioner splits the data of every Kafka partition into chunks, and each chunk is stored as one S3 object. The object's key name encodes the topic, the Kafka partition, and the start offset of the chunk. If no partitioner is specified in the configuration, the connector uses the default partitioner, which preserves Kafka partitioning. The size of each data chunk is determined by the number of records written to S3 and by schema compatibility.
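To make the key naming concrete, here is a minimal sketch of how a key could encode the topic, the Kafka partition, and the start offset. The exact layout (a `topics` prefix, a `partition=<n>` folder, `+` separators, and a zero-padded offset) is an assumption for illustration, and `s3_object_key` is a hypothetical helper, not part of the connector.

```python
def s3_object_key(topic: str, partition: int, start_offset: int,
                  extension: str = "avro", topics_dir: str = "topics") -> str:
    """Build an object key that encodes topic, Kafka partition, and start offset.

    Assumed layout (for illustration only):
    <topics_dir>/<topic>/partition=<n>/<topic>+<n>+<start offset>.<extension>
    """
    directory = f"{topics_dir}/{topic}/partition={partition}"
    # Zero-padding keeps keys in offset order when they are sorted as strings.
    file_name = f"{topic}+{partition}+{start_offset:010d}.{extension}"
    return f"{directory}/{file_name}"


# A chunk of topic "pageviews", partition 3, starting at offset 2000.
print(s3_object_key("pageviews", 3, 2000))
# prints: topics/pageviews/partition=3/pageviews+3+0000002000.avro
```

Because the key depends only on the topic, partition, and start offset, the same chunk always maps to the same object name.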
The preview Confluent Cloud S3 connector provides the following features:
- Exactly Once Delivery: Records that are exported using a deterministic partitioner are delivered with exactly-once semantics regardless of the eventual consistency of S3.
- Data Format with or without Schema: Out of the box, the connector supports writing data to S3 in Avro, JSON, and Bytes. Schema validation is disabled for JSON.
- Schema Evolution: schema.compatibility is set to
- Partitioner: The connector supports the TimeBasedPartitioner class based on the Kafka class
- flush.size is set to 1000.
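The link between deterministic partitioning and exactly-once delivery can be illustrated with a small simulation. This is a simplified model under stated assumptions, not the connector's implementation: a Python dict stands in for the S3 bucket, `full_chunks` is a hypothetical helper, the key format is invented for the example, and only full chunks of 1000 records are written.

```python
FLUSH_SIZE = 1000  # the flush.size value stated in the feature list


def full_chunks(topic, partition, offsets, flush_size=FLUSH_SIZE):
    """Group consecutive offsets into full chunks keyed by their start offset."""
    chunks = {}
    full_count = len(offsets) // flush_size  # leftover records wait for more data
    for i in range(full_count):
        chunk = offsets[i * flush_size:(i + 1) * flush_size]
        # Hypothetical key format: the same chunk always yields the same key.
        key = f"{topic}+{partition}+{chunk[0]:010d}"
        chunks[key] = chunk
    return chunks


bucket = {}                   # stand-in for the S3 bucket: key -> stored records
offsets = list(range(2500))   # offsets 0..2499 of one Kafka partition

bucket.update(full_chunks("orders", 0, offsets))
# Simulate a task restart that re-reads the partition from offset 0.
bucket.update(full_chunks("orders", 0, offsets))

stored = sum(len(chunk) for chunk in bucket.values())
print(sorted(bucket), stored)
# prints: ['orders+0+0000000000', 'orders+0+0000001000'] 2000
```

The retry overwrites the two existing objects instead of adding new ones, so a reader of the bucket sees each record once. A partitioner whose output depended on wall-clock time or arrival order would not give this property.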
This quick start shows you how to get up and running with the Confluent Cloud S3 connector (sink). This quick start provides the basics of selecting the connector and configuring it to stream events to an Amazon Web Services (AWS) S3 bucket.
- Authorized access to a Confluent Cloud cluster on AWS.
- An accessible AWS S3 bucket in the same region as your Confluent Cloud cluster.
- The Confluent Cloud CLI installed and configured for the cluster. See Install and Configure the Confluent Cloud CLI.
- An AWS account configured with Access Keys. You use these access keys when setting up the connector.
- (Optional) Confluent Cloud Schema Registry enabled for your cluster, if you are using a messaging schema (like Apache Avro). See Managing Schemas for Topics in Confluent Cloud.
Step 1: Launch your Confluent Cloud cluster.
See the Confluent Cloud Quick Start for installation instructions.
Step 2: Add a connector.
Click Connectors > Add connector.
Step 3: Select the Amazon S3 Data Sink.
Step 4: Enter the connector details.
Complete the following and click Continue.
- Enter a Connector Name.
- Enter your Kafka Cluster credentials. You create these keys when configuring the Confluent Cloud CLI.
- Enter one or more Topic names. Separate multiple topic names with a comma.
Step 5: Enter the destination details.
Complete the following and click Continue.
Your AWS credentials and bucket name are validated here. Make sure you enter these correctly.
- Enter your AWS credentials.
- Enter the S3 bucket name.
- Select the message format. You must have Confluent Cloud Schema Registry configured if you use a schema-based message format (like Avro). See Managing Schemas for Topics in Confluent Cloud.
- Select the Time interval that sets how you want your messages grouped in the S3 bucket. For example, if you select Hourly, messages are grouped into folders for each hour data is streamed to the bucket.
- Enter the number of tasks in use by the connector. Do not enter a value that exceeds the Max number displayed.
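The effect of the Time interval setting can be sketched as a mapping from a record timestamp to a folder path. The folder naming below (`year=.../month=.../day=.../hour=...` under a `topics` prefix) is an assumption for illustration; check the path shown in the Confluent Cloud UI for the actual layout. `time_folder` and `PATH_FORMATS` are hypothetical names.

```python
from datetime import datetime, timezone

# Hypothetical folder formats for the two intervals (assumed, Hive-style names).
PATH_FORMATS = {
    "Hourly": "year=%Y/month=%m/day=%d/hour=%H",
    "Daily": "year=%Y/month=%m/day=%d",
}


def time_folder(topic, timestamp_ms, interval="Hourly", topics_dir="topics"):
    """Return the folder a record with this timestamp would be grouped into."""
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{topics_dir}/{topic}/{ts.strftime(PATH_FORMATS[interval])}"


one_am = 1_546_304_400_000  # 2019-01-01 01:00:00 UTC, in milliseconds
print(time_folder("pageviews", one_am))
# prints: topics/pageviews/year=2019/month=01/day=01/hour=01
print(time_folder("pageviews", one_am, interval="Daily"))
# prints: topics/pageviews/year=2019/month=01/day=01
```

With Hourly selected, two records written 59 minutes apart within the same clock hour land in the same folder; with Daily, the hour level is dropped.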
Step 6: Launch the connector.
Verify the following and click Launch.
- Make sure your data is going to the correct bucket.
- Check that the last directory in the path shown is using the Time Interval you entered earlier.
Step 7: Check the connector status.
Click the dots next to the connector name and select Status.
The connector should show that it is successfully streaming events to your AWS S3 bucket.
Step 8: Check the S3 bucket.
- Go to the AWS Management Console and select Storage > S3.
- Open your S3 bucket.
- Open your topic folder and each subsequent folder until you see your messages displayed.
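The same check can be scripted instead of clicking through the console. The following is a minimal sketch that assumes the objects live under a `topics/<topic>/` prefix (an assumption, verify it against your bucket) and that you pass in a boto3 S3 client created with your own credentials. `list_topic_objects` is a hypothetical helper.

```python
def list_topic_objects(s3_client, bucket, topic, topics_dir="topics"):
    """Return the keys of all objects stored under the topic's folder.

    s3_client is expected to be a boto3 S3 client, for example
    boto3.client("s3"). The "topics/<topic>/" prefix is an assumption.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{topics_dir}/{topic}/"):
        # "Contents" is missing from a page when no object matches the prefix.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Usage (requires the boto3 package and AWS credentials; names are placeholders):
#   import boto3
#   for key in list_topic_objects(boto3.client("s3"), "my-bucket", "pageviews"):
#       print(key)
```

An empty result shortly after launch does not necessarily indicate a problem, since objects only appear once the connector has written a chunk.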
For additional information about the S3 connector see Kafka Connect S3. Note that not all Confluent Platform S3 connector features are provided in the Confluent Cloud preview S3 connector.
Try out the Confluent Cloud Demo using the Confluent Cloud S3 sink connector and an AWS S3 bucket.
This demo consists of setting up two Kafka clusters (one local and one Confluent Cloud)
and using two instances of the
kafka-connect-datagen source connector to
produce mock data. One Confluent Replicator instance is used to write data from the local
cluster to the Confluent Cloud cluster. Kafka data is written to the cluster in Avro