Kafka Connect on Confluent Cloud

Kafka Connect, an open source component of Apache Kafka, is a framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems.

With Kafka Connect you can use existing connector implementations for common data sources and sinks to move data into and out of Kafka.

Source Connector
A source connector ingests entire databases and streams table updates to Kafka topics. It can also collect metrics from all of your application servers into Kafka topics, making the data available for stream processing with low latency.
Sink Connector
A sink connector delivers data from Kafka topics into secondary indexes such as Elasticsearch or batch systems such as Hadoop for offline analysis.

Kafka Connect is focused on streaming data to and from Kafka. This focus makes it simpler for you to write high quality, reliable, and high performance connector plugins, and it allows the framework to make guarantees that are difficult to achieve with other frameworks. Kafka Connect is an integral component of an ETL pipeline when combined with Kafka and a stream processing framework.

Kafka Connect can run either as a standalone process for running jobs on a single machine (e.g., log collection), or as a distributed, scalable, fault tolerant service supporting an entire organization. This allows it to scale down to development, testing, and small production deployments with a low barrier to entry and low operational overhead, and to scale up to support a large organization's data pipeline.
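
To illustrate the difference between the two modes, the following is a minimal sketch of self-managed Kafka Connect worker configurations (Confluent Cloud runs and manages the workers for you, so this is background only). The property names are standard Kafka Connect worker settings; the host names, file path, and topic names are placeholders.

    # Standalone mode: a single process, offsets kept in a local file
    bootstrap.servers=localhost:9092
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    offset.storage.file.filename=/tmp/connect.offsets

    # Distributed mode: workers join a group and keep state in Kafka topics,
    # which provides the scalability and fault tolerance described above
    bootstrap.servers=kafka-1:9092,kafka-2:9092
    group.id=connect-cluster
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    offset.storage.topic=connect-offsets
    config.storage.topic=connect-configs
    status.storage.topic=connect-status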

The main benefits of using Kafka Connect are:

  • Data Centric Pipeline -- use meaningful data abstractions to pull or push data to Kafka.
  • Flexibility and Scalability -- run with streaming and batch-oriented systems on a single node or scaled to an organization-wide service.
  • Reusability and Extensibility -- leverage existing connectors or extend them to tailor to your needs and lower time to production.

Confluent Cloud S3 Connector (Sink)

Important

This connector is a preview connector for Confluent Cloud. Preview connectors are not currently supported and are not recommended for production use. For specific connector limitations, see Confluent Cloud Connect Preview.

You can use the S3 connector, currently available as a sink, to export data from Kafka topics to S3 objects in either Avro or JSON format. Depending on your environment, the S3 connector can export data with exactly-once delivery semantics to consumers of the S3 objects it produces.

The S3 sink connector periodically polls data from Kafka and in turn uploads it to S3. A partitioner is used to split the data of every Kafka partition into chunks. Each chunk of data is represented as an S3 object. The key name encodes the topic, the Kafka partition, and the start offset of this data chunk. If no partitioner is specified in the configuration, the default partitioner which preserves Kafka partitioning is used. The size of each data chunk is determined by the number of records written to S3 and by schema compatibility.
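
As an illustration, the default partitioner in the self-managed S3 connector produces object keys like the following (the topic name pageviews, the topics/ prefix, and the offsets are examples, and the exact layout used by the Confluent Cloud preview connector may differ):

    topics/pageviews/partition=0/pageviews+0+0000000000.avro
    topics/pageviews/partition=0/pageviews+0+0000001000.avro
    topics/pageviews/partition=1/pageviews+1+0000000000.avro

Each key encodes the topic, the Kafka partition, and the starting offset of the chunk, so Kafka partitioning is preserved in the bucket.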

Features

The preview Confluent Cloud S3 connector provides the following features:

  • Exactly Once Delivery: Records that are exported using a deterministic partitioner are delivered with exactly-once semantics regardless of the eventual consistency of S3.
  • Data Format with or without Schema: Out of the box, the connector supports writing data to S3 in Avro, JSON, and Bytes. Schema validation is disabled for JSON.
  • Schema Evolution: schema.compatibility is set to NONE.
  • Partitioner: The connector supports the TimeBasedPartitioner class based on the Kafka record timestamp. flush.size is set to 1000. (These settings are illustrated in the configuration sketch below.)
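
For reference, the fixed settings above correspond to the following connector properties as they appear in the self-managed Kafka Connect S3 sink connector. This is an illustrative sketch; the bucket, region, and topic names are placeholders, and in Confluent Cloud these options are preset or exposed through the UI rather than edited directly.

    # Illustrative S3 sink configuration using self-managed property names
    name=s3-sink
    connector.class=io.confluent.connect.s3.S3SinkConnector
    topics=pageviews
    s3.bucket.name=my-s3-bucket
    s3.region=us-west-2
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.avro.AvroFormat
    # One S3 object is written per 1000 records (fixed in the Cloud preview)
    flush.size=1000
    # No schema evolution checks
    schema.compatibility=NONE
    # Time-based partitioning on the Kafka record timestamp, one folder per hour
    partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
    partition.duration.ms=3600000
    path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
    locale=en-US
    timezone=UTC
    timestamp.extractor=Record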

Quick Start

This quick start shows you how to get up and running with the Confluent Cloud S3 sink connector. It covers the basics of selecting the connector and configuring it to stream events to an Amazon Web Services (AWS) S3 bucket.

Prerequisites

Step 1: Launch your Confluent Cloud cluster.

See the Confluent Cloud Quick Start for installation instructions.

Step 2: Add a connector.

Click Connectors > Add connector.

[Image: Connectors page showing Add connector]

Step 3: Select the Amazon S3 Data Sink.

[Image: Selecting the Amazon S3 Data Sink]

Step 4: Enter the connector details.

Complete the following and click Continue.

  1. Enter a Connector Name.
  2. Enter your Kafka Cluster credentials. These are the API key and secret you create when configuring the Confluent Cloud CLI (an example command is shown below the screenshot).
  3. Enter one or more Topic names. Separate multiple topic names with a comma.

[Image: Connector details form]
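
If you still need to create the cluster API key and secret, you can do so with the Confluent Cloud CLI. The command below is a sketch; the exact flag names depend on your CLI version, so treat the --resource flag as an assumption and check the CLI help.

    # Create an API key and secret for your Kafka cluster
    # (<cluster-id> is a placeholder; flag names may vary by CLI version)
    ccloud api-key create --resource <cluster-id>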

Step 5: Enter the destination details.

Complete the following and click Continue.

Important

Your AWS credentials and bucket name are validated here. Make sure you enter these correctly.

  1. Enter your AWS credentials.
  2. Enter the S3 bucket name.
  3. Select the message format. Note that Confluent Cloud Schema Registry must be configured if you use a schema-based message format (like Avro). See Managing Schemas for Topics in Confluent Cloud.
  4. Select the Time interval that sets how you want your messages grouped in the S3 bucket. For example, if you select Hourly, messages are grouped into folders for each hour that data is streamed to the bucket (see the example layout below the screenshot).
  5. Enter the number of tasks in use by the connector. Do not enter a value that exceeds the Max number displayed.

[Image: Destination details form]
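
For example, with an Hourly time interval the connector writes objects into hour-level folders. The layout below is illustrative; the topic name and prefix are placeholders, and the exact path format used by Confluent Cloud may differ.

    topics/pageviews/year=2019/month=09/day=15/hour=13/pageviews+0+0000000000.avro
    topics/pageviews/year=2019/month=09/day=15/hour=14/pageviews+0+0000001000.avro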

Step 6: Launch the connector.

Verify the following and click Launch.

  1. Make sure your data is going to the correct bucket.
  2. Check that the last directory in the path shown matches the Time interval you selected earlier.

[Image: Launch connector summary]

Step 7: Check the connector status.

Click the dots next to the connector name and select Status.

[Image: Connector status menu]

The connector should show that it is successfully streaming events to your AWS S3 bucket.

[Image: Connector status details]

Step 8: Check the S3 bucket.

  1. Go to the AWS Management Console and select Storage > S3.
  2. Open your S3 bucket.
  3. Open your topic folder and each subsequent folder until you see your messages displayed.

[Image: S3 bucket contents in the AWS Management Console]
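
If you prefer the command line, you can also list what the connector has written with the AWS CLI. The bucket name and topics/ prefix below are placeholders for your own values.

    # Recursively list the objects the connector has written under the topic folder
    aws s3 ls s3://my-s3-bucket/topics/ --recursive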

For additional information about the S3 connector, see Kafka Connect S3. Note that not all Confluent Platform S3 connector features are available in the Confluent Cloud preview S3 connector.

Next Steps

Try out the Confluent Cloud Demo using the Confluent Cloud S3 sink connector and an AWS S3 bucket.

This demo consists of setting up two Kafka clusters (one local and one in Confluent Cloud) and using two instances of the kafka-connect-datagen source connector to produce mock data. One Confluent Replicator instance is used to copy data from the local cluster to the Confluent Cloud cluster. Kafka data is written to the cluster in Avro format.