Kafka Connect Cluster Sizing for Confluent Platform Connectors

This guide outlines best practices for sizing Kafka Connect clusters on Confluent Platform, focusing on the most widely used Confluent self-managed connectors. It covers general sizing principles, resource recommendations, and connector-specific requirements.

Sizing a Kafka Connect cluster depends on your specific workload, data formats, and the connectors you use. This guide provides recommendations for the top connectors to help you size your Confluent Platform Connect cluster.

General sizing principles

  • Throughput and task parallelism: Size the Connect cluster so that all connectors and tasks running on the Connect workers operate stably and reliably. To achieve higher throughput, increase the number of tasks and grow the cluster so that it can support the connectors at the scale you need.

  • CPU and memory: Most Connect worker nodes use 4-8 CPU cores and 8-32 GB of RAM. Memory requirements depend on the connector type and workload, especially for connectors that buffer large transactions or handle large messages.

  • Heap size: Heap memory requirements for Connect workers typically range from 0.5-4 GB. Connectors like the S3 Sink or Oracle CDC Source may need more, especially with large transactions or LOB support.

  • Tasks per worker: A common heuristic is to run two tasks per CPU core, but this can vary based on the connector’s resource usage. A rough worker-count sketch based on this heuristic follows this list.

  • High availability: For production environments, deploy at least two Connect workers for high availability. An N+1 sizing strategy is recommended for failover.

  • Disk: Disk space is typically only required for the installation and for logging. For most connectors, 50 GB of disk space per worker is sufficient.
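
As a rough starting point, you can combine the task-per-core heuristic and the N+1 guidance above into a first estimate of cluster size. The following Python sketch is illustrative only; the two-tasks-per-core ratio, the 8-core worker shape, and the example task count are assumptions to replace with your own benchmarked numbers.

    import math

    def estimate_workers(total_tasks: int,
                         cores_per_worker: int = 8,
                         tasks_per_core: int = 2,
                         ha_spare: int = 1) -> int:
        """Rough worker-count estimate based on the heuristics in this guide.

        total_tasks      -- sum of tasks.max across all planned connectors
        cores_per_worker -- CPU cores available to each Connect worker (assumed 8)
        tasks_per_core   -- common heuristic of two tasks per core; varies by connector
        ha_spare         -- extra workers for an N+1 failover strategy
        """
        tasks_per_worker = cores_per_worker * tasks_per_core
        workers = math.ceil(total_tasks / tasks_per_worker)
        return max(workers, 2) + ha_spare  # at least two workers, plus the N+1 spare

    # Example: 60 total tasks on 8-core workers -> 4 workers + 1 spare = 5
    print(estimate_workers(60))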

Sizing for top connectors

Important

The following table provides baseline recommendations. Always:

  • Benchmark with your actual workload and adjust as needed.

  • Test your connectors with sample data and workloads, and use the system configuration that works best for your use case.

| Connector | Typical Use/Notes | CPU (per worker) | Memory (per worker) | Heap size | Special considerations |
| --- | --- | --- | --- | --- | --- |
| JDBC Source/Sink | RDBMS integration | 4-8 cores | 8-16 GB | 2-4 GB | Use one task per table for large databases; high memory is needed for large result sets. |
| Debezium Postgres Source | Change data capture | 4-8 cores | 8-16 GB | 2-4 GB | Supports only one task. |
| Elasticsearch Sink | Analytics, search | 4-8 cores | 8-16 GB | 2-4 GB | Tune batch.size, linger.ms, and max.buffered.records for throughput; large batches increase CPU usage. Monitor for Elasticsearch backpressure via task lag and flush.timeout.ms failures. |
| S3 Sink | Data lake offload | 4-8 cores | 16-32 GB | 4-8 GB | Memory is used for file buffering; a large flush.size increases memory needs. See the example configuration after this table. |
| HDFS Sink | Data lake offload | 4-8 cores | 16-32 GB | 4-8 GB | Similar to the S3 Sink; memory is used for file buffering. |
| MongoDB Source | NoSQL CDC | 4-8 cores | 8-16 GB | 2-4 GB | Memory is used to buffer events during both the snapshot and streaming phases. |
| HTTP Sink/Source | REST integration | 4-8 cores | 8-16 GB | 2-4 GB | Tune the network and batch size. |
| Amazon SQS Source | Queue integration | 4-8 cores | 8-16 GB | 2-4 GB | Batches are limited to 10 messages per poll (sqs.messages.max). Tune sqs.waittime.seconds (0-20 s) to balance latency against API cost. Increase sqs.max.retries for unstable networks. Add tasks to scale throughput; multiple tasks can poll the same queue. |
| Azure Blob Sink | Cloud storage offload | 4-8 cores | 16-32 GB | 4-8 GB | Memory is used for file buffering. |
| Salesforce Source/Sink | SaaS integration | 4-8 cores | 8-16 GB | 2-4 GB | API rate limits may affect throughput. |
| IBM MQ Source/Sink | Mainframe/legacy integration | 4-8 cores | 8-16 GB | 2-4 GB | Tune the network and batch size. |
| Google BigQuery Sink | Analytics | 4-8 cores | 8-16 GB | 2-4 GB | Tune the batch size. |
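
To show how these settings appear in practice, the following sketch registers an S3 Sink connector through the Connect REST API using Python. The REST endpoint, connector name, topic, bucket, region, and the flush.size and tasks.max values are placeholders for illustration; verify property names against the S3 Sink connector documentation for your version.

    import json
    import requests  # third-party HTTP client

    CONNECT_URL = "http://localhost:8083"  # assumed Connect REST endpoint

    # Hypothetical S3 Sink configuration; tune tasks.max, flush.size, and
    # s3.part.size against your benchmarked throughput and heap headroom.
    s3_sink_config = {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "4",                    # scale with topic partitions and worker cores
        "topics": "orders",                  # placeholder topic
        "s3.bucket.name": "example-bucket",  # placeholder bucket
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "10000",               # larger values buffer more records in heap
        "s3.part.size": "26214400",          # 25 MB multipart upload buffer
    }

    resp = requests.put(
        f"{CONNECT_URL}/connectors/s3-sink-orders/config",
        headers={"Content-Type": "application/json"},
        data=json.dumps(s3_sink_config),
    )
    resp.raise_for_status()
    print("Configured connector:", resp.json()["name"])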

Connector-specific considerations

  • Oracle CDC Source: Memory usage can increase with large transactions or LOB columns. Allocate more memory and CPU for high-throughput workloads or large table snapshots. Parallel snapshotting (one task per table partition) can also increase resource needs.

  • S3/HDFS Sink: These connectors use memory to buffer records before writing files to storage. Large flush.size or part.size settings increase memory requirements, so monitor heap usage and adjust accordingly.

  • Replicator: Sizing for Replicator depends on the number of partitions. A general guideline is one task per 300 partitions and one core per task. For example, a replication workload of 900 partitions requires at least three tasks and three CPU cores.

  • Debezium connectors: Some Debezium connectors, like the one for PostgreSQL, support only a single task. For these connectors, you can:

    • Scale vertically: Add more CPU and memory.

    • Scale horizontally: Divide the tables across multiple connector instances, each with its own replication slot, as sketched after this list.

  • JDBC Source: Each task can handle one or more tables. For large databases, use more tasks and ensure the workers have sufficient memory to handle the result sets.
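
To illustrate the horizontal-scaling option for single-task Debezium connectors, the sketch below registers two Debezium Postgres connectors that split the tables between them. The connector names, connection settings, table lists, and slot names are hypothetical; consult the Debezium documentation for the full property set required by your version.

    import json
    import requests  # third-party HTTP client

    CONNECT_URL = "http://localhost:8083"  # assumed Connect REST endpoint

    def debezium_pg_config(prefix: str, tables: str, slot: str) -> dict:
        """Minimal sketch of a Debezium Postgres config split by table list."""
        return {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "tasks.max": "1",                       # the Postgres connector supports only one task
            "database.hostname": "db.example.com",  # placeholder connection settings
            "database.port": "5432",
            "database.user": "cdc_user",
            "database.password": "********",
            "database.dbname": "appdb",
            "topic.prefix": prefix,                 # database.server.name on older Debezium versions
            "table.include.list": tables,           # tables captured by this connector instance
            "slot.name": slot,                      # each instance needs its own replication slot
        }

    connectors = {
        "pg-cdc-orders": debezium_pg_config(
            "orders_db", "public.orders,public.order_items", "cdc_orders"),
        "pg-cdc-customers": debezium_pg_config(
            "customers_db", "public.customers,public.addresses", "cdc_customers"),
    }

    for name, config in connectors.items():
        resp = requests.put(
            f"{CONNECT_URL}/connectors/{name}/config",
            headers={"Content-Type": "application/json"},
            data=json.dumps(config),
        )
        resp.raise_for_status()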

Best practices

  • Benchmarking: Always test your connectors with representative data and workloads. Use a single task to determine a baseline throughput, then scale out the number of tasks.

    Note

    • This guidance applies to scaling tasks, not to scaling the cluster.

    • A connector’s throughput may not always scale linearly with the number of tasks. For example, if one topic partition carries significantly more load than the others, then increasing the number of tasks may not improve a sink connector’s performance.

  • Monitoring: Use JMX metrics and the Connect REST API to monitor task status, throughput, and resource usage. Watch for task failures, consumer lag, and heap pressure. A status-check sketch follows this list.

  • High availability: Deploy at least two workers for high availability. For critical workloads, use an N+1 sizing model.

  • Kubernetes and cloud: When deploying on Kubernetes, set resource requests and limits for each pod. A good starting point is 4-8 CPU cores and 8-16 GB of RAM per Connect pod.
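
The monitoring guidance above can be partly automated against the Connect REST API. The sketch below lists connectors and flags failed tasks; the REST endpoint URL is an assumption, and throughput and heap metrics would still need to be collected separately through JMX.

    import requests  # third-party HTTP client

    CONNECT_URL = "http://localhost:8083"  # assumed Connect REST endpoint

    def report_failed_tasks() -> None:
        """Print connector and task states, highlighting failed tasks."""
        for name in requests.get(f"{CONNECT_URL}/connectors").json():
            status = requests.get(f"{CONNECT_URL}/connectors/{name}/status").json()
            failed = [t for t in status["tasks"] if t["state"] == "FAILED"]
            print(f"{name}: connector={status['connector']['state']}, "
                  f"tasks={len(status['tasks'])}, failed={len(failed)}")
            for task in failed:
                # The status payload includes a stack trace for failed tasks.
                print(f"  task {task['id']} failed on worker {task['worker_id']}")

    if __name__ == "__main__":
        report_failed_tasks()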

FAQs

How can I detect an undersized cluster?

Monitoring is key to ensuring your Connect cluster is sized correctly. Key indicators of an undersized cluster include:

  • Task failures or frequent restarts: Often indicates resource exhaustion.

  • High CPU or memory utilization: Workers consistently running at >80% CPU or memory are likely overloaded.

  • Heap pressure and GC pauses: Heap usage consistently above 70-80% or long garbage collection pauses can degrade performance.

If you observe these symptoms, increase the number of tasks, scale up by adding more CPU or memory, or scale out by adding more workers.
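
A simple way to watch for the CPU and memory symptoms above is a host-level check on each worker. The sketch below uses the third-party psutil package and the 80% thresholds from this FAQ; it does not observe heap usage or GC pauses, which are better tracked through the worker's JMX metrics.

    import psutil  # third-party package: pip install psutil

    CPU_THRESHOLD = 80.0     # percent, from the guidance above
    MEMORY_THRESHOLD = 80.0  # percent

    def check_worker_host() -> None:
        """Warn when this Connect worker host crosses the utilization thresholds."""
        cpu = psutil.cpu_percent(interval=5)   # sample CPU usage over 5 seconds
        mem = psutil.virtual_memory().percent  # overall memory utilization
        if cpu > CPU_THRESHOLD:
            print(f"WARNING: CPU at {cpu:.0f}% exceeds {CPU_THRESHOLD:.0f}%")
        if mem > MEMORY_THRESHOLD:
            print(f"WARNING: memory at {mem:.0f}% exceeds {MEMORY_THRESHOLD:.0f}%")

    if __name__ == "__main__":
        check_worker_host()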

Why am I getting low throughput despite having a large enough cluster?

Even with a well-provisioned cluster, you can encounter the following issues as side effects of resource exhaustion:

  • Slow throughput: Throughput is significantly below your benchmarked rates.

  • High task lag: Signals that the connector cannot keep up with the data source.

You can scale out the cluster to potentially reduce the task lag and increase throughput.
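
One common way to scale out is to raise a connector's tasks.max and resubmit its configuration through the Connect REST API, provided the source or sink offers enough partitions or parallelism to keep the extra tasks busy. The endpoint, connector name, and new task count below are placeholders.

    import json
    import requests  # third-party HTTP client

    CONNECT_URL = "http://localhost:8083"  # assumed Connect REST endpoint
    CONNECTOR = "s3-sink-orders"           # placeholder connector name

    # Fetch the current configuration, raise tasks.max, and resubmit it.
    config = requests.get(f"{CONNECT_URL}/connectors/{CONNECTOR}/config").json()
    config["tasks.max"] = "8"  # for sinks, keep this at or below the topic partition count

    resp = requests.put(
        f"{CONNECT_URL}/connectors/{CONNECTOR}/config",
        headers={"Content-Type": "application/json"},
        data=json.dumps(config),
    )
    resp.raise_for_status()
    print("Updated", resp.json()["name"], "to tasks.max =", config["tasks.max"])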

Note

However, scaling may not always help. Ensure that there are no issues with the system that is running the connector.

If high task lag and slow throughput are not accompanied by high resource utilization, a change in cluster size may not be necessary.