Kafka Connect Cluster Sizing for Confluent Platform Connectors

This guide outlines best practices for sizing Kafka Connect clusters on Confluent Platform, focusing on the most widely used Confluent self-managed connectors. It covers general sizing principles, resource recommendations, and connector-specific requirements.

Sizing a Kafka Connect cluster depends on your specific workload, data formats, and the connectors you use. This guide provides recommendations for the top connectors to help you size your Confluent Platform Connect cluster.

General sizing principles

  • Throughput and task parallelism: Size the Connect cluster so that all connectors and tasks running on the Connect workers operate stably and reliably. To achieve higher throughput, increase the number of tasks and grow the cluster so that it can support the connectors at the scale you need.

  • CPU and memory: Most Connect worker nodes use 4-8 CPU cores and 8-32 GB of RAM. Memory requirements depend on the connector type and workload, especially for connectors that buffer large transactions or handle large messages.

  • Heap size: Heap memory requirements for Connect workers typically range from 0.5-4 GB. Connectors like the S3 Sink or Oracle CDC Source may need more, especially with large transactions or LOB support.

  • Tasks per worker: A common heuristic is to run two tasks per CPU core, but this can vary based on the connector’s resource usage. A rough worker-count sketch based on this heuristic follows this list.

  • High availability: For production environments, deploy at least two Connect workers for high availability. An N+1 sizing strategy is recommended for failover.

  • Disk: Disk space is typically only required for the installation and for logging. For most connectors, 50 GB of disk space per worker is sufficient.
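
As a rough starting point, you can combine the task-per-core heuristic and the N+1 guidance above into a first estimate of cluster size. The following Python sketch is illustrative only; the two-tasks-per-core ratio, the 8-core worker shape, and the example task count are assumptions to replace with your own benchmarked numbers.

    import math

    def estimate_workers(total_tasks: int,
                         cores_per_worker: int = 8,
                         tasks_per_core: int = 2,
                         ha_spare: int = 1) -> int:
        """Rough worker-count estimate based on the heuristics in this guide.

        total_tasks      -- sum of tasks.max across all planned connectors
        cores_per_worker -- CPU cores available to each Connect worker (assumed 8)
        tasks_per_core   -- common heuristic of two tasks per core; varies by connector
        ha_spare         -- extra workers for an N+1 failover strategy
        """
        tasks_per_worker = cores_per_worker * tasks_per_core
        workers = math.ceil(total_tasks / tasks_per_worker)
        return max(workers, 2) + ha_spare  # at least two workers, plus the N+1 spare

    # Example: 60 total tasks on 8-core workers -> 4 workers + 1 spare = 5
    print(estimate_workers(60))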

Sizing for top connectors

Important

The following table provides baseline recommendations. Always:

  • Benchmark with your actual workload and adjust as needed.

  • Test your connectors with sample data and workloads, and use the system configuration that works best for your use case.

| Connector | Typical Use/Notes | CPU (per worker) | Memory (per worker) | Heap size | Special considerations |
| --- | --- | --- | --- | --- | --- |
| JDBC Source/Sink | RDBMS integration | 4-8 cores | 8-16 GB | 2-4 GB | Use one task per table for large databases; high memory is needed for large result sets. |
| Debezium Postgres Source | Change data capture | 4-8 cores | 8-16 GB | 2-4 GB | Supports only one task. |
| Elasticsearch Sink | Analytics, search | 4-8 cores | 8-16 GB | 2-4 GB | Tune batch.size, linger.ms, and max.buffered.records for throughput; large batches increase CPU usage. Monitor for Elasticsearch backpressure via task lag and flush.timeout.ms failures. |
| S3 Sink | Data lake offload | 4-8 cores | 16-32 GB | 4-8 GB | Memory is used for file buffering; a large flush.size increases memory needs. See the example configuration after this table. |
| HDFS Sink | Data lake offload | 4-8 cores | 16-32 GB | 4-8 GB | Similar to the S3 Sink; memory is used for file buffering. |
| MongoDB Source | NoSQL CDC | 4-8 cores | 8-16 GB | 2-4 GB | Memory is used to buffer events during both the snapshot and streaming phases. |
| HTTP Sink/Source | REST integration | 4-8 cores | 8-16 GB | 2-4 GB | Tune the network and batch size. |
| Amazon SQS Source | Queue integration | 4-8 cores | 8-16 GB | 2-4 GB | Batches are limited to 10 messages per poll (sqs.messages.max). Tune sqs.waittime.seconds (0-20 s) to balance latency against API cost. Increase sqs.max.retries for unstable networks. Add tasks to scale throughput; multiple tasks can poll the same queue. |
| Azure Blob Sink | Cloud storage offload | 4-8 cores | 16-32 GB | 4-8 GB | Memory is used for file buffering. |
| Salesforce Source/Sink | SaaS integration | 4-8 cores | 8-16 GB | 2-4 GB | API rate limits may affect throughput. |
| IBM MQ Source/Sink | Mainframe/legacy integration | 4-8 cores | 8-16 GB | 2-4 GB | Tune the network and batch size. |
| Google BigQuery Sink | Analytics | 4-8 cores | 8-16 GB | 2-4 GB | Tune the batch size. |
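
To show how these settings appear in practice, the following sketch registers an S3 Sink connector through the Connect REST API using Python. The REST endpoint, connector name, topic, bucket, region, and the flush.size and tasks.max values are placeholders for illustration; verify property names against the S3 Sink connector documentation for your version.

    import json
    import requests  # third-party HTTP client

    CONNECT_URL = "http://localhost:8083"  # assumed Connect REST endpoint

    # Hypothetical S3 Sink configuration; tune tasks.max, flush.size, and
    # s3.part.size against your benchmarked throughput and heap headroom.
    s3_sink_config = {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "4",                    # scale with topic partitions and worker cores
        "topics": "orders",                  # placeholder topic
        "s3.bucket.name": "example-bucket",  # placeholder bucket
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "10000",               # larger values buffer more records in heap
        "s3.part.size": "26214400",          # 25 MB multipart upload buffer
    }

    resp = requests.put(
        f"{CONNECT_URL}/connectors/s3-sink-orders/config",
        headers={"Content-Type": "application/json"},
        data=json.dumps(s3_sink_config),
    )
    resp.raise_for_status()
    print("Configured connector:", resp.json()["name"])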

Connector-specific considerations

  • Oracle CDC Source: Memory usage can increase with large transactions or LOB columns. Allocate more memory and CPU for high-throughput workloads or large table snapshots. Parallel snapshotting (one task per table partition) can also increase resource needs.

  • S3/HDFS Sink: These connectors use memory to buffer records before writing files to storage. Large flush.size or part.size settings increase memory requirements, so monitor heap usage and adjust accordingly.

  • Replicator: Sizing for Replicator depends on the number of partitions. A general guideline is one task per 300 partitions and one core per task. For example, a replication workload of 900 partitions requires at least three tasks and three CPU cores.

  • Debezium connectors: Some Debezium connectors, like the one for PostgreSQL, support only a single task. For these connectors, you can:

    • Scale vertically: Add more CPU and memory.

    • Scale horizontally: Divide the tables across multiple connector instances, each with its own replication slot, as sketched after this list.

  • JDBC Source: Each task can handle one or more tables. For large databases, use more tasks and ensure the workers have sufficient memory to handle the result sets.
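
To illustrate the horizontal-scaling option for single-task Debezium connectors, the sketch below registers two Debezium Postgres connectors that split the tables between them. The connector names, connection settings, table lists, and slot names are hypothetical; consult the Debezium documentation for the full property set required by your version.

    import json
    import requests  # third-party HTTP client

    CONNECT_URL = "http://localhost:8083"  # assumed Connect REST endpoint

    def debezium_pg_config(prefix: str, tables: str, slot: str) -> dict:
        """Minimal sketch of a Debezium Postgres config split by table list."""
        return {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "tasks.max": "1",                       # the Postgres connector supports only one task
            "database.hostname": "db.example.com",  # placeholder connection settings
            "database.port": "5432",
            "database.user": "cdc_user",
            "database.password": "********",
            "database.dbname": "appdb",
            "topic.prefix": prefix,                 # database.server.name on older Debezium versions
            "table.include.list": tables,           # tables captured by this connector instance
            "slot.name": slot,                      # each instance needs its own replication slot
        }

    connectors = {
        "pg-cdc-orders": debezium_pg_config(
            "orders_db", "public.orders,public.order_items", "cdc_orders"),
        "pg-cdc-customers": debezium_pg_config(
            "customers_db", "public.customers,public.addresses", "cdc_customers"),
    }

    for name, config in connectors.items():
        resp = requests.put(
            f"{CONNECT_URL}/connectors/{name}/config",
            headers={"Content-Type": "application/json"},
            data=json.dumps(config),
        )
        resp.raise_for_status()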

Best practices

  • Benchmarking: Always test your connectors with representative data and workloads. Use a single task to determine a baseline throughput, then scale out the number of tasks.

    Note

    • This guidance applies to scaling tasks, not to scaling the cluster.

    • A connector’s throughput may not always scale linearly with the number of tasks. For example, if one topic partition carries significantly more load than the others, then increasing the number of tasks may not improve a sink connector’s performance.

  • Monitoring: Use JMX metrics and the Connect REST API to monitor task status, throughput, and resource usage. Watch for task failures, consumer lag, and heap pressure. A status-check sketch follows this list.

  • High availability: Deploy at least two workers for high availability. For critical workloads, use an N+1 sizing model.

  • Kubernetes and cloud: When deploying on Kubernetes, set resource requests and limits for each pod. A good starting point is 4-8 CPU cores and 8-16 GB of RAM per Connect pod.
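
The monitoring guidance above can be partly automated against the Connect REST API. The sketch below lists connectors and flags failed tasks; the REST endpoint URL is an assumption, and throughput and heap metrics would still need to be collected separately through JMX.

    import requests  # third-party HTTP client

    CONNECT_URL = "http://localhost:8083"  # assumed Connect REST endpoint

    def report_failed_tasks() -> None:
        """Print connector and task states, highlighting failed tasks."""
        for name in requests.get(f"{CONNECT_URL}/connectors").json():
            status = requests.get(f"{CONNECT_URL}/connectors/{name}/status").json()
            failed = [t for t in status["tasks"] if t["state"] == "FAILED"]
            print(f"{name}: connector={status['connector']['state']}, "
                  f"tasks={len(status['tasks'])}, failed={len(failed)}")
            for task in failed:
                # The status payload includes a stack trace for failed tasks.
                print(f"  task {task['id']} failed on worker {task['worker_id']}")

    if __name__ == "__main__":
        report_failed_tasks()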

FAQs

How can I detect an undersized cluster?

Monitoring is key to ensuring your Connect cluster is sized correctly. Key indicators of an undersized cluster include:

  • Task failures or frequent restarts: Often indicates resource exhaustion.

  • High CPU or memory utilization: Workers consistently running at >80% CPU or memory are likely overloaded.

  • Heap pressure and GC pauses: Heap usage consistently above 70-80% or long garbage collection pauses can degrade performance.

If you observe these symptoms, increase the number of tasks, scale up by adding more CPU or memory, or scale out by adding more workers.
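
A simple way to watch for the CPU and memory symptoms above is a host-level check on each worker. The sketch below uses the third-party psutil package and the 80% thresholds from this FAQ; it does not observe heap usage or GC pauses, which are better tracked through the worker's JMX metrics.

    import psutil  # third-party package: pip install psutil

    CPU_THRESHOLD = 80.0     # percent, from the guidance above
    MEMORY_THRESHOLD = 80.0  # percent

    def check_worker_host() -> None:
        """Warn when this Connect worker host crosses the utilization thresholds."""
        cpu = psutil.cpu_percent(interval=5)   # sample CPU usage over 5 seconds
        mem = psutil.virtual_memory().percent  # overall memory utilization
        if cpu > CPU_THRESHOLD:
            print(f"WARNING: CPU at {cpu:.0f}% exceeds {CPU_THRESHOLD:.0f}%")
        if mem > MEMORY_THRESHOLD:
            print(f"WARNING: memory at {mem:.0f}% exceeds {MEMORY_THRESHOLD:.0f}%")

    if __name__ == "__main__":
        check_worker_host()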

Why am I getting low throughput despite having a large enough cluster?

Even with a well-provisioned cluster, you can encounter the following issues as side effects of resource exhaustion:

  • Slow throughput: Throughput is significantly below your benchmarked rates.

  • High task lag: Signals that the connector cannot keep up with the data source.

You can scale out the cluster to potentially reduce the task lag and increase throughput.
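
One common way to scale out is to raise a connector's tasks.max and resubmit its configuration through the Connect REST API, provided the source or sink offers enough partitions or parallelism to keep the extra tasks busy. The endpoint, connector name, and new task count below are placeholders.

    import json
    import requests  # third-party HTTP client

    CONNECT_URL = "http://localhost:8083"  # assumed Connect REST endpoint
    CONNECTOR = "s3-sink-orders"           # placeholder connector name

    # Fetch the current configuration, raise tasks.max, and resubmit it.
    config = requests.get(f"{CONNECT_URL}/connectors/{CONNECTOR}/config").json()
    config["tasks.max"] = "8"  # for sinks, keep this at or below the topic partition count

    resp = requests.put(
        f"{CONNECT_URL}/connectors/{CONNECTOR}/config",
        headers={"Content-Type": "application/json"},
        data=json.dumps(config),
    )
    resp.raise_for_status()
    print("Updated", resp.json()["name"], "to tasks.max =", config["tasks.max"])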

Note

However, scaling may not always help. Ensure that there are no issues with the system that is running the connector.

If high task lag and slow throughput are not accompanied by high resource utilization, a change in cluster size may not be necessary.