Kafka Connect Cluster Sizing for Confluent Platform Connectors
This guide outlines best practices for sizing Kafka Connect clusters on Confluent Platform, focusing on the most widely used Confluent self-managed connectors. It covers general sizing principles, resource recommendations, and connector-specific requirements.
Sizing a Kafka Connect cluster depends on your specific workload, data formats, and the connectors you use. This guide provides recommendations for the top connectors to help you size your Confluent Platform Connect cluster.
General sizing principles
Throughput and task parallelism: Size the Connect cluster so that all connectors and tasks running on the Connect workers operate stably and reliably. For higher throughput, increase the number of tasks and grow the cluster so it can support the connectors at the scale you need.
CPU and memory: Most Connect worker nodes use 4-8 CPU cores and 8-32 GB of RAM. Memory requirements depend on the connector type and workload, especially for connectors that buffer large transactions or handle large messages.
Heap size: Heap memory requirements for Connect workers typically range from 0.5-4 GB. Connectors like the S3 Sink or Oracle CDC Source may need more, especially with large transactions or LOB support.
Tasks per worker: A common heuristic is to run two tasks per CPU core, but this can vary based on the connector’s resource usage.
High availability: For production environments, deploy at least two Connect workers for high availability. An N+1 sizing strategy is recommended for failover.
Disk: Disk space is typically only required for the installation and for logging. For most connectors, 50 GB of disk space per worker is sufficient.
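The heap and worker-count guidance above is applied when you start the Connect workers. The following is a minimal sketch, not a prescribed configuration; the heap values, installation paths, and worker count are illustrative assumptions that you should adjust based on your own benchmarks:

```bash
# Illustrative heap settings for a Connect worker (adjust to your workload).
# KAFKA_HEAP_OPTS is read by the standard Kafka/Confluent startup scripts.
export KAFKA_HEAP_OPTS="-Xms2g -Xmx4g"

# Start a distributed Connect worker; repeat on at least two hosts (N+1)
# so the cluster stays available if one worker fails.
./bin/connect-distributed ./etc/kafka/connect-distributed.properties
```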
Sizing for top connectors
Important
The following table provides baseline recommendations. Always:
Benchmark with your actual workload and adjust as needed.
Test your connectors with sample data and workloads and use the system configuration that best works for your use case.
| Connector | Typical Use/Notes | CPU (per worker) | Memory (per worker) | Heap size | Special considerations |
|---|---|---|---|---|---|
| JDBC Source/Sink | RDBMS integration | 4-8 cores | 8-16 GB | 2-4 GB | Use one task per table for large databases; high memory is needed for large result sets. |
| Debezium Postgres Source | Change data capture | 4-8 cores | 8-16 GB | 2-4 GB | Supports only one task. |
| Elasticsearch Sink | Analytics, search | 4-8 cores | 8-16 GB | 2-4 GB | Tune the batch size. |
| S3 Sink | Data lake offload | 4-8 cores | 16-32 GB | 4-8 GB | Memory is used for file buffering; a large `flush.size` increases heap requirements. |
| HDFS Sink | Data lake offload | 4-8 cores | 16-32 GB | 4-8 GB | Similar to the S3 Sink; memory is used for file buffering. |
| MongoDB Source | NoSQL CDC | 4-8 cores | 8-16 GB | 2-4 GB | Memory is used to buffer events during both the snapshot and streaming phases. |
| HTTP Sink/Source | REST integration | 4-8 cores | 8-16 GB | 2-4 GB | Tune the network and batch size. |
| Amazon SQS Source | Queue integration | 4-8 cores | 8-16 GB | 2-4 GB | Batches are limited to 10 messages per poll (an SQS API limit). |
| Azure Blob Sink | Cloud storage offload | 4-8 cores | 16-32 GB | 4-8 GB | Memory is used for file buffering. |
| Salesforce Source/Sink | SaaS integration | 4-8 cores | 8-16 GB | 2-4 GB | API rate limits may affect throughput. |
| IBM MQ Source/Sink | Mainframe/legacy integration | 4-8 cores | 8-16 GB | 2-4 GB | Tune the network and batch size. |
| Google BigQuery Sink | Analytics | 4-8 cores | 8-16 GB | 2-4 GB | Tune the batch size. |
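Task parallelism for the connectors in the table is controlled per connector through `tasks.max`. The sketch below submits a hypothetical JDBC Source configuration to the Connect REST API; the connector name, connection URL, table list, and task count are placeholder assumptions, not recommended values:

```bash
# Illustrative only: a JDBC Source with one task per table (4 tables, 4 tasks).
curl -s -X PUT http://localhost:8083/connectors/jdbc-orders-source/config \
  -H "Content-Type: application/json" \
  -d '{
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db.example.com:5432/orders",
    "table.whitelist": "orders,order_items,customers,payments",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "topic.prefix": "pg-",
    "tasks.max": "4"
  }'
```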
Connector-specific considerations
Oracle CDC Source: Memory usage can increase with large transactions or LOB columns. Allocate more memory and CPU for high-throughput workloads or large table snapshots. Parallel snapshotting (one task per table partition) can also increase resource needs.
S3/HDFS Sink: These connectors use memory to buffer records before writing files to storage. Large `flush.size` or `part.size` settings increase memory requirements, so monitor heap usage and adjust accordingly. An example sink configuration is shown at the end of this section.
Replicator: Sizing for Replicator depends on the number of partitions. A general guideline is one task per 300 partitions and one core per task. For example, a replication workload of 900 partitions requires at least three tasks and three CPU cores.
Debezium connectors: Some Debezium connectors, like the one for PostgreSQL, support only a single task. For these connectors, you can:
Scale vertically: Add more CPU and memory.
Scale horizontally: Divide the tables across multiple connector instances, each using its own replication slot.
JDBC Source: Each task can handle one or more tables. For large databases, use more tasks and ensure the workers have sufficient memory to handle the result sets.
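To illustrate the buffering trade-off described for the S3/HDFS Sink above, the following sketch registers a hypothetical S3 Sink. The bucket, topics, and the `flush.size` and `s3.part.size` values are assumptions for illustration; larger values increase the heap used by each task:

```bash
# Illustrative only: S3 Sink where flush.size and s3.part.size drive heap usage per task.
curl -s -X PUT http://localhost:8083/connectors/s3-offload-sink/config \
  -H "Content-Type: application/json" \
  -d '{
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "orders,clickstream",
    "s3.bucket.name": "example-data-lake",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "flush.size": "10000",
    "s3.part.size": "26214400",
    "tasks.max": "4"
  }'
```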
Best practices
Benchmarking: Always test your connectors with representative data and workloads. Use a single task to determine a baseline throughput, then scale out the number of tasks.
Note
This point discusses task scaling, not cluster scaling.
A connector’s throughput may not always scale linearly with the number of tasks. For example, if one topic partition carries significantly more load than the others, then increasing the number of tasks may not improve a sink connector’s performance.
Monitoring: Use JMX metrics and the Connect REST API to monitor task status, throughput, and resource usage. Watch for task failures, consumer lag, and heap pressure.
High availability: Deploy at least two workers for high availability. For critical workloads, use an N+1 sizing model.
Kubernetes and cloud: When deploying on Kubernetes, set resource requests and limits for each pod. A good starting point is 4-8 CPU cores and 8-16 GB of RAM per Connect pod.
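The following is a minimal sketch of pod resource requests and limits for the Kubernetes guidance above; the image tag, replica count, and resource numbers are illustrative assumptions, and worker settings such as bootstrap servers and converters are omitted. If you use Confluent for Kubernetes, set the equivalent values on the Connect custom resource instead:

```bash
# Illustrative only: resource requests and limits for self-managed Connect workers on Kubernetes.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-connect
spec:
  replicas: 2                      # at least two workers for high availability
  selector:
    matchLabels:
      app: kafka-connect
  template:
    metadata:
      labels:
        app: kafka-connect
    spec:
      containers:
        - name: connect
          image: confluentinc/cp-kafka-connect:7.6.0   # example tag
          # Connect worker environment settings (bootstrap servers, converters,
          # group.id, and so on) are omitted for brevity.
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "8"
              memory: 16Gi
EOF
```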
FAQs
What are the recommended limits for a minimally sized cluster?
A minimal Connect cluster (2 workers, 4 CPU cores, and 8-16 GB RAM each) is suitable for development, testing, or low-throughput production workloads.
Data change rate: A minimal cluster can typically handle up to 5-10 MBps of sustained throughput, or about 10,000-50,000 events per second.
Source database size: For CDC connectors, a minimal cluster is best for source databases up to a few hundred GB with moderate change rates (<10,000 row changes/sec).
Connector examples:
JDBC: Suitable for small-to-medium databases.
Oracle CDC/Debezium: Best for databases with moderate transaction volumes and no large LOB columns.
S3/HDFS Sink: Works well for low-frequency batch offloads.
How can I detect an undersized cluster?
Monitoring is key to ensuring your Connect cluster is sized correctly. Key indicators of an undersized cluster include:
Task failures or frequent restarts: Often indicates resource exhaustion.
High CPU or memory utilization: Workers consistently running at >80% CPU or memory are likely overloaded.
Heap pressure and GC pauses: Heap usage consistently above 70-80% or long garbage collection pauses can degrade performance.
If you observe these symptoms, scale up by adding more CPU or memory and increasing the number of tasks, or scale out by adding more workers.
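One way to watch for failed or repeatedly restarting tasks is to poll the Connect REST API. The worker address and connector name below are placeholders:

```bash
# Illustrative only: check connector and task state via the Connect REST API.
# A task in FAILED state, or one that restarts repeatedly, often points to
# resource exhaustion on the worker that runs it.
curl -s http://localhost:8083/connectors/jdbc-orders-source/status \
  | jq '.tasks[] | {id, state, worker_id}'
```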
Why am I getting low throughput despite having a large enough cluster?
Even with a well-provisioned cluster, you can encounter the following symptoms:
Slow throughput: Throughput is significantly below your benchmarked rates.
High task lag: Signals that the connector cannot keep up with the data source.
You can scale out the cluster to potentially reduce the task lag and increase throughput.
Note
However, scaling up the connectors may not always help. Verify that the external systems the connector reads from or writes to are not the bottleneck.
If high task lag and slow throughput are not accompanied by high resource utilization, a change in cluster size may not be necessary.