Dedicated Cluster Performance and Expansion in Confluent Cloud

There are a number of factors that can affect the overall performance of your Apache Kafka® applications running on Confluent Cloud Dedicated clusters. You should monitor the cluster load along with consumer lag, client throttling, and producer latency for clues about how to improve performance in your applications, and to determine whether cluster expansion is the right solution.

For more about monitoring your applications, see Monitoring Your Event Streams: Tutorial for Observability Into Apache Kafka Clients (blog).

Interpret a high cluster load value

Cluster load is a percentage between 0 and 100, where 0% indicates no load on the cluster and 100% represents a fully saturated cluster. Higher load values on a cluster commonly result in higher latencies and/or client throttling for your applications. You should expect higher latency and some degree of throttling if the cluster load is greater than 80%.

You could expand your cluster in an attempt to lower the load, but before you do so you should look at the time-series graph for cluster load to obtain a historical perspective of the cluster load variation. For details on how to access cluster load, see Cluster load metric.

../_images/ccloud-cluster-load.png

Cluster load example

When viewing this graph, consider that a cluster load of 70% may be acceptable if it is an occasional spike or a normal high point for a workload. However, a load of 70% may be too high if the cluster needs additional capacity to accommodate load spikes due to variations in application workload patterns, or if new workloads will be added to the cluster. In that case, expanding the cluster is probably the right solution.
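You can also pull the same cluster load history programmatically from the Confluent Cloud Metrics API, for example to alert on sustained high values. The following is a minimal sketch in Java; the io.confluent.kafka.server/cluster_load_percent metric name, the query shape, the time interval, and the placeholder credentials and lkc- cluster ID are assumptions you should verify against the current Metrics API reference:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class ClusterLoadQuery {
        public static void main(String[] args) throws Exception {
            String apiKey = "<CLOUD_API_KEY>";       // placeholder Cloud API key
            String apiSecret = "<CLOUD_API_SECRET>"; // placeholder Cloud API secret
            String clusterId = "lkc-xxxxx";          // placeholder cluster ID

            // Hourly cluster load over a one-day window; adjust the interval as needed.
            String body = """
                {
                  "aggregations": [{"metric": "io.confluent.kafka.server/cluster_load_percent"}],
                  "filter": {"field": "resource.kafka.id", "op": "EQ", "value": "%s"},
                  "granularity": "PT1H",
                  "intervals": ["2024-06-01T00:00:00Z/2024-06-02T00:00:00Z"]
                }
                """.formatted(clusterId);

            String auth = Base64.getEncoder()
                    .encodeToString((apiKey + ":" + apiSecret).getBytes());

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://api.telemetry.confluent.cloud/v2/metrics/cloud/query"))
                    .header("Content-Type", "application/json")
                    .header("Authorization", "Basic " + auth)
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // Each data point in the response is a timestamped cluster load value.
            System.out.println(response.body());
        }
    }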

Generally, expanding your dedicated cluster provides more capacity for your workloads, and in many cases, will help improve the performance of your Kafka applications. In addition, a lower cluster load can help improve latency for your applications.

If you expand a dedicated cluster, and the expansion does not resolve the performance issues, you can shrink the cluster back to its original size.

Review high consumer lag

The Consumer lag metric indicates the number of records by which a consumer is behind the end of the log for any partition. If data is produced faster than it is consumed, consumer groups accumulate lag. An increase in consumer lag can indicate a client-side issue, a Kafka server-side issue, or both.

../_images/cloud-consumer-lag-detail.png

Consumer lag example
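You can also compute lag per partition outside the UI by comparing each partition's committed offset with its end offset using the Kafka AdminClient. A minimal sketch, assuming placeholder bootstrap servers and credentials and a hypothetical consumer group ID:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;

    public class ConsumerLagCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<BOOTSTRAP_SERVERS>");
            // Add the SASL_SSL settings for your Confluent Cloud cluster here.
            String groupId = "<CONSUMER_GROUP_ID>"; // hypothetical consumer group

            try (AdminClient admin = AdminClient.create(props)) {
                // Committed offsets for every partition the group is consuming.
                Map<TopicPartition, OffsetAndMetadata> committed = admin
                        .listConsumerGroupOffsets(groupId)
                        .partitionsToOffsetAndMetadata().get();

                // Latest (end) offsets for the same partitions.
                Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets = admin
                        .listOffsets(committed.keySet().stream()
                                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                        .all().get();

                // Lag per partition = end offset - committed offset.
                committed.forEach((tp, offsetMeta) -> {
                    if (offsetMeta == null) {
                        return; // no committed offset for this partition yet
                    }
                    long lag = endOffsets.get(tp).offset() - offsetMeta.offset();
                    System.out.printf("%s lag=%d%n", tp, lag);
                });
            }
        }
    }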

To identify if consumer lag is increasing because of server-side issues, first use the cluster load metric to confirm your cluster isn’t overloaded. If the cluster load metric indicates your cluster is heavily loaded (> 70%), it is safe to assume that expanding your cluster will help with consumer lag.

If the cluster is not heavily loaded, next look at the number of partitions on the cluster. Partitions parallelize the workload across your cluster. As a rule of thumb, you should have at least 6-10 partitions per CKU to get the benefits of parallelization, although your specific workload may warrant more or fewer partitions. For more information on improving parallelization, see Optimize and Tune Confluent Cloud Clients.
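To see how many partitions your topics currently have, you can list them with the Kafka AdminClient and compare the totals against the per-CKU guideline. A minimal sketch, assuming placeholder connection properties and a kafka-clients version that provides allTopicNames() (3.1 or later):

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;

    import java.util.Properties;
    import java.util.Set;

    public class PartitionCountCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<BOOTSTRAP_SERVERS>");
            // Add the SASL_SSL settings for your Confluent Cloud cluster here.

            try (AdminClient admin = AdminClient.create(props)) {
                Set<String> topics = admin.listTopics().names().get();
                // Print each topic's partition count; compare the totals against
                // the per-CKU guideline described above.
                admin.describeTopics(topics).allTopicNames().get()
                     .forEach((name, description) -> System.out.printf(
                             "%s: %d partitions%n", name, description.partitions().size()));
            }
        }
    }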

If your cluster is not heavily loaded and you have the recommended number of partitions or more on your cluster, then consumer lag may be increasing because there is insufficient parallelism in your consumer application. Adding consumers to the consumer group may resolve this issue.
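Adding consumers means running more consumer instances that share the same group.id, so that Kafka spreads the topic's partitions across them (up to one consumer per partition). A minimal sketch, assuming a hypothetical orders topic and orders-processing group:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ParallelConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "<BOOTSTRAP_SERVERS>");
            // Add the SASL_SSL settings for your Confluent Cloud cluster here.
            // Every instance that uses this group.id shares the topic's partitions,
            // so starting more instances (up to the partition count) adds parallelism.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processing");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    records.forEach(record -> System.out.printf(
                            "partition=%d offset=%d%n", record.partition(), record.offset()));
                }
            }
        }
    }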

Review client application throttles

Throttles are a normal part of working with cloud services. Confluent Cloud clusters throttle client applications if they exceed the rate the cluster is configured to handle based on its allocated capacity. This throttling prevents additional usage that could cause a cluster outage, which might be catastrophic. Throttles are negotiated between the Kafka brokers and Kafka producers and consumers, with the clients waiting long enough that the cluster can handle each request without compromising uptime.

For more information about client-side producer and consumer metrics, which provide visibility into whether your producers and/or consumers are being throttled, see Client Monitoring and the discussion of produce-throttle-time-avg and produce-throttle-time-max in the Producer Metrics section of the Confluent Platform documentation.
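You can read these throttle metrics directly from a running producer through the Java client's metrics() API. A minimal sketch, assuming a hypothetical orders topic and placeholder connection properties:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class ThrottleCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "<BOOTSTRAP_SERVERS>");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Add the SASL_SSL settings for your Confluent Cloud cluster here.

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 10_000; i++) {
                    producer.send(new ProducerRecord<>("orders", Integer.toString(i), "payload-" + i));
                }
                producer.flush();

                // A sustained non-zero throttle time means the brokers are delaying this
                // client's produce requests to keep the cluster within its capacity.
                producer.metrics().forEach((metricName, metric) -> {
                    String name = metricName.name();
                    if (name.equals("produce-throttle-time-avg") || name.equals("produce-throttle-time-max")) {
                        System.out.printf("%s = %s%n", name, metric.metricValue());
                    }
                });
            }
        }
    }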

Determine if throttling is caused by server or client issues

To determine whether client throttling is a client-side or a server-side issue, again refer to Monitor dedicated cluster load. If the cluster load metric indicates that your cluster is under a high or increasing load, you can reasonably assume that cluster expansion will mitigate throttles.

Other server-side characteristics that are important to evaluate as specific causes of throttling are ingress, egress, and request rate.

Remember that the throughput and request rate for a dedicated cluster are limited by the number of CKUs allocated to the cluster. If your applications are consuming more throughput and making more requests than your cluster can currently handle, expanding your cluster will likely resolve the issue.

However, if your cluster is being throttled and you cannot identify a server-side reason, client-side implementation details could be causing the throttling. For example, your workload may be unbalanced, meaning that one or a few partitions sustain the vast majority of traffic while the remainder of the cluster is underutilized. Confluent Cloud constantly monitors the balance of your cluster to automatically optimize the distribution of your workload. In some cases, Confluent Cloud might not be able to find an optimal balance due to client-side access patterns, such as unbalanced partition assignment strategies.
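One rough way to spot a hot partition is to compare end offsets across a topic's partitions: end offsets accumulate with every record produced, so a partition whose end offset is far ahead of its siblings has received a disproportionate share of the traffic (comparing the growth between two samples over time is more precise than a single snapshot). A minimal sketch, assuming a hypothetical orders topic and placeholder connection properties:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartition;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;

    public class PartitionSkewCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<BOOTSTRAP_SERVERS>");
            // Add the SASL_SSL settings for your Confluent Cloud cluster here.
            String topic = "orders"; // hypothetical topic name

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription description = admin.describeTopics(List.of(topic))
                        .allTopicNames().get().get(topic);

                Map<TopicPartition, OffsetSpec> request = description.partitions().stream()
                        .collect(Collectors.toMap(
                                p -> new TopicPartition(topic, p.partition()),
                                p -> OffsetSpec.latest()));

                // A partition whose end offset is far ahead of its siblings is
                // receiving most of the traffic for this topic.
                Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                        admin.listOffsets(request).all().get();
                endOffsets.forEach((tp, info) ->
                        System.out.printf("%s endOffset=%d%n", tp, info.offset()));
            }
        }
    }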

To ensure the best possible experience when interacting with Kafka, architect your client-side applications so they can take full advantage of the cluster's available throughput.

Monitor latency in producer applications

Certain client metrics can also indicate that producers are experiencing latency. Specifically, buffer-available-bytes dropping to 0 and/or increases in bufferpool-wait-time can indicate that producers are experiencing latency.
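These buffer metrics are available from the Java producer's metrics() API, so your application can log or alert on them. A minimal sketch of a helper you could call periodically with an existing producer instance (the class and method names are illustrative):

    import org.apache.kafka.clients.producer.Producer;

    public class BufferPressureCheck {
        // Call this periodically from the application that owns the producer.
        public static void logBufferPressure(Producer<?, ?> producer) {
            producer.metrics().forEach((metricName, metric) -> {
                String name = metricName.name();
                // buffer-available-bytes near 0 means send() callers are waiting for space
                // in buffer.memory; a rising bufferpool wait time points the same way.
                if (name.equals("buffer-available-bytes") || name.startsWith("bufferpool-wait-time")) {
                    System.out.printf("%s = %s%n", name, metric.metricValue());
                }
            });
        }
    }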

Monitor active_connection_count. Benchmarking shows that exceeding the recommended number of total client connections per CKU often leads to an exponential increase in produce latency. For more information, see the total client connections dimension in the Dimensions with a recommended guideline table.

If the cluster load metric is high and producer buffer usage is high, it is likely that expanding the cluster by adding CKUs will improve producer latency. If latency does not improve after cluster expansion, see the earlier section on client application throttles.