Monitor Confluent Cloud Clients

When developing Apache Kafka® applications on Confluent Cloud, it is important to monitor the health of your deployment to maintain optimal performance. Performance monitoring provides a quantitative account of the health of each Kafka or Confluent component.

Production environments are dynamic: data profiles may change, and you may add new clients or enable new features, so ongoing performance monitoring is crucial. Monitoring is as much about identifying and responding to potential failures as it is about ensuring that service goals are consistently met even as the production environment changes. Before going into production, you should have a robust monitoring system in place for all the producers, consumers, topics, and any other Kafka or Confluent components you're using.

The information on this page helps you configure a robust performance monitoring system when working with Confluent Cloud.

Metrics API

The Confluent Cloud Metrics API provides programmatic access to actionable metrics for your Confluent Cloud deployment, including server-side metrics for the Confluent-managed services. However, the Metrics API does not expose client-side metrics. To retrieve client-side metrics, see Producers and Consumers.

The Metrics API is enabled by default, and aggregates metrics at the topic and cluster level. Any authorized user can gain access to the metrics, which allows you to monitor overall usage and performance. To get started with the Metrics API, see the Confluent Cloud Metrics documentation.

You can use the Metrics API to run queries such as the following (other granularities are available if needed):

  • Bytes produced per minute grouped by topic
  • Bytes consumed per minute grouped by topic
  • Max retained bytes per hour over two hours for a given topic
  • Max retained bytes per hour over two hours for a given cluster

You can retrieve the metrics easily over the internet using HTTPS, capturing them at regular intervals to get a time series and an operational view of cluster performance. You can integrate the metrics into any cloud provider monitoring tools like Azure Monitor, Google Cloud’s operations suite (formerly Stackdriver), or Amazon CloudWatch, or into existing monitoring systems like Prometheus and Datadog, and then plot them in a time series graph to see usage over time. When writing your own application to use the Metrics API, see the full API specification to use advanced features.
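
For example, here is a minimal Java sketch that posts one such query over HTTPS. The endpoint URL, the metric name io.confluent.kafka.server/received_bytes, the environment variable names, the cluster ID lkc-xxxxx, and the time interval are assumptions to adapt to your own deployment; see the API specification for the authoritative request format.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class MetricsApiQuery {
    public static void main(String[] args) throws Exception {
        // Cloud API key and secret; the environment variable names here are assumptions.
        String apiKey = System.getenv("CONFLUENT_CLOUD_API_KEY");
        String apiSecret = System.getenv("CONFLUENT_CLOUD_API_SECRET");
        String clusterId = "lkc-xxxxx"; // hypothetical cluster ID

        // Bytes received by the cluster per minute, grouped by topic, over one hour.
        // Text blocks and String.formatted require Java 15 or later.
        String query = """
            {
              "aggregations": [{ "metric": "io.confluent.kafka.server/received_bytes" }],
              "filter": { "field": "resource.kafka.id", "op": "EQ", "value": "%s" },
              "granularity": "PT1M",
              "group_by": ["metric.topic"],
              "intervals": ["2023-01-01T00:00:00Z/2023-01-01T01:00:00Z"]
            }""".formatted(clusterId);

        String auth = Base64.getEncoder()
                .encodeToString((apiKey + ":" + apiSecret).getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.telemetry.confluent.cloud/v2/metrics/cloud/query"))
                .header("Authorization", "Basic " + auth)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // one data point per minute per topic
    }
}

Running a query like this on a schedule and writing the results into your monitoring system gives you the time series view described above.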

Client JMX metrics

Kafka applications expose some internal Java Management Extensions (JMX) metrics, and many users run JMX exporters to feed these metrics into their monitoring systems. You can retrieve JMX metrics for your client applications and any services you manage yourself (but not for the Confluent-managed services, which are not directly exposed to users) by starting the applications with the JMX_PORT environment variable configured. Many Kafka-internal metrics are exposed through JMX to provide insight into the performance of your applications.
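
As a short sketch, the following connects to a running client's JMX endpoint and lists its producer metric MBeans. The localhost address and port 9999 are assumptions; use whatever host and port your client exposes.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ListClientMBeans {
    public static void main(String[] args) throws Exception {
        // Assumes the client application exposes remote JMX on localhost:9999.
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Print every producer metric MBean registered by the Kafka client.
            for (ObjectName name : mbs.queryNames(
                    new ObjectName("kafka.producer:type=producer-metrics,*"), null)) {
                System.out.println(name);
            }
        } finally {
            connector.close();
        }
    }
}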

Producers

Throttling

Depending on your Confluent Cloud service plan, you may be limited to certain throughput rates for produce (write). If your client applications exceed the produce rates, quotas on the brokers detect this and the brokers throttle the client requests. It's important to ensure your applications aren't consuming more resources than they should. If the brokers are throttling your client requests, consider the following two options:

  • Make modifications to the application to optimize its throughput. For more information on how to optimize throughput, see Optimize Confluent Cloud Clients for Throughput.
  • Choose a cluster configuration with higher limits. In Confluent Cloud, you can choose from Standard and Dedicated clusters (customizable for higher limits). The Metrics API can give you some indication of throughput from the server side, but it doesn’t provide throughput metrics on the client side. For more information about cluster limits, see Kafka Cluster Types in Confluent Cloud.

To get throttling metrics per producer, monitor the following client JMX metrics:

Metric | Description
kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-avg | The average time in ms that a request was throttled by a broker
kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-max | The maximum time in ms that a request was throttled by a broker
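
For example, here is a minimal sketch that reads both attributes from the platform MBeanServer of the JVM running the producer. The client.id value my-producer is a hypothetical placeholder.

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ProducerThrottleCheck {
    // Reads the throttle-time attributes of a producer running in this JVM.
    public static void logThrottleTimes() throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // "my-producer" is a hypothetical client.id; use your producer's configured value.
        ObjectName producerMetrics =
                new ObjectName("kafka.producer:type=producer-metrics,client-id=my-producer");
        double avg = (Double) mbs.getAttribute(producerMetrics, "produce-throttle-time-avg");
        double max = (Double) mbs.getAttribute(producerMetrics, "produce-throttle-time-max");
        // Sustained non-zero values mean the brokers are throttling this producer's requests.
        System.out.printf("produce-throttle-time-avg=%.1f ms, produce-throttle-time-max=%.1f ms%n",
                avg, max);
    }
}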

User processes

To further tune the performance of your producer, monitor the time the producer spends in user processing if the producer uses non-blocking code to send messages. Use the following io-ratio and io-wait-ratio metrics; user processing time is the fraction of time not spent in either of these. If both are low, user processing time may be high, which keeps the single producer I/O thread busy. For example, check whether the producer uses callbacks, which are invoked when messages have been acknowledged and which run in the I/O thread:

Metric | Description
kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=io-ratio | Fraction of time that the I/O thread spent doing I/O
kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=io-wait-ratio | Fraction of time that the I/O thread spent waiting
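
A minimal sketch of this check reads both ratios through the producer's own metrics() map instead of JMX, assuming the metrics are registered in the producer-metrics group; whatever fraction of time is left over approximates user processing time.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import java.util.Map;

public class UserProcessingTimeEstimate {
    // Estimates the fraction of time the producer I/O thread spends outside I/O and I/O waits.
    public static double estimate(KafkaProducer<?, ?> producer) {
        double ioRatio = 0.0;
        double ioWaitRatio = 0.0;
        for (Map.Entry<MetricName, ? extends Metric> e : producer.metrics().entrySet()) {
            MetricName name = e.getKey();
            if (!"producer-metrics".equals(name.group())) {
                continue;
            }
            Object value = e.getValue().metricValue();
            if (value instanceof Double) {
                if ("io-ratio".equals(name.name())) {
                    ioRatio = (Double) value;
                } else if ("io-wait-ratio".equals(name.name())) {
                    ioWaitRatio = (Double) value;
                }
            }
        }
        // The remainder approximates time spent in user processing, such as callbacks.
        return Math.max(0.0, 1.0 - ioRatio - ioWaitRatio);
    }
}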

Consumers

Throttling

As mentioned earlier with producers, depending on your Confluent Cloud service plan, you may be limited to certain throughput rates for consume (read). If your client applications exceed these consume rates, consider the two options mentioned earlier.

To get throttling metrics per consumer, monitor the following client JMX metrics:

Metric | Description
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-avg | The average time in ms that a broker spent throttling a fetch request
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-max | The maximum time in ms that a broker spent throttling a fetch request
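
As with the producer, you can read these values in-process; the following sketch uses KafkaConsumer.metrics() and assumes the metrics are registered in the consumer-fetch-manager-metrics group.

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import java.util.Map;

public class ConsumerThrottleCheck {
    // Logs the average and maximum fetch throttle time reported by the consumer.
    public static void logFetchThrottle(KafkaConsumer<?, ?> consumer) {
        for (Map.Entry<MetricName, ? extends Metric> e : consumer.metrics().entrySet()) {
            MetricName name = e.getKey();
            if ("consumer-fetch-manager-metrics".equals(name.group())
                    && (name.name().equals("fetch-throttle-time-avg")
                        || name.name().equals("fetch-throttle-time-max"))) {
                System.out.println(name.name() + " = " + e.getValue().metricValue() + " ms");
            }
        }
    }
}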

Consumer lag

You should also monitor your application’s consumer lag, which is the number of records for any partition that the consumer is behind in the log. This metric is particularly important for real-time consumer applications where the consumer should be processing the newest messages with as low latency as possible. Monitoring consumer lag can indicate whether the consumer is able to fetch records fast enough from the brokers. Also consider how the offsets are committed. For example, exactly-once semantics (EOS) provide stronger guarantees while potentially increasing consumer lag. You can monitor consumer lag from the Confluent Cloud user interface, as described in the Monitor Kafka Consumer Lag in Confluent Cloud documentation. If you are capturing JMX metrics, you can monitor records-lag-max:

Metric | Description
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),records-lag-max | The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group isn't keeping up with the producers.
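
If you prefer to compute lag directly in the application rather than read the JMX metric, the following sketch compares each assigned partition's position with the broker's log end offset. Call it from the same thread that calls poll(), because the consumer is not thread-safe.

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.util.Map;

public class ConsumerLagCheck {
    // Lag per assigned partition = log end offset on the broker - current consumer position.
    public static void logLag(KafkaConsumer<?, ?> consumer) {
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(consumer.assignment());
        for (Map.Entry<TopicPartition, Long> entry : endOffsets.entrySet()) {
            long lag = entry.getValue() - consumer.position(entry.getKey());
            System.out.println(entry.getKey() + " lag=" + lag + " records");
        }
    }
}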

Client librdkafka metrics

The librdkafka library provides advanced Kafka telemetry that can be used to monitor the performance of Kafka producers and consumers. Similar to JMX metrics, you can configure the librdkafka library to emit internal metrics at regular intervals, which can then be fed into your monitoring systems. While JMX metrics provide high-level statistics for Kafka consumers and producers, the librdkafka metrics provide telemetry at the broker and topic-partition level. For more information, see statistics metrics in the librdkafka documentation.

...
 "throttle": {
   "min": 0,
   "max": 0,
   "avg": 0,
   "sum": 0,
   "stddev": 0,
   "p50": 0,
   "p75": 0,
   "p90": 0,
   "p95": 0,
   "p99": 0,
   "p99_99": 0,
   "outofrange": 0,
   "hdrsize": 17520,
   "cnt": 0
 }
...
Metric | Description
throttle | Top-level object that contains rolling window statistics for broker throttling, in milliseconds.
min | The smallest value.
max | The largest value.
avg | The average value.
sum | The sum of values.
stddev | The standard deviation (based on histogram).
p50 | 50th percentile.
p75 | 75th percentile.
p90 | 90th percentile.
p95 | 95th percentile.
p99 | 99th percentile.
p99_99 | 99.99th percentile.
outofrange | Values skipped because they were outside the histogram range.
hdrsize | Memory size of the Hdr Histogram.
cnt | Number of values sampled.