Monitor Confluent Cloud Clients with JMX

Kafka applications expose some internal Java Management Extensions (JMX) metrics, and many users run JMX exporters to feed the metrics into their monitoring systems. You can retrieve JMX metrics for your client applications and the services you manage by starting your Kafka client applications with the JMX_PORT environment variable configured. You cannot use JMX metrics for the Confluent-managed services.
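
For example, Kafka's own launch scripts map the JMX_PORT environment variable to the standard com.sun.management.jmxremote.port system property; if your application is launched the same way, a JMX client on the same host can connect to that port and browse the client MBeans. The following Java sketch is a minimal example, assuming a client JVM exposing JMX on port 9999 (the port and class name are illustrative, not part of any Confluent API):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ListClientMBeans {
    public static void main(String[] args) throws Exception {
        // Assumes a Kafka client JVM on this host started with JMX_PORT=9999.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // List every producer metric MBean that the client registers.
            for (ObjectName name : mbs.queryNames(new ObjectName("kafka.producer:*"), null)) {
                System.out.println(name);
            }
        }
    }
}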

Producers

Throttling

Depending on your Confluent Cloud service plan, you may be limited to certain throughput rates for produce (write) requests. It’s important to ensure your applications aren’t consuming more resources than they should. If your client applications exceed the produce rates, broker-side quotas detect the overage and the brokers throttle the client application requests. If that happens, consider the following two options:

  • Make modifications to the application to optimize its throughput. For more information on how to optimize throughput, see Optimize Confluent Cloud Clients for Throughput.

  • Choose a cluster configuration with higher limits. In Confluent Cloud, you can choose from Standard and Dedicated clusters, which are customizable for higher limits. The Metrics API can give you some indication of throughput from the server side, but it doesn’t provide throughput metrics on the client side. For more information about cluster limits, see Kafka Cluster Types in Confluent Cloud.

To get throttling metrics per producer, monitor the following client JMX metrics:

kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-avg
    The average time in ms that a request was throttled by a broker.

kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-max
    The maximum time in ms that a request was throttled by a broker.
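
If you prefer to read these values in-process rather than over JMX, the Java producer exposes the same metrics through its metrics() method. A minimal sketch, assuming an already-configured KafkaProducer (the class and helper names here are ours):

import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class ProducerThrottleMonitor {
    // Prints the producer's throttle-time metrics; call periodically,
    // for example from a timer thread.
    static void logThrottleTimes(KafkaProducer<String, String> producer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
            MetricName m = entry.getKey();
            if (m.group().equals("producer-metrics")
                    && (m.name().equals("produce-throttle-time-avg")
                        || m.name().equals("produce-throttle-time-max"))) {
                System.out.println(m.name() + " = " + entry.getValue().metricValue());
            }
        }
    }
}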

User processes

To further tune the performance of your producer, monitor the time the producer spends in user processes if the producer sends messages with non-blocking code. Use the following io-ratio and io-wait-ratio metrics; user processing time is the fraction of time not spent in either of them. If both are low, the user processing time may be high, which keeps the single producer I/O thread busy. For example, check whether the producer uses any callbacks, which are invoked when messages have been acknowledged and run in the I/O thread:

kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=io-ratio
    The fraction of time that the I/O thread spent doing I/O.

kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=io-wait-ratio
    The fraction of time that the I/O thread spent waiting.
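
To estimate user processing time from these two ratios, subtract both from 1. A sketch using the same metrics() map as the producer example above (the class and helper names are ours):

import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class UserTimeEstimator {
    // Approximates the fraction of time the I/O thread spends outside I/O work,
    // which is dominated by user code such as producer callbacks.
    static double userProcessingFraction(Map<MetricName, ? extends Metric> metrics) {
        double io = 0.0;
        double ioWait = 0.0;
        for (Map.Entry<MetricName, ? extends Metric> e : metrics.entrySet()) {
            if (!e.getKey().group().equals("producer-metrics")) {
                continue;
            }
            if (e.getKey().name().equals("io-ratio")) {
                io = ((Number) e.getValue().metricValue()).doubleValue();
            } else if (e.getKey().name().equals("io-wait-ratio")) {
                ioWait = ((Number) e.getValue().metricValue()).doubleValue();
            }
        }
        return 1.0 - io - ioWait;
    }
}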

Consumers

Throttling

As with producers, depending on your Confluent Cloud service plan, you may be limited to certain throughput rates for consume (read) requests. If your client applications exceed these consume rates, consider the two options described earlier.

To get throttling metrics per consumer, monitor the following client JMX metrics:

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-avg
    The average time in ms that a broker spent throttling a fetch request.

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-max
    The maximum time in ms that a broker spent throttling a fetch request.

Consumer lag

Your application’s consumer lag is the number of records for any partition that the consumer is behind in the log. This metric is particularly important for real-time consumer applications where the consumer should be processing the newest messages with as low latency as possible.

Monitor consumer lag to see whether the consumer is able to fetch records fast enough from the brokers.

Also consider how the offsets are committed. For example, exactly-once semantics (EOS) provide stronger guarantees while potentially increasing consumer lag. If you are capturing JMX metrics, you can monitor records-lag-max:

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=records-lag-max
    The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group isn’t keeping up with the producers.
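
As a starting point, you can poll this value from the consumer itself and alert when it grows. A minimal sketch, assuming an already-configured KafkaConsumer (the class and helper names are ours):

import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class LagMonitor {
    // Prints records-lag-max; KafkaConsumer is not thread-safe, so call this
    // from the same thread that calls poll().
    static void logMaxLag(KafkaConsumer<String, String> consumer) {
        for (Map.Entry<MetricName, ? extends Metric> e : consumer.metrics().entrySet()) {
            if (e.getKey().group().equals("consumer-fetch-manager-metrics")
                    && e.getKey().name().equals("records-lag-max")) {
                // The value may be NaN before the first fetch completes.
                System.out.println("records-lag-max = " + e.getValue().metricValue());
            }
        }
    }
}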

Share consumers

Share consumers require their own set of metrics to track performance, message acknowledgement, and share group coordination, and they use a specialized consumer lag calculation.

Throttling

As with consumers, you can monitor throttling per share consumer. Share consumers provide the following throttle metrics:

kafka.consumer:type=consumer-share-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-avg
    The average time in ms that a broker spent throttling a share fetch request.

kafka.consumer:type=consumer-share-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-max
    The maximum time in ms that a broker spent throttling a share fetch request.

Additional metrics and granularity

Share consumer acknowledgement metrics track the lifecycle of record acknowledgements within the share group via kafka.consumer:type=consumer-share-metrics,client-id=([-.\w]+).

Share coordinator metrics track share coordination activity via kafka.consumer:type=consumer-share-coordinator-metrics,client-id=([-.\w]+).

In addition to these metrics, a robust set of share consumer metrics is documented in the Metrics section of KIP-932.

Note

For any of these metrics, you can view data per topic by appending topic=([-.\w]+) to the metric path.
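
For example, the following sketch lists all fetch manager MBeans, including any per-topic variants, by querying the platform MBean server with a wildcard pattern. It assumes it runs inside a JVM that hosts a consumer (the class name is ours):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class TopicMetricLister {
    public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // The trailing * matches the client-id key and, where present, the
        // topic=... key described in the note above. Swap in
        // consumer-share-fetch-manager-metrics to list share consumer MBeans.
        ObjectName pattern = new ObjectName(
            "kafka.consumer:type=consumer-fetch-manager-metrics,*");
        for (ObjectName name : mbs.queryNames(pattern, null)) {
            System.out.println(name);
        }
    }
}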

Client librdkafka metrics

The librdkafka library provides advanced Kafka telemetry that you can use to monitor the performance of Kafka producers and consumers. Similar to JMX metrics, you can configure librdkafka to emit its internal metrics at regular intervals (by setting the statistics.interval.ms configuration property) and feed them into your monitoring systems. While JMX metrics provide high-level statistics for Kafka consumers and producers, librdkafka metrics provide telemetry at the broker and topic-partition level. For more information, see statistics metrics in the librdkafka documentation. For example, the emitted JSON includes a throttle object for each broker:

...
 "throttle": {
   "min": 0,
   "max": 0,
   "avg": 0,
   "sum": 0,
   "stddev": 0,
   "p50": 0,
   "p75": 0,
   "p90": 0,
   "p95": 0,
   "p99": 0,
   "p99_99": 0,
   "outofrange": 0,
   "hdrsize": 17520,
   "cnt": 0
 }
...

throttle
    Top-level object that contains rolling window statistics for broker throttling, in milliseconds.

min
    The smallest value.

max
    The largest value.

avg
    The average value.

sum
    The sum of values.

stddev
    The standard deviation (based on histogram).

p50
    The 50th percentile.

p75
    The 75th percentile.

p90
    The 90th percentile.

p95
    The 95th percentile.

p99
    The 99th percentile.

p99_99
    The 99.99th percentile.

outofrange
    Values skipped because they fell outside the histogram range.

hdrsize
    Memory size of the HDR histogram.

cnt
    Number of values sampled.