Monitor Confluent Cloud Clients with JMX

Kafka applications expose some internal Java Management Extensions (JMX) metrics, and many users run JMX exporters to feed the metrics into their monitoring systems. You can retrieve JMX metrics for your client applications and the services you manage by starting your Kafka client applications with the JMX_PORT environment variable configured. You cannot use JMX metrics for the Confluent-managed services.
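
For example, Kafka's own launch scripts map the JMX_PORT environment variable to the standard com.sun.management.jmxremote.port system property; if your application is launched the same way, a JMX client on the same host can connect to that port and browse the client MBeans. The following Java sketch is a minimal example, assuming a client JVM exposing JMX on port 9999 (the port and class name are illustrative, not part of any Confluent API):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ListClientMBeans {
    public static void main(String[] args) throws Exception {
        // Assumes a Kafka client JVM on this host started with JMX_PORT=9999.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // List every producer metric MBean that the client registers.
            for (ObjectName name : mbs.queryNames(new ObjectName("kafka.producer:*"), null)) {
                System.out.println(name);
            }
        }
    }
}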

Producers

Throttling

Depending on your Confluent Cloud service plan, you may be limited to certain throughput rates for produce (write) requests. It’s important to ensure your applications aren’t consuming more resources than they should. If your client applications exceed the produce rates, broker-side quotas detect the overage and the brokers throttle the client application requests. If that happens, consider the following two options:

  • Make modifications to the application to optimize its throughput. For more information on how to optimize throughput, see Optimize Confluent Cloud Clients for Throughput.

  • Choose a cluster configuration with higher limits. In Confluent Cloud, you can choose from Standard and Dedicated clusters, which are customizable for higher limits. The Metrics API can give you some indication of throughput from the server side, but it doesn’t provide throughput metrics on the client side. For more information about cluster limits, see Kafka Cluster Types in Confluent Cloud.

To get throttling metrics per producer, monitor the following client JMX metrics:

kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-avg
    The average time in ms that a request was throttled by a broker.

kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=produce-throttle-time-max
    The maximum time in ms that a request was throttled by a broker.
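
If you prefer to read these values in-process rather than over JMX, the Java producer exposes the same metrics through its metrics() method. A minimal sketch, assuming an already-configured KafkaProducer (the class and helper names here are ours):

import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class ProducerThrottleMonitor {
    // Prints the producer's throttle-time metrics; call periodically,
    // for example from a timer thread.
    static void logThrottleTimes(KafkaProducer<String, String> producer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
            MetricName m = entry.getKey();
            if (m.group().equals("producer-metrics")
                    && (m.name().equals("produce-throttle-time-avg")
                        || m.name().equals("produce-throttle-time-max"))) {
                System.out.println(m.name() + " = " + entry.getValue().metricValue());
            }
        }
    }
}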

User processes

To further tune the performance of your producer, monitor the time the producer spends in user processes if the producer sends messages with non-blocking code. Use the following io-ratio and io-wait-ratio metrics; user processing time is the fraction of time not spent in either of them. If both are low, the user processing time may be high, which keeps the single producer I/O thread busy. For example, check whether the producer uses any callbacks, which are invoked when messages have been acknowledged and run in the I/O thread:

kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=io-ratio
    The fraction of time that the I/O thread spent doing I/O.

kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=io-wait-ratio
    The fraction of time that the I/O thread spent waiting.
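
To estimate user processing time from these two ratios, subtract both from 1. A sketch using the same metrics() map as the producer example above (the class and helper names are ours):

import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class UserTimeEstimator {
    // Approximates the fraction of time the I/O thread spends outside I/O work,
    // which is dominated by user code such as producer callbacks.
    static double userProcessingFraction(Map<MetricName, ? extends Metric> metrics) {
        double io = 0.0;
        double ioWait = 0.0;
        for (Map.Entry<MetricName, ? extends Metric> e : metrics.entrySet()) {
            if (!e.getKey().group().equals("producer-metrics")) {
                continue;
            }
            if (e.getKey().name().equals("io-ratio")) {
                io = ((Number) e.getValue().metricValue()).doubleValue();
            } else if (e.getKey().name().equals("io-wait-ratio")) {
                ioWait = ((Number) e.getValue().metricValue()).doubleValue();
            }
        }
        return 1.0 - io - ioWait;
    }
}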

Consumers

Throttling

As with producers, depending on your Confluent Cloud service plan, you may be limited to certain throughput rates for consume (read) requests. If your client applications exceed these consume rates, consider the two options described earlier.

To get throttling metrics per consumer, monitor the following client JMX metrics:

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-avg
    The average time in ms that a broker spent throttling a fetch request.

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-max
    The maximum time in ms that a broker spent throttling a fetch request.

Consumer lag

Your application’s consumer lag is the number of records for any partition that the consumer is behind in the log. This metric is particularly important for real-time consumer applications where the consumer should be processing the newest messages with as low latency as possible.

Monitor consumer lag to see whether the consumer is able to fetch records fast enough from the brokers.

Also consider how the offsets are committed. For example, exactly-once semantics (EOS) provide stronger guarantees while potentially increasing consumer lag. If you are capturing JMX metrics, you can monitor records-lag-max:

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),name=records-lag-max
    The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group isn’t keeping up with the producers.
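
As a starting point, you can poll this value from the consumer itself and alert when it grows. A minimal sketch, assuming an already-configured KafkaConsumer (the class and helper names are ours):

import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class LagMonitor {
    // Prints records-lag-max; KafkaConsumer is not thread-safe, so call this
    // from the same thread that calls poll().
    static void logMaxLag(KafkaConsumer<String, String> consumer) {
        for (Map.Entry<MetricName, ? extends Metric> e : consumer.metrics().entrySet()) {
            if (e.getKey().group().equals("consumer-fetch-manager-metrics")
                    && e.getKey().name().equals("records-lag-max")) {
                // The value may be NaN before the first fetch completes.
                System.out.println("records-lag-max = " + e.getValue().metricValue());
            }
        }
    }
}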

Share consumers

Share consumers require their own set of metrics to track performance, message acknowledgement, and share group coordination, and they use a specialized consumer lag calculation.

Throttling

As with consumers, you can monitor throttling per share consumer. Share consumers provide the following throttle metrics:

kafka.consumer:type=consumer-share-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-avg
    The average time in ms that a broker spent throttling a share fetch request.

kafka.consumer:type=consumer-share-fetch-manager-metrics,client-id=([-.\w]+),name=fetch-throttle-time-max
    The maximum time in ms that a broker spent throttling a share fetch request.

Additional metrics and granularity

Share consumer acknowledgement metrics track the lifecycle of record acknowledgements within the share group via kafka.consumer:type=consumer-share-metrics,client-id=([-.\w]+).

Share coordinator metrics track share coordination activity via kafka.consumer:type=consumer-share-coordinator-metrics,client-id=([-.\w]+).

In addition to these metrics, a robust set of share consumer metrics is documented in the Metrics section of KIP-932.

Note

For any of these metrics, you can view data per topic by appending topic=([-.\w]+) to the metric path.
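
For example, the following sketch lists all fetch manager MBeans, including any per-topic variants, by querying the platform MBean server with a wildcard pattern. It assumes it runs inside a JVM that hosts a consumer (the class name is ours):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class TopicMetricLister {
    public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // The trailing * matches the client-id key and, where present, the
        // topic=... key described in the note above. Swap in
        // consumer-share-fetch-manager-metrics to list share consumer MBeans.
        ObjectName pattern = new ObjectName(
            "kafka.consumer:type=consumer-fetch-manager-metrics,*");
        for (ObjectName name : mbs.queryNames(pattern, null)) {
            System.out.println(name);
        }
    }
}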

Client librdkafka metrics

The librdkafka library provides advanced Kafka telemetry that you can use to monitor the performance of Kafka producers and consumers. Similar to JMX metrics, you can configure librdkafka to emit its internal metrics at regular intervals (by setting the statistics.interval.ms configuration property) and feed them into your monitoring systems. While JMX metrics provide high-level statistics for Kafka consumers and producers, librdkafka metrics provide telemetry at the broker and topic-partition level. For more information, see statistics metrics in the librdkafka documentation. For example, the emitted JSON includes a throttle object for each broker:

...
 "throttle": {
   "min": 0,
   "max": 0,
   "avg": 0,
   "sum": 0,
   "stddev": 0,
   "p50": 0,
   "p75": 0,
   "p90": 0,
   "p95": 0,
   "p99": 0,
   "p99_99": 0,
   "outofrange": 0,
   "hdrsize": 17520,
   "cnt": 0
 }
...

throttle
    Top-level object that contains rolling window statistics for broker throttling, in milliseconds.

min
    The smallest value.

max
    The largest value.

avg
    The average value.

sum
    The sum of values.

stddev
    The standard deviation (based on histogram).

p50
    The 50th percentile.

p75
    The 75th percentile.

p90
    The 90th percentile.

p95
    The 95th percentile.

p99
    The 99th percentile.

p99_99
    The 99.99th percentile.

outofrange
    Values skipped because they fell outside the histogram range.

hdrsize
    Memory size of the HDR histogram.

cnt
    Number of values sampled.