Monitor Confluent Cloud Clients with JMX Monitoring
Kafka client applications expose internal metrics through Java Management Extensions (JMX), and many users run JMX exporters to feed those metrics into their monitoring systems. You can retrieve JMX metrics for your client applications and any services you manage yourself by starting your Kafka client applications with the JMX_PORT environment variable configured. JMX metrics are not available for the Confluent-managed services.
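As a minimal sketch, the scripts shipped with the Kafka distribution honor the JMX_PORT variable and translate it into the JVM's `com.sun.management.jmxremote` flags; the port, topic, and bootstrap server below are placeholders:

```shell
# Expose JMX on port 9999 for a client launched via the Kafka scripts.
JMX_PORT=9999 \
  kafka-console-producer --bootstrap-server "$BOOTSTRAP_SERVER" --topic demo

# For a custom Java application, pass the equivalent flags directly
# (disable auth/SSL only in trusted environments):
# java -Dcom.sun.management.jmxremote \
#      -Dcom.sun.management.jmxremote.port=9999 \
#      -Dcom.sun.management.jmxremote.authenticate=false \
#      -Dcom.sun.management.jmxremote.ssl=false \
#      -jar my-client-app.jar
```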
Producers
Throttling
Depending on your Confluent Cloud service plan, you may be limited to certain throughput rates for produce (write) requests. If your client applications exceed these rates, the quotas on the brokers detect it and the brokers throttle the client application requests. It’s important to ensure your applications aren’t consuming more resources than they should. If the brokers throttle the client application requests, consider the following two options:
Make modifications to the application to optimize its throughput. For more information on how to optimize throughput, see Optimize Confluent Cloud Clients for Throughput.
Choose a cluster configuration with higher limits. In Confluent Cloud, you can choose from Standard and Dedicated clusters, which are customizable for higher limits. The Metrics API can give you some indication of throughput from the server side, but it doesn’t provide throughput metrics on the client side. For more information about cluster limits, see Kafka Cluster Types in Confluent Cloud.
To get throttling metrics per producer, monitor the following client JMX metrics:
| Metric | Description |
|---|---|
| `produce-throttle-time-avg` | The average time in ms that a request was throttled by a broker |
| `produce-throttle-time-max` | The maximum time in ms that a request was throttled by a broker |
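One common way to scrape these MBeans without a full JMX exporter is a Jolokia HTTP agent attached to the producer JVM. The following sketch assumes the agent listens on `localhost:8778` and the producer's client id is `my-producer`; both are placeholders for your deployment:

```python
# Sketch: read producer throttle metrics through a Jolokia HTTP agent.
import json
from urllib.request import urlopen

JOLOKIA_BASE = "http://localhost:8778/jolokia/read"  # assumed agent address

def throttle_mbean_url(client_id: str) -> str:
    """Build the Jolokia read URL for the producer throttle-time attributes."""
    mbean = f"kafka.producer:type=producer-metrics,client-id={client_id}"
    return f"{JOLOKIA_BASE}/{mbean}/produce-throttle-time-avg,produce-throttle-time-max"

def parse_throttle_response(body: str) -> dict:
    """Extract the attribute map from a Jolokia JSON response."""
    return json.loads(body).get("value", {})

# Usage (requires a running producer with a Jolokia agent):
# with urlopen(throttle_mbean_url("my-producer")) as resp:
#     print(parse_throttle_response(resp.read().decode()))
```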
User processes
To further tune producer performance, monitor the time the producer spends in user processes if the producer sends messages with non-blocking code. Use the following io-ratio and io-wait-ratio metrics; user processing time is the fraction of time not spent in either of them. If both are low, user processing time may be high, which keeps the single producer I/O thread busy. For example, check whether the producer uses any callbacks, which are invoked when messages have been acknowledged and which run in the I/O thread:
| Metric | Description |
|---|---|
| `io-ratio` | Fraction of time that the I/O thread spent doing I/O |
| `io-wait-ratio` | Fraction of time that the I/O thread spent waiting |
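Because user processing time is simply whatever fraction of the I/O thread's time these two metrics leave over, the arithmetic can be sketched as follows (function name is illustrative, not a client API):

```python
def user_processing_ratio(io_ratio: float, io_wait_ratio: float) -> float:
    """Fraction of the I/O thread's time spent outside doing I/O or
    waiting, i.e. in user processing such as acknowledgement callbacks.
    Clamped at zero to absorb sampling noise in the two input ratios."""
    return max(0.0, 1.0 - io_ratio - io_wait_ratio)
```

A value close to 1.0 suggests callbacks or other user code are keeping the single I/O thread busy.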
Consumers
Throttling
As mentioned earlier with producers, depending on your Confluent Cloud service plan, you can be limited to certain throughput rates for consume (read). If your client applications exceed these consume rates, consider the two options mentioned earlier.
To get throttling metrics per consumer, monitor the following client JMX metrics:
| Metric | Description |
|---|---|
| `fetch-throttle-time-avg` | The average time in ms that a broker spent throttling a fetch request |
| `fetch-throttle-time-max` | The maximum time in ms that a broker spent throttling a fetch request |
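If you scrape consumer MBeans over a Jolokia HTTP agent attached to the consumer JVM, the read URL can be built as sketched below; the base address and client id are assumptions to adjust for your deployment:

```python
def fetch_throttle_url(base: str, client_id: str) -> str:
    """Jolokia read URL for the consumer fetch throttle-time attributes."""
    mbean = f"kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client_id}"
    return (f"{base}/jolokia/read/{mbean}/"
            "fetch-throttle-time-avg,fetch-throttle-time-max")

# Usage: fetch_throttle_url("http://localhost:8778", "my-consumer")
```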
Consumer lag
Your application’s consumer lag is the number of records, for any given partition, by which the consumer is behind the latest record in the log. This metric is particularly important for real-time consumer applications where the consumer should be processing the newest messages with as low latency as possible.
Monitor consumer lag to see whether the consumer is able to fetch records fast enough from the brokers.
Also consider how the offsets are committed. For example, exactly-once semantics (EOS) provide stronger guarantees while potentially increasing consumer lag. If you are capturing JMX metrics, you can monitor records-lag-max:
| Metric | Description |
|---|---|
| `records-lag-max` | The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group isn’t keeping up with the producers. |
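The arithmetic behind this metric can be illustrated with a minimal sketch: per-partition lag is the distance between the log end offset and the consumer's current position, and records-lag-max is the worst of those (function names are illustrative, not a client API):

```python
def partition_lag(log_end_offset: int, consumer_position: int) -> int:
    """Records remaining for the consumer in one partition; never negative."""
    return max(0, log_end_offset - consumer_position)

def max_lag(lags: list[int]) -> int:
    """records-lag-max analogue: the worst lag across all partitions."""
    return max(lags, default=0)
```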
Client librdkafka metrics
The librdkafka library provides advanced Kafka telemetry that can be used to monitor the performance of Kafka producers and consumers. Similar to JMX metrics, you can configure the librdkafka library to emit internal metrics at regular intervals, which can then be fed into your monitoring systems. While JMX metrics provide high-level statistics for Kafka consumers and producers, the librdkafka metrics provide telemetry at the broker and topic-partition level. For more information, see statistics metrics in the librdkafka documentation.
```
...
"throttle": {
    "min": 0,
    "max": 0,
    "avg": 0,
    "sum": 0,
    "stddev": 0,
    "p50": 0,
    "p75": 0,
    "p90": 0,
    "p95": 0,
    "p99": 0,
    "p99_99": 0,
    "outofrange": 0,
    "hdrsize": 17520,
    "cnt": 0
}
...
```
| Metric | Description |
|---|---|
| `throttle` | Top-level object that contains rolling window statistics for broker throttling, in milliseconds. |
| `min` | The smallest value. |
| `max` | The largest value. |
| `avg` | The average value. |
| `sum` | The sum of values. |
| `stddev` | The standard deviation (based on histogram). |
| `p50` | 50th percentile. |
| `p75` | 75th percentile. |
| `p90` | 90th percentile. |
| `p95` | 95th percentile. |
| `p99` | 99th percentile. |
| `p99_99` | 99.99th percentile. |
| `outofrange` | Values skipped because they were out of histogram range. |
| `hdrsize` | Memory size of the HDR histogram. |
| `cnt` | Number of values sampled. |
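Because the statistics are emitted as a JSON string (delivered to your client's statistics callback when `statistics.interval.ms` is set), extracting the per-broker throttle windows is a small parsing exercise. A sketch, with a fabricated sample payload for illustration:

```python
import json

def broker_throttle_windows(stats_json: str) -> dict:
    """Map broker name -> rolling-window throttle stats (milliseconds)
    from a librdkafka statistics payload."""
    stats = json.loads(stats_json)
    return {name: broker["throttle"]
            for name, broker in stats.get("brokers", {}).items()
            if "throttle" in broker}

# Fabricated sample payload:
sample = '{"brokers": {"broker0:9092/0": {"throttle": {"avg": 3, "p99": 12, "cnt": 40}}}}'
windows = broker_throttle_windows(sample)
# windows["broker0:9092/0"]["p99"] -> 12
```

Feeding `avg` and the high percentiles into your monitoring system gives you the client-side view of broker throttling that the JMX metrics above provide for Java clients.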