When developing Apache Kafka® applications on Confluent Cloud, it is important to monitor the health of your deployment to maintain optimal performance. Performance monitoring helps to achieve this by providing a quantitative account of the health of each Kafka or Confluent component.
Because production environments are dynamic (data profiles may change, you may add new clients, and you may enable new features), performance monitoring is crucial. Ongoing monitoring is as much about identifying and responding to potential failures as it is about ensuring that service goals are consistently met even as the production environment changes. Before going into production, you should have a robust monitoring system for all the producers, consumers, topics, and any other Kafka or Confluent components you’re using.
The information on this page helps you configure a robust performance monitoring system when working with Confluent Cloud.
The Confluent Cloud Metrics API provides programmatic access to actionable metrics for your Confluent Cloud deployment, including server-side metrics for the Confluent-managed services. However, the Metrics API does not allow you to get client-side metrics. To retrieve client-side metrics, see Producers and Consumers.
The Metrics API is enabled by default, and aggregates metrics at the topic and cluster level. Any authorized user can gain access to the metrics, which allows you to monitor overall usage and performance. To get started with the Metrics API, see the Confluent Cloud Metrics documentation.
You can use the Metrics API to query metrics at the following granularities (other resolutions are available if needed):
- Bytes produced per minute grouped by topic
- Bytes consumed per minute grouped by topic
- Max retained bytes per hour over two hours for a given topic
- Max retained bytes per hour over two hours for a given cluster
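As a rough illustration of the first query in the list, the following sketch builds a request body for bytes produced per minute grouped by topic. The endpoint URL, the metric name io.confluent.kafka.server/received_bytes, the field names, and the cluster ID lkc-XXXXX are assumptions modeled on version 2 of the Metrics API and are not taken from this page. Check them against the API specification before relying on them.

```python
import json

# Assumed endpoint for version 2 of the Metrics API; verify against the API specification.
QUERY_URL = "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query"


def build_bytes_per_minute_query(cluster_id, metric, start, end):
    """Build a query body for a per-minute metric grouped by topic.

    start and end are ISO-8601 timestamps that bound the query interval.
    """
    return {
        "aggregations": [{"metric": metric}],
        "filter": {"field": "resource.kafka.id", "op": "EQ", "value": cluster_id},
        "granularity": "PT1M",         # one data point per minute
        "group_by": ["metric.topic"],  # one series per topic
        "intervals": [f"{start}/{end}"],
    }


payload = build_bytes_per_minute_query(
    cluster_id="lkc-XXXXX",  # hypothetical cluster ID
    metric="io.confluent.kafka.server/received_bytes",  # assumed name for bytes produced
    start="2024-01-01T00:00:00Z",
    end="2024-01-01T01:00:00Z",
)
print(json.dumps(payload, indent=2))
```

The body would be sent as JSON in an HTTPS POST to the query endpoint, authenticated with an API key and secret. An hourly query would presumably use a granularity of PT1H in the same structure.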
You can retrieve the metrics over the internet using HTTPS, capturing them at regular intervals to get a time series and an operational view of cluster performance. You can integrate the metrics into cloud provider monitoring tools such as Azure Monitor, Google Cloud’s operations suite (formerly Stackdriver), or Amazon CloudWatch, or into existing monitoring systems such as Prometheus and Datadog, and then plot them in a time series graph to see usage over time. When writing your own application to use the Metrics API, see the full API specification to use advanced features.
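To make the time series idea concrete, here is a minimal sketch that groups captured data points into one chronologically ordered series per topic, ready to hand to a plotting or monitoring tool. The response shape (a data list whose entries carry timestamp, value, and metric.topic keys) is an assumption about the Metrics API output, and the sample values are invented for illustration.

```python
from collections import defaultdict


def to_time_series(response):
    """Group data points into one time-ordered series per topic."""
    series = defaultdict(list)
    for point in response.get("data", []):
        # Points without a topic label are collected under a placeholder key.
        topic = point.get("metric.topic", "<all>")
        series[topic].append((point["timestamp"], point["value"]))
    for points in series.values():
        # ISO-8601 timestamps in a single time zone sort chronologically as strings.
        points.sort()
    return dict(series)


# Hypothetical response, shaped like the assumed Metrics API output.
sample = {
    "data": [
        {"timestamp": "2024-01-01T00:01:00Z", "value": 2048.0, "metric.topic": "orders"},
        {"timestamp": "2024-01-01T00:00:00Z", "value": 1024.0, "metric.topic": "orders"},
        {"timestamp": "2024-01-01T00:00:00Z", "value": 512.0, "metric.topic": "payments"},
    ]
}
series = to_time_series(sample)
print(series["orders"])
```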
Client JMX Metrics
Kafka applications expose some internal Java Management Extensions (JMX) metrics,
and many users run JMX exporters to feed the metrics into their monitoring
systems. You can retrieve JMX metrics for your client applications and the
services you manage (though not for the Confluent-managed services, which are
not directly exposed to users) by starting your Kafka client applications with
the JMX_PORT environment variable configured. Many Kafka-internal metrics are
exposed through JMX to provide insight into the performance of your
applications.
Depending on your Confluent Cloud service plan, you may be limited to certain throughput rates for produce (write). If your client applications exceed the produce rates, the quotas on the brokers detect it and the brokers throttle the client application requests. It’s important to ensure your applications aren’t consuming more resources than they should be. If the brokers throttle the client application requests, consider the following two options:
- Make modifications to the application to optimize its throughput. For more information on how to optimize throughput, see Optimizing for Throughput.
- Upgrade to a cluster configuration with higher limits. In Confluent Cloud, you can choose from Standard and Dedicated clusters (customizable for higher limits). The Metrics API can give you some indication of throughput from the server side, but it doesn’t provide throughput metrics on the client side.
To get throttling metrics per producer, monitor the following client JMX metrics:
|produce-throttle-time-avg|The average time in ms that a request was throttled by a broker|
|produce-throttle-time-max|The maximum time in ms that a request was throttled by a broker|
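Once these values reach your monitoring system, a simple check can flag throttling. The sketch below assumes the metrics have already been scraped into a dictionary keyed by attribute name; produce-throttle-time-avg and produce-throttle-time-max are the attribute names the Kafka producer reports, while the function name and the sample values are hypothetical.

```python
def is_throttled(metrics, prefix="produce"):
    """Return True when the throttle-time metrics show broker throttling.

    metrics maps JMX attribute names to their latest values. A client that
    has not been throttled reports 0, or possibly NaN before any samples exist.
    """
    for suffix in ("avg", "max"):
        value = metrics.get(f"{prefix}-throttle-time-{suffix}", 0.0)
        # NaN > 0 is False, so a metric without samples counts as not throttled.
        if value > 0:
            return True
    return False


# Hypothetical scraped values for one producer.
producer_metrics = {
    "produce-throttle-time-avg": 12.5,
    "produce-throttle-time-max": 80.0,
}
print(is_throttled(producer_metrics))
```

Passing prefix="fetch" applies the same check to the consumer attributes fetch-throttle-time-avg and fetch-throttle-time-max.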
To further tune the performance of your producer, monitor the time the producer
spends in user processes if the producer has non-blocking code to send messages.
Use the io-ratio and io-wait-ratio metrics, where user processing time is the
fraction of time not spent in either of these. If both ratios are low, then the
user processing time may be high, which keeps the single producer I/O thread
busy. For example, you can check whether the producer is using any callbacks,
which are invoked when messages have been acknowledged and run in the I/O
thread:
|io-ratio|Fraction of time that the I/O thread spent doing I/O|
|io-wait-ratio|Fraction of time that the I/O thread spent waiting|
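The relationship described above is simple arithmetic: user processing time is whatever remains after the I/O fraction and the waiting fraction are subtracted from 1. A minimal sketch, assuming the two ratios have already been read from JMX (the function name and sample values are illustrative only):

```python
def user_processing_fraction(io_ratio, io_wait_ratio):
    """Estimate the fraction of I/O-thread time spent outside I/O and waiting."""
    remainder = 1.0 - io_ratio - io_wait_ratio
    # Clamp to [0, 1] because the ratios are sampled averages and may not sum exactly.
    return min(1.0, max(0.0, remainder))


fraction = user_processing_fraction(io_ratio=0.05, io_wait_ratio=0.60)
print(f"{fraction:.2f}")
```

A large result suggests that work such as callbacks is occupying the I/O thread and is worth profiling.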
As mentioned earlier with producers, depending on your Confluent Cloud service plan, you may be limited to certain throughput rates for consume (read). If your client applications exceed these consume rates, consider the two options mentioned earlier.
To get throttling metrics per consumer, monitor the following client JMX metrics:
|fetch-throttle-time-avg|The average time in ms that a broker spent throttling a fetch request|
|fetch-throttle-time-max|The maximum time in ms that a broker spent throttling a fetch request|
You should also monitor your application’s
consumer lag, which is the number
of records for any partition that the consumer is behind in the log. This metric
is particularly important for real-time consumer applications where the consumer
should be processing the newest messages with as low latency as possible.
Monitoring consumer lag can indicate whether the consumer is able to fetch
records fast enough from the brokers. Also consider how the offsets are
committed. For example, exactly-once semantics (EOS) provide stronger guarantees
while potentially increasing consumer lag. You can monitor consumer lag from the
Confluent Cloud user interface, as described in the Monitor Consumer Lag
documentation. If you are capturing JMX metrics, you can monitor the following metric:
|records-lag-max|The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group isn’t keeping up with the producers.|
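Because the signal to watch is lag that increases over time, a trend check is more useful than a fixed threshold. The following sketch fits a least-squares line to evenly spaced lag samples and reports whether the slope is positive. The function name, the minimum sample count, and the sample values are assumptions for illustration; a production alert would likely also require a minimum slope or duration.

```python
def lag_is_growing(samples, min_points=3):
    """Return True when the least-squares slope of the lag samples is positive.

    samples is a list of maximum-lag values taken at evenly spaced intervals.
    """
    n = len(samples)
    if n < max(min_points, 2):
        return False  # too few points to call a trend
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance > 0


# Hypothetical lag samples captured once per minute.
print(lag_is_growing([100, 150, 140, 220, 300]))
```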
For an example that shows how to monitor an Apache Kafka® client application and Confluent Cloud metrics, and steps through various failure scenarios to show metrics results, see the Observability for Apache Kafka® Clients to Confluent Cloud demo.