When developing Apache Kafka® applications on Confluent Cloud, it is important to monitor the
health of your deployment to maintain optimal performance. Performance
monitoring provides a quantitative account of the health of each Kafka or
Confluent component.
Because production environments are dynamic (data profiles change, new clients
are added, and new features are enabled), performance monitoring is crucial.
Ongoing monitoring is as much about identifying and responding to potential
failures as it is about ensuring that service goals are consistently met even as
the production environment changes. Before going into production, you should
have a robust monitoring system for all the producers, consumers, topics, and
any other Kafka or Confluent components you’re using.
The information on this page helps you configure a robust performance monitoring
system when working with Confluent Cloud.
The Confluent Cloud Metrics API provides programmatic access to actionable metrics for
your Confluent Cloud deployment, including server-side metrics for the
Confluent-managed services. However, the Metrics API does not allow you to get
client-side metrics. To retrieve client-side metrics, see Producers and Consumers.
The Metrics API is enabled by default, and aggregates metrics at the topic and
cluster level. Any authorized user can gain access to the metrics, which allows
you to monitor overall usage and performance. To get started with the Metrics
API, see the Confluent Cloud Metrics API documentation.
You can use the Metrics API to query metrics at granularities such as the
following (other resolutions are available if needed):
- Bytes produced per minute grouped by topic
- Bytes consumed per minute grouped by topic
- Max retained bytes per hour over two hours for a given topic
- Max retained bytes per hour over two hours for a given cluster
You can retrieve the metrics easily over the internet using HTTPS, capturing
them at regular intervals to get a time series and an operational view of
cluster performance. You can integrate the metrics into any cloud provider
monitoring tools like Azure Monitor,
Google Cloud’s operations suite (formerly Stackdriver), or
Amazon CloudWatch, or into existing
monitoring systems like Prometheus and Datadog, and then plot them in a time series graph to
see usage over time. When writing your own application to use the Metrics API,
see the full API specification
to use advanced features.
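For example, a query for bytes produced per minute grouped by topic could be posted
over HTTPS roughly as follows. This is a sketch only: the endpoint path, the metric
name (io.confluent.kafka.server/received_bytes), and the field names are assumptions
to verify against the Metrics API specification, and the cluster ID, interval, and
credentials are placeholders.

```python
import base64
import json
import urllib.request

# Assumed v2 query endpoint; confirm against the Metrics API specification.
METRICS_API_URL = "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query"

def build_query(cluster_id,
                metric="io.confluent.kafka.server/received_bytes",
                granularity="PT1M",
                interval="2024-01-01T00:00:00Z/PT1H"):
    """Build a query body: bytes produced per minute, grouped by topic."""
    return {
        "aggregations": [{"metric": metric}],
        "filter": {"field": "resource.kafka.id", "op": "EQ",
                   "value": cluster_id},
        "granularity": granularity,
        "group_by": ["metric.topic"],
        "intervals": [interval],
    }

def post_query(query, api_key, api_secret):
    """POST the query over HTTPS with basic auth (requires a Cloud API key)."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    req = urllib.request.Request(
        METRICS_API_URL,
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling post_query at a regular interval and storing each response yields the time
series described above.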
Client JMX Metrics
Kafka applications expose some internal Java Management Extensions (JMX) metrics,
and many users run JMX exporters to feed the metrics into their monitoring
systems. You can retrieve JMX metrics for your client applications and the
services you manage (though not for the Confluent-managed services, which are
not directly exposed to users) by starting your Kafka client applications with
JMX_PORT environment variable configured. There are many
Kafka-internal metrics that are exposed through JMX to
provide insight on the performance of your applications.
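For example, one of the stock Kafka console tools could be started with JMX enabled
as shown below. The port, bootstrap server, and topic name are placeholders; note
that JMX_PORT is honored by the Kafka launch scripts, while a standalone Java
application would set -Dcom.sun.management.jmxremote.port instead.

```shell
# JMX_PORT is read by kafka-run-class.sh, which the Kafka tool scripts use.
# Port, server, and topic are placeholders for your own values.
JMX_PORT=9999 bin/kafka-console-consumer.sh \
  --bootstrap-server "$BOOTSTRAP_SERVER" \
  --consumer.config client.properties \
  --topic my-topic
```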
Depending on your Confluent Cloud service plan, you may be limited to certain
throughput rates for produce (write). If your client applications exceed the
produce rates, quotas on the brokers detect it and throttle the client
application requests. It’s important to ensure your applications aren’t
consuming more resources than they should be. If the brokers throttle the client
application requests, consider the following two options:
- Make modifications to the application to optimize its throughput.
For more information on how to optimize throughput, see
Optimizing for Throughput.
- Upgrade to a cluster configuration with higher limits.
In Confluent Cloud, you can choose from Standard and Dedicated clusters
(customizable for higher limits). The Metrics API can give you some
indication of throughput from the server side, but it doesn’t provide
throughput metrics on the client side.
To get throttling metrics per producer, monitor the following client JMX metrics:
- produce-throttle-time-avg: The average time in ms that a request was throttled by a broker
- produce-throttle-time-max: The maximum time in ms that a request was throttled by a broker
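As a sketch of how these could be polled, assume a client that exposes its metrics
as a nested dict keyed by metric group and name (kafka-python’s
KafkaProducer.metrics() has roughly this shape; other clients expose the same
JMX metric names through their own stats interfaces). The sample values are
illustrative:

```python
# Extract producer throttle metrics from a nested metrics snapshot
# (shape assumed from kafka-python's KafkaProducer.metrics()).
def throttle_times(metrics):
    """Return (avg_ms, max_ms) throttle times, or (0.0, 0.0) if absent."""
    producer = metrics.get("producer-metrics", {})
    return (producer.get("produce-throttle-time-avg", 0.0),
            producer.get("produce-throttle-time-max", 0.0))

def is_throttled(metrics, threshold_ms=0.0):
    """True if a broker throttled any recent produce request."""
    avg_ms, max_ms = throttle_times(metrics)
    return max_ms > threshold_ms

# Illustrative snapshot, as might be sampled on a timer:
sample = {"producer-metrics": {"produce-throttle-time-avg": 12.5,
                               "produce-throttle-time-max": 40.0}}
```

Polling these values on an interval and alerting when is_throttled stays true is
a simple way to catch sustained quota violations.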
To further tune the performance of your producer, monitor the time the producer
spends in user processing if the producer sends messages with non-blocking code.
Use the following io-ratio and io-wait-ratio metrics, where user processing time
is the fraction of time spent in neither of these. If both values are low, then
the user processing time may be high, which keeps the single producer I/O thread
busy. For example, you can check whether the producer is using any callbacks,
which are invoked when messages have been acknowledged and run in the I/O
thread:
- io-ratio: Fraction of time that the I/O thread spent doing I/O
- io-wait-ratio: Fraction of time that the I/O thread spent waiting
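The user processing fraction can be computed directly from those two ratios; a
minimal sketch, with illustrative values:

```python
def user_processing_ratio(io_ratio, io_wait_ratio):
    """Fraction of the producer I/O thread's time spent in user processing
    (for example, send callbacks): neither doing I/O nor waiting for it."""
    return max(0.0, 1.0 - io_ratio - io_wait_ratio)

# If both ratios are low, the I/O thread is busy running user code:
busy = user_processing_ratio(0.05, 0.10)   # high, ~0.85
idle = user_processing_ratio(0.10, 0.85)   # low, ~0.05
```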
As with producers, depending on your Confluent Cloud service plan, you may be
limited to certain throughput rates for consume (read). If your client
applications exceed these consume rates, consider the same two options described
above: optimize the application’s throughput or upgrade to a cluster
configuration with higher limits.
To get throttling metrics per consumer, monitor the following client JMX metrics:
- fetch-throttle-time-avg: The average time in ms that a broker spent throttling a fetch request
- fetch-throttle-time-max: The maximum time in ms that a broker spent throttling a fetch request
You should also monitor your application’s
consumer lag, which is the number
of records for any partition that the consumer is behind in the log. This metric
is particularly important for real-time consumer applications where the consumer
should be processing the newest messages with as low latency as possible.
Monitoring consumer lag can indicate whether the consumer is able to fetch
records fast enough from the brokers. Also consider how the offsets are
committed. For example, exactly-once semantics (EOS) provide stronger guarantees
while potentially increasing consumer lag. You can monitor consumer lag from the
Confluent Cloud user interface, as described in the Monitor Consumer Lag
documentation. If you are capturing JMX metrics, you can monitor the following:
- records-lag-max: The maximum lag in terms of number of records for any
partition in this window. An increasing value over time is your best indication
that the consumer group isn’t keeping up with the producers.
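The lag computation itself is simple: for each partition, the log-end offset
minus the group’s committed offset. A minimal sketch (topic names and offset
values are illustrative; a real application would fetch both offset maps from
the cluster):

```python
def partition_lags(end_offsets, committed_offsets):
    """Per-partition lag: log-end offset minus the group's committed offset."""
    return {tp: end_offsets[tp] - committed_offsets.get(tp, 0)
            for tp in end_offsets}

def max_lag(end_offsets, committed_offsets):
    """The value to watch, analogous to records-lag-max: the maximum lag
    across all partitions."""
    lags = partition_lags(end_offsets, committed_offsets)
    return max(lags.values(), default=0)

# Illustrative offsets keyed by (topic, partition):
ends = {("orders", 0): 1200, ("orders", 1): 900}
committed = {("orders", 0): 1150, ("orders", 1): 900}
```

Sampling max_lag over time and alerting on a sustained upward trend gives the
same signal as the records-lag-max JMX metric.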