The recommended way to monitor Replicator is with Confluent Control Center and the Replicator monitoring extension. For more information about the extension, see Replicator Monitoring Extension.
Replicator Monitoring Extension
The Confluent Replicator Monitoring Extension exposes a REST API for collecting detailed metrics from Replicator tasks.
The endpoints provide the following information:
- Throughput - the number of messages replicated per second.
- Message Lag - the number of messages that have been produced to the origin cluster that have not yet been replicated to the destination.
- Latency - the average time period between message production to the origin cluster and message production to the destination cluster.
The metrics are broken down by connector, task and topic/partition.
These metrics have been designed to integrate with Control Center to provide a centralized view of replication within your clusters.
Activate the Extension
After the installation is complete, you must configure Kafka Connect to recognize the installed extension. First, add the extension library to the Connect worker classpath by setting the CLASSPATH environment variable.
export CLASSPATH=<path to replicator install>/replicator-rest-extension-<version>.jar
- The location of this REST extension JAR file is dependent on your platform and type of Confluent install.
For example, on Mac OS install with zip.tar, by default,
replicator-rest-extension-<version>.jar is in the Confluent directory in
If you install with the default Ansible settings, the JAR file is located in
- To make sure the CLASSPATH sticks, you might want to add it to your shell configuration file (for example,
.zshrc), source the profile in any open command shells, and check it with
Then, to activate the extension, add the following property in the configuration file that you use to start Replicator (for example,
With the extension activated, you can use Control Center to monitor replicators. See Use Control Center to monitor replicators in the Replicator Tutorial and Replicators in the Control Center User Guide.
Additional Configuration for Secure Endpoints
When the Connect distributed cluster hosting Replicator has a REST endpoint with SSL authentication enabled (https), you must configure security properties for the SSL keystore and truststore used by the Replicator monitoring extension to communicate with other Connect nodes in the cluster. These are set with the following environment variables at the JVM level:
export KAFKA_OPTS="-Djavax.net.ssl.trustStore=/etc/kafka/secrets/kafka.connect.truststore.jks"
You should also set the listeners.https. properties, as specified here, for example:
Replicator Monitoring Extension API Reference
Get Metrics for tasks running on this Connect worker for a given connector.
- connector_name (string) – the name of the connector
GET /WorkerMetrics/replicate-topic HTTP/1.1
HTTP/1.1 200 OK
Get Metrics for all Replicators running in this Connect cluster.
GET /ReplicatorMetrics HTTP/1.1
HTTP/1.1 200 OK
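As a minimal sketch of how the two endpoints above might be called from a shell, the helper below only builds the request URL; the base URL (http://localhost:8083) and the function name replicator_metrics_url are assumptions for illustration, not part of the extension.

```shell
# Assumption: the Connect worker REST API is reachable at this base URL.
CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"

# Hypothetical helper: print the metrics URL.
# With no argument, it targets all Replicators in the Connect cluster;
# with a connector name, it targets the tasks running on this worker.
replicator_metrics_url() {
  if [ -n "${1:-}" ]; then
    printf '%s/WorkerMetrics/%s\n' "$CONNECT_URL" "$1"
  else
    printf '%s/ReplicatorMetrics\n' "$CONNECT_URL"
  fi
}

# Usage (requires a running Connect worker with the extension activated):
#   curl -s "$(replicator_metrics_url)"
#   curl -s "$(replicator_metrics_url replicate-topic)"
```

Keeping the URL construction separate from the curl call makes it easy to point the same commands at a different worker by exporting CONNECT_URL.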
Monitoring JMX Metrics
Like Kafka brokers, Kafka Connect reports metrics via JMX. To monitor Connect and Replicator, set the JMX_PORT environment variable before starting the Connect Workers. Then collect the reported metrics using your usual monitoring tools. JMXTrans, Graphite and Grafana are a popular combination for collecting and reporting JMX metrics from Kafka.
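For instance, the variable could be set in the shell that launches the worker. This is only a sketch: the port number 9999 is an assumption, and any unused port on the worker host would do.

```shell
# Assumption: 9999 is a free port on the Connect worker host.
export JMX_PORT=9999

# Start the Connect worker from this same shell so it inherits JMX_PORT,
# for example:
#   connect-distributed <path to worker properties file>
```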
When you look at the metrics reported using JMX, you will see that Connect exposes Replicator’s consumer metrics and Connect’s producer metrics. You can view the full list of metrics in Monitoring Kafka. Here are some of the important metrics and their significance.
Important Replicator Metrics
- The id of the source cluster from which Replicator is replicating.
- The id of the destination cluster to which Replicator is replicating.
- The name of the destination topic to which Replicator is replicating.
- The number of messages that were produced to the origin cluster, but have not yet arrived to the destination cluster.
- The number of messages replicated per second from the source to destination cluster.
- The number of bytes replicated per second from the source to destination cluster.
- The average time between message production to the source cluster and message production to the destination cluster.
Important Producer Metrics
- If the io-ratio is low or io-wait-ratio is high, this means the producer is not very busy and is unlikely to be a bottleneck.
- Reports the producer throughput when writing to destination Kafka.
- If they are consistently close to the configured batch.size, you may be producing as fast as possible and you’ll want to increase the batch size to get better batching.
- The average per-second number of retried record sends and failed record sends for a topic. A high number of either can indicate issues writing to the destination cluster.
- Produce requests may be throttled to meet quotas configured on the destination cluster. If these are non-zero, it indicates that the destination brokers are slowing the producer down and the quotas configuration should be reviewed. For more information on quotas see Enforcing Client Quotas.
- Non-zero values here indicate memory pressure. Connect producers can’t send events fast enough, resulting in full memory buffers that cause Replicator threads to block.
Important Consumer Metrics
- If the io-ratio is low or io-wait-ratio is high, this means the consumer is not very busy and is unlikely to be a bottleneck.
- Indicates the throughput of Replicator reading events from the origin cluster.
- If they are close to the configured maximum fetch size consistently, it means that Replicator is reading at the maximum possible rate. Increase the maximum fetch size and check if the throughput per task is improved.
- The maximum lag in terms of number of records for any partition. An increasing value over time indicates that Replicator is not keeping up with the rate at which events are written to the origin cluster.
- If fetch-rate is high but fetch-size-avg and fetch-size-max are not close to the maximum configured fetch size, perhaps the consumer is “churning”. Try increasing the fetch.max.wait.ms configuration. This can help the consumer batch more efficiently.
- Fetch requests may be throttled to meet quotas configured on the origin cluster. If these are non-zero, it indicates that the origin brokers are slowing the consumer down and the quotas configuration should be reviewed. For more information on quotas see Enforcing Client Quotas.
Monitoring Replicator Lag
For Replicator version 5.4.0 and above, it is recommended that you use the previously mentioned JMX metrics to monitor Replicator lag because
they are more accurate than the consumer group lag tool. The following methodology to monitor Replicator lag is only recommended if you are
using a Replicator version below 5.4.0.
You can monitor Replicator lag by using the Consumer Group Command tool (kafka-consumer-groups). To use this functionality, you must set the Replicator offset.topic.commit config to true (the default value).
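The following sketch shows one way the tool's output could be reduced to a single lag figure. The function name sum_lag is hypothetical, and the broker address and group name in the usage comment are placeholders; the function locates the LAG column by its header because the column order of the describe output differs between versions.

```shell
# Hypothetical helper: sum the LAG column of
# `kafka-consumer-groups --describe` output read from stdin.
sum_lag() {
  awk '
    # Until the header row is seen, look for the column titled LAG.
    col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    # After the header, add up numeric lag values ("-" entries are skipped).
    $col ~ /^[0-9]+$/ { total += $col }
    END { print total + 0 }
  '
}

# Usage (placeholders are assumptions; fill in your own values):
#   kafka-consumer-groups --bootstrap-server <origin broker host:port> \
#     --describe --group <Replicator consumer group> | sum_lag
```

Sampling this total periodically shows whether lag is stable or growing over time.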
Replication lag is the number of messages that were produced to the origin cluster, but have not yet arrived to the destination cluster. It can also be measured as the amount of time it currently takes for a message to get replicated from origin to destination. Note that this can be higher than the latency between the two datacenters if Replicator is behind for some reason and needs time to catch up.
The main reasons to monitor replication lag are:
- If there is a need to fail over from origin to destination and the origin cannot be restored, all events that were produced to the origin and not replicated to the destination will be lost. (If the origin can be restored, the events will not be lost.)
- Any event processing that happens at the destination will be delayed by the lag.
The lag is typically just a few hundred milliseconds (depending on the network latency between the two datacenters), but
it can grow larger if network partitions or configuration changes temporarily pause replication and Replicator needs
to catch up. If the replication lag keeps growing, it indicates that Replicator throughput is lower than what gets produced
to the origin cluster and that additional Replicator tasks or Connect Workers are necessary. For example, this happens if
producers are writing 100 MBps to the origin cluster but Replicator replicates only 50 MBps.
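To make the arithmetic in that example concrete, here is a small sketch of how the unreplicated backlog grows when the produce rate exceeds the replication rate. The function name backlog_mb and the 60-second window are illustrative assumptions.

```shell
# Hypothetical helper: backlog growth in MB.
# Arguments: produce rate (MBps), replicate rate (MBps), elapsed seconds.
backlog_mb() {
  echo $(( ($1 - $2) * $3 ))
}

# 100 MBps produced, 50 MBps replicated, over one minute:
backlog_mb 100 50 60   # prints 3000
```

In other words, under these rates the backlog grows by 50 MB every second and never shrinks until replication throughput is raised above the produce rate.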
To increase throughput, increase the TCP socket buffer on Replicator and the brokers. When
Replicator is running in the destination cluster (recommended), you must also increase the following:
- The TCP send socket buffer (socket.send.buffer.bytes) on the source cluster brokers.
- The receive TCP socket buffer (socket.receive.buffer.bytes) on the consumers. A value of 512 KB is reasonable, but you may want to experiment with values up to 12 MB.
If you are using Linux, you might need to change the default socket buffer maximum for the Kafka settings to take
effect. For more information about tuning your buffers, see this article.
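As a rough sketch of the Linux check mentioned above: explicitly requested socket buffer sizes are capped by the kernel settings net.core.rmem_max (receive) and net.core.wmem_max (send). The helper name fits_kernel_max and the 512 KB target below are assumptions for illustration.

```shell
# Assumption: 512 KB, the starting value suggested above.
DESIRED_BYTES=$((512 * 1024))

# Hypothetical helper: succeed when the kernel maximum (first argument,
# in bytes) is large enough to hold the desired buffer size.
fits_kernel_max() {
  [ "$1" -ge "$DESIRED_BYTES" ]
}

# On Linux, the current maximums can be read with:
#   sysctl -n net.core.rmem_max net.core.wmem_max
# and checked with, for example:
#   fits_kernel_max "$(sysctl -n net.core.rmem_max)" || echo "raise net.core.rmem_max"
# Raising a maximum requires root, for example:
#   sysctl -w net.core.rmem_max=<bytes>
```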