Monitoring Cluster Metrics and Optimizing Links for Cluster Linking on Confluent Platform
Confluent Platform exposes several metrics through Java Management Extensions (JMX) that are useful for monitoring Cluster Linking. Some of these are derived from extending existing interfaces, and others are new for Cluster Linking.
To monitor Cluster Linking, set the JMX_PORT environment variable before starting the cluster, then collect the reported metrics using your usual monitoring tools. JMXTrans, Graphite, and Grafana are a popular combination for collecting and reporting JMX metrics from Kafka. Datadog is another popular monitoring solution.
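As a minimal sketch of the setup step above, the following Python helpers prepare an environment with JMX_PORT set and build the standard JMX-over-RMI service URL that a collector would connect to. The host name, port number, and function names are illustrative assumptions, not values taken from Confluent Platform.

```python
import os


def jmx_environment(port, base_env=None):
    """Return a copy of the environment with JMX_PORT set for the broker process."""
    env = dict(os.environ if base_env is None else base_env)
    env["JMX_PORT"] = str(port)
    return env


def jmx_service_url(host, port):
    """Build the standard JMX-over-RMI service URL for a host and port."""
    return f"service:jmx:rmi:///jndi/rmi://{host}:{port}/jmxrmi"


# Hypothetical values: pass `env` to whatever launches the broker,
# and give the URL to your JMX collector.
env = jmx_environment(9999, base_env={"PATH": "/usr/bin"})
print(env["JMX_PORT"])
print(jmx_service_url("broker1.example.com", 9999))
```

The environment must be applied to the broker process before it starts; changing it afterwards has no effect on a running broker.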
Wherever possible, the metrics below are categorized by their associated link name.
Also, you can set quotas to limit resources and bandwidth used by Cluster Linking to help optimize the system.
Cluster Link Fetcher Metrics
- kafka.server.link:type=ClusterLinkFetcherManager,name=MaxLag,clientId=ClusterLink,link-name={linkName}
- Maximum lag in messages between the replicas on the destination cluster and the leader replica on the source cluster. Provides an indication of overall cluster link health and a relative measure of whether the destination cluster is lagging behind.
- kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},topic={topic},partition={partition},link-name={linkName}
- Drill-down of MaxLag, which provides lag in number of messages per replica on the destination cluster from the leader replica on the source cluster. Useful to debug any specific lagging topic partitions/replicas on the destination cluster.
- kafka.server.link:type=ClusterLinkFetcherManager,name=DeadThreadCount,clientId=ClusterLink,link-name={linkName}
- Number of dead threads of type ClusterLinkThread, where a dead thread is indicative of an issue with fetching from the source cluster for a partition.
- kafka.server:type=FetcherStats,name=BytesPerSec,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},brokerHost={host},brokerPort={port},link-name={linkName}
- Rate at which data is fetched from the source cluster. Indicates amount of throughput in bytes per second on the cluster link.
- kafka.server:type=FetcherStats,name=RequestsPerSec,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},brokerHost={host},brokerPort={port},link-name={linkName}
- Rate at which requests are issued to the source cluster over the cluster link.
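To illustrate how the per-partition ConsumerLag metrics can be used to drill down from MaxLag, here is a minimal Python sketch that parses collected MBean names and ranks the most lagging partitions for one link. It assumes the samples have already been gathered by your monitoring tool into a dictionary of MBean name to lag value; the sample names and values are made up, and the simple comma split assumes no quoted property values.

```python
def parse_mbean(name):
    """Split 'domain:key=value,...' into (domain, {key: value})."""
    domain, _, props = name.partition(":")
    pairs = (item.split("=", 1) for item in props.split(","))
    return domain, {key.strip(): value.strip() for key, value in pairs}


def top_lagging_partitions(samples, link_name, limit=3):
    """Return (topic, partition, lag) tuples for one link, worst lag first."""
    rows = []
    for mbean, lag in samples.items():
        _, props = parse_mbean(mbean)
        if props.get("name") == "ConsumerLag" and props.get("link-name") == link_name:
            rows.append((props["topic"], int(props["partition"]), lag))
    return sorted(rows, key=lambda row: row[2], reverse=True)[:limit]


# Hypothetical samples keyed by MBean name.
prefix = "kafka.server:type=FetcherLagMetrics,name=ConsumerLag,"
samples = {
    prefix + "clientId=ClusterLinkFetcherThread-1-demo-link-0,topic=orders,partition=0,link-name=demo-link": 120,
    prefix + "clientId=ClusterLinkFetcherThread-1-demo-link-0,topic=orders,partition=1,link-name=demo-link": 4500,
    prefix + "clientId=ClusterLinkFetcherThread-1-other-link-0,topic=audit,partition=0,link-name=other-link": 9000,
}
for topic, partition, lag in top_lagging_partitions(samples, "demo-link"):
    print(topic, partition, lag)
```

Filtering on the link-name property rather than on the clientId avoids having to parse link names that themselves contain hyphens.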
Throttle Time Metrics for Cluster Link Fetchers
- kafka.server:type=cluster-link,link-name={linkName}
- Throttle time metrics may indicate that a cluster link connection is being throttled, which is useful for understanding why lag may be increasing over a cluster link.
Name | Description |
---|---|
fetch-throttle-time-avg, fetch-throttle-time-max | Gives throttle times for the cluster linking fetchers. May indicate increases in lag on the cluster link due to throttling/quotas being enforced. |
Cluster Link Network Client Metrics
- kafka.server:type=cluster-link-metadata-metrics,link-name={linkName}
- Metrics pertaining to the metadata refresh client in the cluster link.
- kafka.server:type=cluster-link-fetcher-metrics,link-name={linkName},broker-id={id},fetcher-id={id}
- Metrics pertaining to data fetcher requests in the cluster link.
Metrics seen through cluster-link-fetcher-metrics are shown in the following table.
Name | Description |
---|---|
connection-count | Number of connections for a cluster link |
connection-creation-rate, connection-creation-total | Rate per second and total for connections created for the cluster link. If the rate is high, the source cluster may be overloaded with connections. |
connection-close-rate, connection-close-total | Rate per second and total for connections closed for the cluster link. Can be compared to connection creation to understand the balance of creating/closing connections. |
incoming-byte-rate, incoming-byte-total | Number of bytes per second and total received from the source cluster. |
outgoing-byte-rate, outgoing-byte-total | Number of bytes per second and total sent to the source cluster. |
network-io-rate, network-io-total | Indicates rate and total network input/output (IO). |
request-rate, request-total | Rate per second and total requests issued to the source cluster over the cluster link. |
response-rate, response-total | Rate per second and total number of responses received from the source cluster over the cluster link. |
request-size-avg, request-size-max | Average and maximum size in bytes of requests issued to the source cluster over the cluster link. |
io-ratio, io-wait-time-ns-avg, io-waittime-total, iotime-total | Statistics of the destination cluster’s network IO for the requests over the cluster link. |
successful-authentication-rate, successful-authentication-total | Rate per second and total number of cluster link clients authenticating to the source cluster over the cluster link. |
successful-reauthentication-rate, successful-reauthentication-total | Rate per second and total re-authentication to the source cluster over the cluster link. |
failed-reauthentication-rate, failed-reauthentication-total | Rate per second and total failed re-authentications to the source cluster over the cluster link. If failures are present, it could indicate misconfigured or stale cluster link credentials. |
reauthentication-latency-avg | Average re-authentication latency. Helps to assess whether clients are taking too long to authenticate to the clusters. |
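The comparison between connection creation and connection closing mentioned in the table can be sketched as follows. This is an illustrative Python example that assumes you have two samples of connection-creation-total and connection-close-total taken a known number of seconds apart; the function names and sample values are hypothetical.

```python
def rate_per_second(earlier_total, later_total, interval_seconds):
    """Average per-second rate between two samples of a cumulative total."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (later_total - earlier_total) / interval_seconds


def net_connection_change(earlier, later):
    """Connections created minus connections closed between two samples.

    Each sample is a dict holding the two cumulative totals.
    """
    created = later["connection-creation-total"] - earlier["connection-creation-total"]
    closed = later["connection-close-total"] - earlier["connection-close-total"]
    return created - closed


# Hypothetical samples taken 30 seconds apart.
earlier = {"connection-creation-total": 100, "connection-close-total": 95}
later = {"connection-creation-total": 160, "connection-close-total": 150}

creation_rate = rate_per_second(
    earlier["connection-creation-total"], later["connection-creation-total"], 30
)
print(creation_rate, net_connection_change(earlier, later))
```

A creation rate that stays high while the net change hovers near zero suggests connections are being opened and closed repeatedly, which is the churn pattern worth investigating on the source cluster.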
Request Metrics for Cluster Link APIs
- kafka.network:type=RequestMetrics,name={LocalTimeMs | RemoteTimeMs | RequestBytes | RequestQueueTimeMs | ResponseQueueTimeMs | ResponseSendIoTimeMs | ResponseSendTimeMs | ThrottleTimeMs | TotalTimeMs},request={CreateClusterLinks | DeleteClusterLinks | ListClusterLinks}
- Depending on the metric name, provides statistics on requests on the cluster link, including request and response times, time requests wait in the queue, size of requests (bytes), and so forth.
- kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=ClusterLink
- Provides metrics on delayed operations on the cluster link.
Metrics on the Destination Cluster
Starting with Confluent Platform 7.0.0, the following broker metrics, specific to Cluster Linking, are available on the brokers in the destination cluster.
- kafka.server:type=cluster-link-metrics,name=mirror-partition-count,link-name={linkName}
- Number of actively mirrored partitions for which the broker is leader
- kafka.server.link:type=ClusterLinkFetcherManager,name=FailedPartitionsCount,clientId=ClusterLink,link-name={linkName}
- Number of failed mirror partitions for which the broker is leader. A partition may be in a failed state if the source partition’s epoch has gone backwards; for example, if the topic was recreated.
- kafka.server:type=ReplicaManager,name=UnderMinIsrMirrorPartitionCount
- Number of mirrored partitions that are under min ISR.
- kafka.server:type=ReplicaManager,name=BlockedOnMirrorSourcePartitionCount
- Number of mirrored partitions that are blocked on fetching data due to issues on the source cluster.
- kafka.server:type=ReplicaManager,name=UnderReplicatedMirrorPartitions
- Number of mirrored partitions that are under replicated.
- kafka.server:type=cluster-link-metrics,name=(linked-leader-epoch-change-rate, linked-leader-epoch-change-total),link-name={linkName}
- Rate per second and total number of times leader election was triggered on this broker due to source leader changes.
- Frequent triggers for leader election might indicate issues on the source cluster. This can be a useful metric during on-premises to Confluent Cloud migrations to identify if there are issues on the source cluster.
- kafka.server:type=cluster-link-metrics,name=(linked-topic-partition-addition-rate, linked-topic-partition-addition-total),link-name={linkName}
- Rate per second and total number of times the partition count was updated due to source changes.
- A high volume or rate of partition count updates might indicate issues on the source cluster.
- kafka.server:type=cluster-link-metrics,name=(consumer-offset-committed-rate, consumer-offset-committed-total),link-name={linkName}
- Rate per second and total number of consumer offsets committed for a cluster link.
- Helps verify that offset migration is happening correctly, which is critical for failover readiness.
- kafka.server:type=cluster-link-metrics,name=(topic-config-update-rate, topic-config-update-total),link-name={linkName}
- Rate per second and total number of topic configuration updates due to source topic configuration changes.
- Can help with troubleshooting and debugging issues where configurations are not propagated.
- kafka.server:type=cluster-link-metrics,name=(acls-added-rate, acls-added-total),link-name={linkName}
- Rate per second and total number of access control lists (ACLs) that are being added for each link.
- kafka.server:type=cluster-link-metrics,name=(acls-deleted-rate, acls-deleted-total),link-name={linkName}
- Rate per second and total number of access control lists (ACLs) that are being deleted for each link.
- kafka.server:type=cluster-link-metrics,state=Mirror,link-name={link_name},name=mirror-topic-count
- Total number of active (healthy) mirror topics.
Note
Known issue: This metric includes topics that are in the SOURCE_UNAVAILABLE state. As a temporary workaround, use kafka.server:type=cluster-link-metrics,mode={mode},state=active,link-name={linkName} and kafka.server:type=cluster-link-metrics,mode={mode},state=unavailable,link-name={linkName} to alert when the source is unavailable.
- kafka.server:type=cluster-link-metrics,mode={mode},state=active,link-name={linkName}
- Total number of active cluster links connected to the cluster. You can filter or group by direction (source or destination).
- kafka.server:type=cluster-link-metrics,mode={mode},state=paused,link-name={linkName}
- Total number of paused cluster links connected to the cluster. You can filter or group by direction (source or destination).
- kafka.server:type=cluster-link-metrics,mode={mode},state=unavailable,link-name={linkName}
- Total number of unavailable cluster links connected to the cluster. You can filter or group by direction (source or destination).
- kafka.server:type=cluster-link-metrics,state=PausedMirror,link-name={link_name},name=mirror-topic-count
- Total number of mirror topics for which mirroring from the source has been paused.
- kafka.server:type=cluster-link-metrics,state=PendingStoppedMirror,link-name={link_name},name=mirror-topic-count
- Total number of mirror topics whose mirroring has been temporarily stopped.
- kafka.server:type=cluster-link-metrics,state=FailedMirror,link-name={link_name},name=mirror-topic-count
- Total number of failed mirror topics; that is, mirroring cannot proceed.
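The destination-side counts above lend themselves to simple alert rules. The following Python sketch shows one possible rule set, including the known-issue workaround of checking the link state counts instead of trusting the Mirror topic count alone. It assumes the metric values have already been collected into plain dictionaries; the function name, dictionary layout, and alert wording are illustrative only.

```python
def mirror_health_alerts(mirror_topic_counts, link_state_counts):
    """Return a list of alert strings for one cluster link.

    mirror_topic_counts: mirror-topic-count values keyed by the `state` tag
        (Mirror, PausedMirror, PendingStoppedMirror, FailedMirror).
    link_state_counts: link counts keyed by link state
        (active, paused, unavailable).
    """
    alerts = []
    if link_state_counts.get("unavailable", 0) > 0:
        # Workaround for the known issue: the Mirror count still includes
        # topics in SOURCE_UNAVAILABLE, so rely on the link state instead.
        alerts.append("source unavailable: Mirror count may overstate healthy topics")
    failed = mirror_topic_counts.get("FailedMirror", 0)
    if failed > 0:
        alerts.append(f"{failed} failed mirror topic(s)")
    return alerts


# Hypothetical collected values for one link.
print(mirror_health_alerts({"Mirror": 10, "FailedMirror": 2}, {"active": 0, "unavailable": 1}))
print(mirror_health_alerts({"Mirror": 10}, {"active": 1}))
```

Paused and pending-stopped topics are deliberately not treated as alerts here, since those states are usually the result of an operator action; adjust the rules to match your own runbooks.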
Metrics on the Source Cluster
The source cluster (with Kafka or Confluent Server earlier than 6.1.0) is unaware of cluster links, but can monitor cluster link related usage by allocating link-specific credentials. Quota metrics for each credential are available in both Kafka and Confluent Platform when user quotas are enabled.
- kafka.server:type=Fetch,user={linkPrincipal},client-id={optionalLinkClientId}
- byte-rate
- throttle-time
- kafka.server:type=Request,user={linkPrincipal},client-id={optionalLinkClientId}
- request-time
- throttle-time
If the source cluster is Confluent Server with a version of 6.1.0 or higher, then you can also monitor cluster links on the source cluster with the following metrics.
- kafka.server:type=cluster-link-source-metrics,request={request},link-id={linkUUID}
- request-byte-total
- request-total
- response-byte-total
- response-time-ns-avg
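As a small worked example of using the source-side totals, the sketch below derives the average request size per request type from request-byte-total and request-total. The dictionary layout, request type names, and sample values are assumptions for illustration; only the two metric names come from the list above.

```python
def average_request_bytes(request_byte_total, request_total):
    """Average bytes per request, or 0.0 when no requests were recorded."""
    if request_total == 0:
        return 0.0
    return request_byte_total / request_total


# Hypothetical samples for one link, keyed by the `request` tag.
link_samples = {
    "Fetch": {"request-byte-total": 5_000_000, "request-total": 10_000},
    "Metadata": {"request-byte-total": 0, "request-total": 0},
}

for request_type, values in link_samples.items():
    avg = average_request_bytes(values["request-byte-total"], values["request-total"])
    print(request_type, avg)
```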
Setting Quotas on Resources and Bandwidth Usage
You can set various types of quotas to place caps on resources and bandwidth used for Cluster Linking on the source and destination clusters.
Destination Cluster Quotas
You can limit total usage for cluster links on each broker in the destination cluster by setting the broker config confluent.cluster.link.io.max.bytes.per.second.
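Because this limit applies per broker, a cluster-wide bandwidth target has to be divided across the destination brokers. The Python sketch below shows that arithmetic and renders the resulting broker property line. It assumes mirror partition leaders are spread evenly across brokers, which may not hold in practice; the target value and function names are hypothetical.

```python
def per_broker_link_quota(target_total_bytes_per_sec, broker_count):
    """Split a cluster-wide byte-rate target evenly across destination brokers."""
    if broker_count <= 0:
        raise ValueError("broker_count must be positive")
    return target_total_bytes_per_sec // broker_count


def broker_property_line(bytes_per_sec):
    """Render the broker configuration line for the per-broker limit."""
    return f"confluent.cluster.link.io.max.bytes.per.second={bytes_per_sec}"


# Hypothetical target: 300 MB/s across a 3-broker destination cluster.
quota = per_broker_link_quota(300_000_000, 3)
print(broker_property_line(quota))
```

If leadership is skewed, the effective cluster-wide throughput will be lower than the target, because heavily loaded brokers hit their individual limit first.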
Source Cluster Quotas
The source cluster is unaware of cluster links, but can limit usage for cluster links by allocating link-specific credentials and assigning quotas for the link principal.
- Fetch byte-rate (replication throughput) for each cluster link principal
- Request rate quota (CPU/thread usage) for each cluster link principal. This also includes CPU usage for metadata and configuration sync as well as ACL and consumer offset migration.
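One way to apply both quotas to a link principal is through standard Kafka user quotas. The sketch below only assembles the argument list for the kafka-configs tool and does not run it. It assumes the fetch byte-rate quota maps to the consumer_byte_rate setting and the request rate quota to request_percentage, as with ordinary Kafka clients; the bootstrap address, principal name, and quota values are hypothetical.

```python
def link_principal_quota_args(bootstrap, principal, fetch_bytes_per_sec, request_percentage):
    """Build kafka-configs arguments that set user quotas for a link principal."""
    config = (
        f"consumer_byte_rate={fetch_bytes_per_sec},"
        f"request_percentage={request_percentage}"
    )
    return [
        "kafka-configs",
        "--bootstrap-server", bootstrap,
        "--alter",
        "--add-config", config,
        "--entity-type", "users",
        "--entity-name", principal,
    ]


# Hypothetical values; review the command before running it on the source cluster.
args = link_principal_quota_args("source-broker:9092", "link-user", 10_485_760, 50)
print(" ".join(args))
```

Keeping the link on its own principal is what makes this work: the quota then throttles only cluster link traffic and leaves other clients of the source cluster unaffected.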
Suggested Reading
- See Monitoring Kafka with JMX for a full list of metrics for Confluent Platform.
- See the Apache Kafka® documentation on metrics reporting and monitoring through JMX endpoints.