Monitor Cluster Metrics and Optimize Links for Cluster Linking on Confluent Platform

Confluent Platform exposes several metrics through Java Management Extensions (JMX) that are useful for monitoring Cluster Linking. Some of these are derived from extending existing interfaces, and others are new for Cluster Linking.

To monitor Cluster Linking, set the JMX_PORT environment variable before starting the cluster, then collect the reported metrics using your usual monitoring tools. JMXTrans, Graphite, and Grafana are a popular combination for collecting and reporting JMX metrics from Kafka. Datadog is another popular monitoring solution.

Wherever possible, the metrics below are categorized by their associated link name.

Also, you can set quotas to limit resources and bandwidth used by Cluster Linking to help optimize the system.

Cluster link fetcher metrics

kafka.server.link:type=ClusterLinkFetcherManager,name=MaxLag,clientId=ClusterLink,link-name={linkName}: Maximum lag in messages between the replicas on the destination cluster and the leader replica on the source cluster. Provides an indication of overall cluster link health and a relative measure of whether the destination cluster is lagging behind.
kafka.server.link:type=ClusterLinkFetcherManager,name=FetchTimeMax,clientId=ClusterLink,link-name={linkName} with state=fetch: This metric shows the longest time gap between successful data fetches for a partition. The clock starts when a partition is ready to be used and stops when a successful fetch occurs (even if no new data comes in). If the time since the last fetch is longer than the delay from the previous fetch, that longer period is used. Partitions with unavailable, failed, or degraded links are not counted. Normally, the fetch interval is low. If you notice a consistently high interval, it could signal an underlying issue.
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},topic={topic},partition={partition},link-name={linkName}: Drill-down of MaxLag, which provides lag in number of messages per replica on the destination cluster from the leader replica on the source cluster. Useful to debug any specific lagging topic partitions/replicas on the destination cluster.
kafka.server.link:type=ClusterLinkFetcherManager,name=DeadThreadCount,clientId=ClusterLink,link-name={linkName}: Number of dead threads of type ClusterLinkThread, where a dead thread is indicative of an issue with fetching from the source cluster for a partition.
kafka.server.link:type=ClusterLinkFetcherManager,name=DegradedPartitionCount,clientId=ClusterLink,link-name={linkName}: Number of partitions that have been moved to a degraded state, typically due to a non-critical error after the retry timeout. These include user errors, such as authorization failures, and link availability failures, as indicated by the reason tag. A sustained value greater than 0 beyond multiple metadata intervals and the availability timeout may indicate an issue, indicated by the reason tag internal.
kafka.server:type=FetcherStats,name=BytesPerSec,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},brokerHost={host},brokerPort={port},link-name={linkName}: Rate at which data is fetched from the source cluster. Indicates amount of throughput in bytes per second on the cluster link.
kafka.server:type=FetcherStats,name=RequestsPerSec,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},brokerHost={host},brokerPort={port},link-name={linkName}: Rate at which requests are issued to the source cluster over the cluster link.
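
The per-partition ConsumerLag metric is most useful when it is grouped by link and sorted. The following is a minimal Python sketch of that drill-down, assuming your monitoring tool can export readings keyed by MBean name. The helper names (`parse_object_name`, `worst_partitions`) and the sample link, topic, and broker IDs are hypothetical, and the parser is a simplification that does not handle quoted values.

```python
def parse_object_name(object_name):
    """Split 'domain:key=value,...' into (domain, {key: value}).

    Simplification: assumes values are unquoted and contain no commas.
    """
    domain, _, props = object_name.partition(":")
    properties = {}
    for pair in props.split(","):
        key, _, value = pair.partition("=")
        properties[key.strip()] = value.strip()
    return domain, properties


def worst_partitions(lag_by_mbean, link_name, top_n=3):
    """Return up to top_n (topic, partition, lag) tuples for one link, highest lag first."""
    rows = []
    for name, lag in lag_by_mbean.items():
        _, props = parse_object_name(name)
        if props.get("name") == "ConsumerLag" and props.get("link-name") == link_name:
            rows.append((props["topic"], int(props["partition"]), lag))
    return sorted(rows, key=lambda row: row[2], reverse=True)[:top_n]


# Hypothetical readings keyed by MBean name; all names and values are made up.
prefix = (
    "kafka.server:type=FetcherLagMetrics,name=ConsumerLag,"
    "clientId=ClusterLinkFetcherThread-1-demo-link-2"
)
samples = {
    f"{prefix},topic=orders,partition=0,link-name=demo-link": 120,
    f"{prefix},topic=orders,partition=1,link-name=demo-link": 4500,
    f"{prefix},topic=audit,partition=0,link-name=other-link": 9000,
}
print(worst_partitions(samples, "demo-link"))
```

Filtering on the link-name property rather than on the clientId string avoids having to pick the link name apart from the broker IDs embedded in the client ID.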

Throttle time metrics for cluster link fetchers

kafka.server:type=cluster-link,link-name={linkName}: Throttle time metrics may indicate that a cluster link connection is being throttled, which is useful for understanding why lag may be increasing over a cluster link.

Name	Description
`fetch-throttle-time-avg`, `fetch-throttle-time-max`	Gives throttle times for the cluster linking fetchers. May indicate increases in lag on the cluster link due to throttling/quotas being enforced.

Cluster link network client metrics

kafka.server:type=cluster-link-metadata-metrics,link-name={linkName}: Metrics pertaining to the metadata refresh client in the cluster link.
kafka.server:type=cluster-link-fetcher-metrics,link-name={linkName},broker-id={id},fetcher-id={id}: Metrics pertaining to data fetcher requests in the cluster link.

Metrics seen through cluster-link-fetcher-metrics are shown in the following table.

Name	Description
`connection-count`	Number of connections for a cluster link
`connection-creation-rate`, `connection-creation-total`	Rate per second and total for connections created for the cluster link. If the rate is high, the source cluster may be overloaded with connections.
`connection-close-rate`, `connection-close-total`	Rate per second and total for connections closed for the cluster link. Can be compared to connection creation to understand the balance of creating/closing connections.
`incoming-byte-rate`, `incoming-byte-total`	Number of bytes per second and total received from the source cluster.
`outgoing-byte-rate`, `outgoing-byte-total`	Number of bytes per second and total sent to the source cluster.
`network-io-rate`, `network-io-total`	Indicates rate and total network input/output (IO).
`request-rate`, `request-total`	Rate per second and total requests issued to the source cluster over the cluster link.
`response-rate`, `response-total`	Rate per second and total number of responses received from the source cluster over the cluster link.
`request-size-avg`, `request-size-max`	Average and maximum size in bytes of requests issued to the source cluster over the cluster link.
`io-ratio`, `io-wait-time-ns-avg`, `io-waittime-total`, `iotime-total`	Statistics of the destination cluster’s network IO for the requests over the cluster link.
`successful-authentication-rate`, `successful-authentication-total`	Rate per second and total number of cluster link clients authenticating to the source cluster over the cluster link.
`successful-reauthentication-rate`, `successful-reauthentication-total`	Rate per second and total number of successful re-authentications to the source cluster over the cluster link.
`failed-reauthentication-rate`, `failed-reauthentication-total`	Rate per second and total failed re-authentications to the source cluster over the cluster link. If failures are present, it could indicate misconfigured or stale cluster link credentials.
`reauthentication-latency-avg`	Average re-authentication latency. Helps to assess whether clients are taking too long to authenticate to the clusters.
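
As the table notes, the connection creation and close metrics are most informative when compared with each other. The following Python sketch shows one way to make that comparison. The function name and the `churn_threshold` parameter are hypothetical, and the threshold value is an illustrative assumption rather than a Confluent recommendation.

```python
def connection_summary(creation_rate, close_rate, churn_threshold):
    """Compare connection-creation-rate and connection-close-rate for one fetcher.

    churn_threshold is a hypothetical connections-per-second limit chosen by
    the operator; it is not a Confluent default.
    """
    return {
        # Positive: connections are accumulating; negative: they are draining.
        "net_rate": creation_rate - close_rate,
        # Both rates high at once suggests connections are repeatedly opened and dropped.
        "churning": creation_rate > churn_threshold and close_rate > churn_threshold,
    }


print(connection_summary(creation_rate=2.5, close_rate=2.0, churn_threshold=1.0))
```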

Request metrics for cluster link APIs

kafka.network:type=RequestMetrics,name={LocalTimeMs | RemoteTimeMs | RequestBytes | RequestQueueTimeMs | ResponseQueueTimeMs | ResponseSendIoTimeMs | ResponseSendTimeMs | ThrottleTimeMs | TotalTimeMs},request={CreateClusterLinks | DeleteClusterLinks | ListClusterLinks}: Depending on the metric and request name, provides statistics on cluster link API requests, including request and response times, the time requests wait in the queue, the size of requests in bytes, and so forth.
kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=ClusterLink: Provides metrics on delayed operations on the cluster link.

Task status metrics

kafka.server:type=cluster-link-metrics,name=link-task-count,link-name={linkName},task-name={taskName},state={state},reason={reason},mode={mode}: Monitor the state of link level tasks. For example, monitor if consumer offset syncing is working. If the task is in error, a reason code is provided. You can set up alerts to trigger if errors occur.
kafka.server:type=cluster-link-metrics,name=mirror-transition-in-error,link-name={linkName},state={state},reason={reason}: Monitor mirror topic state transition errors. For example, this metric reports a mirror topic that encounters errors during the promotion process; that is, while its state is pending_stopped and it is being transitioned to stopped.
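
An alert on link-task-count can be sketched as a filter over the metric's tags. The Python example below assumes readings arrive as (tags, value) pairs. The function name, the task names, and the state and reason strings in the sample data are illustrative assumptions, so check the values your brokers actually report before relying on them.

```python
def tasks_in_error(samples, error_states):
    """Return one message per link task whose state is in error_states and whose count is > 0.

    samples: iterable of (tags, value), where tags holds the MBean key properties.
    error_states: state values to treat as errors (supplied by the caller).
    """
    alerts = []
    for tags, value in samples:
        if tags.get("name") != "link-task-count":
            continue
        if tags.get("state") in error_states and value > 0:
            alerts.append(
                f"link={tags.get('link-name')} "
                f"task={tags.get('task-name')} "
                f"reason={tags.get('reason')}"
            )
    return alerts


# Hypothetical readings; task names, states, and reasons are assumed values.
readings = [
    ({"name": "link-task-count", "link-name": "demo-link",
      "task-name": "ConsumerOffsetSync", "state": "in_error",
      "reason": "AUTHORIZATION_ERROR", "mode": "destination"}, 1),
    ({"name": "link-task-count", "link-name": "demo-link",
      "task-name": "AclSync", "state": "active",
      "reason": "NO_ERROR", "mode": "destination"}, 1),
]
for message in tasks_in_error(readings, error_states={"in_error"}):
    print(message)
```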

Metrics on the destination cluster

Starting with Confluent Platform 7.0.0, the following broker metrics, specific to Cluster Linking, are available on the brokers in the destination cluster.

kafka.server:type=cluster-link-metrics,name=mirror-partition-count,link-name={linkName}

Number of actively mirrored partitions for which the broker is leader

kafka.server.link:type=ClusterLinkFetcherManager,name=FailedPartitionsCount,clientId=ClusterLink,link-name={linkName}

Number of failed mirror partitions for which the broker is leader. A partition may be in a failed state if the source partition’s epoch has gone backwards; for example, if the topic was recreated.

kafka.server:type=ReplicaManager,name=UnderMinIsrMirrorPartitionCount

Number of mirrored partitions that are under min ISR.

kafka.server:type=ReplicaManager,name=UnderReplicatedMirrorPartitions

Number of mirrored partitions that are under replicated.

kafka.server:type=cluster-link-metrics,name=(linked-leader-epoch-change-rate, linked-leader-epoch-change-total),link-name={linkName}

Rate per second and total number of times leader election was triggered on this broker due to source leader changes.

Frequent triggers for leader election might indicate issues on the source cluster. This can be a useful metric during on-premises to Confluent Cloud migrations to identify if there are issues on the source cluster.

kafka.server:type=cluster-link-metrics,name=(linked-topic-partition-addition-rate, linked-topic-partition-addition-total),link-name={linkName}

Rate per second and total number of times the partition count was updated due to source changes.

A high volume or rate of partition count updates might indicate issues on the source cluster.

kafka.server:type=cluster-link-metrics,state=Mirror,link-name={link_name},name=mirror-topic-count

Total number of active (healthy) mirror topics.

Note

Known issue: This metric includes topics that are in the SOURCE_UNAVAILABLE state. As a temporary workaround, use kafka.server:type=cluster-link-metrics,mode={mode},state=active,link-name={linkName} and kafka.server:type=cluster-link-metrics,mode={mode},state=unavailable,link-name={linkName} to alert on source absence.

kafka.server:type=cluster-link-metrics,mode={mode},state=active,link-name={linkName}

Total number of active cluster links connected to the cluster. You can filter or group by direction (source or destination).

kafka.server:type=cluster-link-metrics,mode={mode},state=paused,link-name={linkName}

Total number of paused cluster links connected to the cluster. You can filter or group by direction (source or destination).

kafka.server:type=cluster-link-metrics,mode={mode},state=unavailable,link-name={linkName}

Total number of unavailable cluster links connected to the cluster. You can filter or group by direction (source or destination).

kafka.server:type=cluster-link-metrics,state=PausedMirror,link-name={link_name},name=mirror-topic-count

Total number of mirror topics for which mirroring from the source has been paused.

kafka.server:type=cluster-link-metrics,state=PendingStoppedMirror,link-name={link_name},name=mirror-topic-count

Total number of mirror topics whose mirroring has been temporarily stopped.

kafka.server:type=cluster-link-metrics,state=FailedMirror,link-name={link_name},name=mirror-topic-count

Total number of failed mirror topics; that is, mirroring cannot proceed.

kafka.server:type=cluster-link-metrics,state=active,link-name={link_name},name=link-count

Total number of links in an active state.

kafka.server:type=cluster-link-metrics,state=paused,link-name={link_name},name=link-count

Total number of links in a paused state.

kafka.server:type=cluster-link-metrics,state=unavailable,link-name={link_name},name=link-count

Total number of links in an unavailable state, where the source may not be available or the link is having trouble fetching.

kafka.server:type=cluster-link-metrics,link-name={link_name},name=broker-failed-link-count

Total number of links in a failed state. This metric is reported on a per-broker basis, because some or all brokers could have a link that is in a failed state and unable to connect to the source.

Metrics on the source cluster

The source cluster (with Kafka or Confluent Server earlier than 6.1.0) is unaware of cluster links, but you can monitor cluster link related usage by allocating link-specific credentials. Quota metrics for each credential are available in both Kafka and Confluent Platform when user quotas are enabled.

kafka.server:type=Fetch,user={linkPrincipal},client-id={optionalLinkClientId}

byte-rate
throttle-time

kafka.server:type=Request,user={linkPrincipal},client-id={optionalLinkClientId}

request-time
throttle-time

If the source cluster is Confluent Server with a version of 6.1.0 or higher, then you can also monitor cluster links on the source cluster with the following metrics.

kafka.server:type=cluster-link-source-metrics,request={request},link-id={linkUUID}

request-byte-total
request-total
response-byte-total
response-time-ns-avg
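
The `-total` metrics above are easier to read as rates over a scrape interval. The following Python sketch derives per-interval figures from two consecutive scrapes. It assumes the totals are cumulative counters that were not reset between scrapes (for example, by a broker restart). The function name and sample numbers are hypothetical.

```python
def interval_stats(previous, current, interval_seconds):
    """Derive per-interval figures from two scrapes of the cumulative *-total metrics."""
    requests = current["request-total"] - previous["request-total"]
    request_bytes = current["request-byte-total"] - previous["request-byte-total"]
    response_bytes = current["response-byte-total"] - previous["response-byte-total"]
    return {
        "requests_per_sec": requests / interval_seconds,
        "response_bytes_per_sec": response_bytes / interval_seconds,
        # Guard against an idle interval with no requests.
        "avg_request_bytes": request_bytes / requests if requests else 0.0,
        # The average is reported in nanoseconds; convert to milliseconds.
        "response_time_ms_avg": current["response-time-ns-avg"] / 1_000_000,
    }


# Hypothetical scrapes taken 60 seconds apart.
earlier = {"request-total": 100, "request-byte-total": 10_000,
           "response-byte-total": 500_000, "response-time-ns-avg": 2_000_000}
later = {"request-total": 160, "request-byte-total": 16_000,
         "response-byte-total": 800_000, "response-time-ns-avg": 2_500_000}
print(interval_stats(earlier, later, interval_seconds=60))
```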

Setting quotas on resources and bandwidth usage

You can set various types of quotas to place caps on resources and bandwidth used for Cluster Linking on the source and destination clusters.

Destination cluster quotas

You can limit total usage for cluster links on each broker in the destination cluster by setting the broker config confluent.cluster.link.io.max.bytes.per.second.
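
Because this limit applies to each broker, a cluster-wide ceiling has to be divided across the brokers. The Python sketch below works through that arithmetic. The function name, the 100 MiB/s target, and the four-broker cluster are illustrative assumptions. Dividing evenly guarantees that the aggregate stays at or below the target, but individual brokers may hit their limit earlier if mirror partition leaders are unevenly distributed.

```python
def per_broker_link_limit(cluster_wide_bytes_per_sec, broker_count):
    """Divide a cluster-wide byte-rate ceiling evenly across destination brokers."""
    if broker_count <= 0:
        raise ValueError("broker_count must be positive")
    # Round down so the sum of the per-broker limits never exceeds the target.
    return cluster_wide_bytes_per_sec // broker_count


# Hypothetical target: 100 MiB/s across a 4-broker destination cluster.
limit = per_broker_link_limit(100 * 1024 * 1024, 4)
print(f"confluent.cluster.link.io.max.bytes.per.second={limit}")
```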

Source cluster quotas

The source cluster is unaware of cluster links, but can limit usage for cluster links by allocating link-specific credentials and assigning quotas for the link principal.

Fetch byte-rate (replication throughput) for each cluster link principal
Request rate quota (CPU/thread usage) for each cluster link principal. This also includes CPU usage for metadata and configuration sync as well as ACL and consumer offset migration.
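
One way to apply both quotas to a link principal is with the kafka-configs tool, using the standard Kafka user quota settings `consumer_byte_rate` (fetch byte rate) and `request_percentage` (request rate). The Python sketch below only assembles the command line. The function name, bootstrap address, principal name, and quota values are hypothetical, and a secured cluster would also need client authentication options that are omitted here.

```python
import shlex


def link_quota_command(bootstrap_server, link_principal,
                       fetch_bytes_per_sec, request_percentage):
    """Build a kafka-configs command that sets user quotas for a link principal."""
    quota = (
        f"consumer_byte_rate={fetch_bytes_per_sec},"
        f"request_percentage={request_percentage}"
    )
    return [
        "kafka-configs",
        "--bootstrap-server", bootstrap_server,
        "--alter",
        "--add-config", quota,
        "--entity-type", "users",
        "--entity-name", link_principal,
    ]


# Hypothetical values: 10 MiB/s fetch quota and a 50% request quota.
command = link_quota_command("source-broker:9092", "cluster-link-user",
                             10 * 1024 * 1024, 50)
print(shlex.join(command))
```

Kafka enforces user quotas per broker, so the values apply to the link principal's traffic on each source broker rather than to the cluster as a whole.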