Monitor Cluster Metrics and Optimize Links for Cluster Linking on Confluent Platform
Confluent Platform exposes several metrics through Java Management Extensions (JMX) that are useful for monitoring Cluster Linking. Some of these are derived from extending existing interfaces, and others are new for Cluster Linking.
To monitor Cluster Linking, set the JMX_PORT environment variable before starting the cluster, then collect the reported metrics using your usual monitoring tools. JMXTrans, Graphite, and Grafana are a popular combination for collecting and reporting JMX metrics from Kafka. Datadog is another popular monitoring solution.
Quick Start for Administrators
- For basic health monitoring, focus on the link-count (state=active) and mirror-topic-count metrics.
- For performance monitoring, track MaxLag, BytesPerSec, and throttle time metrics.
- For troubleshooting, monitor link-task-count (state=in-error) and connection failure metrics.
- See the complete reference section for all available metrics.
- For comprehensive lists of all possible state and reason code values, see the Enumerated Tag Values Reference section.
Also, you can set quotas to limit resources and bandwidth used by Cluster Linking to help optimize the system.
Monitoring brokers
Keep in mind that on KRaft these metrics apply only to brokers (not to the KRaft controller quorum). The broker that is the link coordinator will properly reflect the state of the link.
For example, the JMX MBean for Cluster Linking status is kafka.server:type=cluster-link-metrics,mode={mode},state={state},link-name={linkName}. The state can be active, paused, or unavailable. This MBean represents the total number of links in that specific state.
For the state=active MBean:
- Link coordinator broker: Returns 1 if the cluster link is active and managed by this broker
- Non-coordinator brokers: Return 0
Example: kafka.server:type=cluster-link-metrics,mode=source,state=active,link-name=my-link returns 1 on the link coordinator, 0 on other brokers.
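As a rough illustration of the coordinator behavior described above, the following sketch takes one reading of the state=active MBean per broker and reports which brokers return 1. The broker IDs and the readings dictionary are hypothetical; how the values are collected (JMXTrans, Datadog, or another tool) is left to your monitoring setup.

```python
from typing import Dict, List


def brokers_reporting_active(active_by_broker: Dict[int, int]) -> List[int]:
    """Return the broker IDs whose state=active MBean reads 1.

    active_by_broker maps a (hypothetical) broker ID to the value read from
    kafka.server:type=cluster-link-metrics,mode=...,state=active,link-name=...
    """
    return sorted(b for b, value in active_by_broker.items() if value == 1)


def link_is_active(active_by_broker: Dict[int, int]) -> bool:
    """Treat the link as active if any broker (the coordinator) reports 1."""
    return len(brokers_reporting_active(active_by_broker)) > 0


# Hypothetical readings from a three-broker cluster.
readings = {1: 0, 2: 1, 3: 0}
print(brokers_reporting_active(readings), link_is_active(readings))
```

Because non-coordinator brokers return 0, a check against a single broker can report a healthy link as inactive; aggregating across brokers avoids that.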
In previous Confluent Platform releases that supported ZooKeeper (pre-8.0), the ZooKeeper “controller” broker acted as the link coordinator, so only that broker held the state of cluster links. The JMX MBean for Cluster Linking status against the “controller” would return 1 if the cluster link was active, but would return 0 on any other broker. ZooKeeper is no longer available on Confluent Platform 8.0 and later. To learn more, see KRaft and ZooKeeper in “What’s supported”.
Cluster link fetcher metrics
- kafka.server.link:type=ClusterLinkFetcherManager,name=MaxLag,clientId=ClusterLink,link-name={linkName}
Maximum lag in messages between the replicas on the destination cluster and the leader replica on the source cluster. Provides an indication of overall cluster link health and a relative measure of whether the destination cluster is lagging behind.
- kafka.server.link:type=ClusterLinkFetcherManager,name=FetchTimeMax,clientId=ClusterLink,link-name={linkName},state={state}
This metric shows the longest time gap between successful data fetches for a partition. The clock starts when a partition is ready to be used and stops when a successful fetch occurs (even if no new data comes in). If the time since the last fetch is longer than the delay from the previous fetch, that longer period is used. Partitions with unavailable, failed, or degraded links are not counted. Normally, the fetch interval is low. If you notice a consistently high interval, it could signal an underlying issue.
Available tags:
state: The fetch state (fetch for active fetching, other values for different states)
- kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},topic={topic},partition={partition},link-name={linkName}
Drill-down of MaxLag, which provides lag in number of messages per replica on the destination cluster from the leader replica on the source cluster. Useful to debug any specific lagging topic partitions/replicas on the destination cluster.
- kafka.server.link:type=ClusterLinkFetcherManager,name=DeadThreadCount,clientId=ClusterLink,link-name={linkName}
Number of dead threads of type ClusterLinkThread, where a dead thread is indicative of an issue with fetching from the source cluster for a partition.
- kafka.server.link:type=ClusterLinkFetcherManager,name=DegradedPartitionCount,clientId=ClusterLink,link-name={linkName}
Number of partitions that have been moved to degraded state, typically due to a non-critical error after retry timeout. These include user errors, such as authorization failures, and link availability failures, as indicated by the reason tag. A sustained value greater than 0 beyond multiple metadata intervals and the availability timeout may indicate an issue, depicted by the reason tag internal.
- kafka.server:type=FetcherStats,name=BytesPerSec,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},brokerHost={host},brokerPort={port},link-name={linkName}
Rate at which data is fetched from the source cluster. Indicates amount of throughput in bytes per second on the cluster link.
- kafka.server:type=FetcherStats,name=RequestsPerSec,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},brokerHost={host},brokerPort={port},link-name={linkName}
Rate at which requests are issued to the source cluster over the cluster link.
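Several of the MBeans above embed the destination broker, link name, and source broker in the clientId tag, using the pattern ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId}. The sketch below splits such a value into its parts so that per-thread metrics can be grouped by link or broker. It assumes broker IDs are plain integers without hyphens, and it splits from both ends because link names can themselves contain hyphens. The function and type names are illustrative only.

```python
from typing import NamedTuple

PREFIX = "ClusterLinkFetcherThread-"


class FetcherClientId(NamedTuple):
    dest_broker_id: int
    link_name: str
    source_broker_id: int


def parse_fetcher_client_id(client_id: str) -> FetcherClientId:
    """Split a fetcher clientId into destination broker, link name, source broker."""
    if not client_id.startswith(PREFIX):
        raise ValueError(f"not a cluster link fetcher clientId: {client_id!r}")
    rest = client_id[len(PREFIX):]
    # The first field is the destination broker ID, the last is the source broker ID.
    dest, _, remainder = rest.partition("-")
    link_name, _, source = remainder.rpartition("-")
    if not (dest and link_name and source):
        raise ValueError(f"unexpected clientId layout: {client_id!r}")
    return FetcherClientId(int(dest), link_name, int(source))


print(parse_fetcher_client_id("ClusterLinkFetcherThread-3-my-link-101"))
```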
Throttle time metrics for cluster link fetchers
- kafka.server:type=cluster-link,link-name={linkName}
Throttle time metrics may indicate that a cluster link connection is being throttled, which is useful for understanding why lag may be increasing over a cluster link.
| Name | Description |
|---|---|
| fetch-throttle-time-avg | Gives throttle times for the Cluster Linking fetchers. May indicate increases in lag on the cluster link due to throttling/quotas being enforced. |
Cluster link network client metrics
- kafka.server:type=cluster-link-metadata-metrics,link-name={linkName}
Metrics pertaining to the metadata refresh client in the cluster link.
- kafka.server:type=cluster-link-fetcher-metrics,link-name={linkName},broker-id={id},fetcher-id={id}
Metrics pertaining to data fetcher requests in the cluster link.
Metrics seen through cluster-link-fetcher-metrics are shown in the following table.
| Name | Description |
|---|---|
| | Number of connections for a cluster link |
| | Rate per second and total for connections created for the cluster link. If the rate is high, the source cluster may be overloaded with connections. |
| | Rate per second and total for connections closed for the cluster link. Can be compared to connection creation to understand the balance of creating/closing connections. |
| | Number of bytes per second and total received from the source cluster. |
| | Number of bytes per second and total sent to the source cluster. |
| | Indicates rate and total network input/output (IO). |
| | Rate per second and total requests issued to the source cluster over the cluster link. |
| | Rate per second and total number of responses received from the source cluster over the cluster link. |
| | Average and maximum size in bytes of requests issued to the source cluster over the cluster link. |
| | Statistics of the destination cluster’s network IO for the requests over the cluster link. |
| | Rate per second and total number of cluster link clients authenticating to the source cluster over the cluster link. |
| | Rate per second and total re-authentication to the source cluster over the cluster link. |
| | Rate per second and total failed re-authentications to the source cluster over the cluster link. If failures are present, it could indicate misconfigured or stale cluster link credentials. |
| | Average re-authentication latency. Helps to assess whether clients are taking too long to authenticate to the clusters. |
Request metrics for cluster link APIs
- kafka.network:type=RequestMetrics,name={LocalTimeMs | RemoteTimeMs | RequestBytes | RequestQueueTimeMs | ResponseQueueTimeMs | ResponseSendIoTimeMs | ResponseSendTimeMs | ThrottleTimeMs | TotalTimeMs},request={CreateClusterLinks | DeleteClusterLinks | ListClusterLinks}
Depending on the request name, provides statistics on cluster link requests, including request and response times, the time requests wait in the queue, the size of requests (bytes), and so forth.
- kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=ClusterLink
Provides metrics on delayed operations on the cluster link.
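The RequestMetrics pattern earlier in this section stands for one MBean per combination of metric name and request type. As a convenience when configuring a collector, the sketch below expands the pattern into the individual object names. The helper name is hypothetical; the metric and request names are the ones listed in the pattern.

```python
from itertools import product
from typing import List

METRIC_NAMES = [
    "LocalTimeMs", "RemoteTimeMs", "RequestBytes", "RequestQueueTimeMs",
    "ResponseQueueTimeMs", "ResponseSendIoTimeMs", "ResponseSendTimeMs",
    "ThrottleTimeMs", "TotalTimeMs",
]
REQUESTS = ["CreateClusterLinks", "DeleteClusterLinks", "ListClusterLinks"]


def request_metric_object_names() -> List[str]:
    """Expand the name/request pattern into concrete JMX object names."""
    return [
        f"kafka.network:type=RequestMetrics,name={name},request={request}"
        for name, request in product(METRIC_NAMES, REQUESTS)
    ]


for object_name in request_metric_object_names():
    print(object_name)
```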
Task status metrics
- kafka.server:type=cluster-link-metrics,name=link-task-count,link-name={linkName},task-name={taskName},state={state},reason={reason},mode={mode},connection-mode={connection_mode}
Monitor the state of link level tasks. For example, monitor if consumer offset syncing is working. If the task is in error, a reason code is provided. You can set up alerts to trigger if errors occur.
Available tags:
task-name: The specific task being monitored. Possible values:
  - consumer-offset-sync: Consumer offset synchronization task
  - acl-sync: ACL synchronization task
  - auto-create-mirror: Automatic mirror topic creation task
  - topic-configs-sync: Topic configuration synchronization task
  - clear-mirror-start-offsets: Clear mirror start offsets task
  - pause-mirror-topics: Pause mirror topics task
  - check-availability: Availability check task
  - state-aggregator: State aggregation task
  - retry-task: Retry task for failed operations
  - periodic-partition-scheduler: Periodic partition scheduler task
  - degraded-partition-monitor: Degraded partition monitor task
state: The current state of the task. All possible values:
  - active: Task is currently running without errors
  - in-error: Task has encountered an error and cannot proceed
reason: Reason code indicating why a task is in error state. All possible values:
  - no-error: No error present (used when state is active)
  - authentication: Authentication failure with link credentials - link cannot authenticate to remote cluster
  - broker-authentication: Authentication failure with broker credentials - authentication issues between link and destination broker
  - authorization: Authorization failure with link credentials - link lacks required permissions on remote cluster
  - broker-authorization: Authorization failure with broker credentials - authorization issues between link and destination broker
  - misconfiguration: Configuration error - invalid or missing link configuration
  - internal: Internal or unexpected system error
  - suppressed-errors: Errors that are being suppressed and handled gracefully
  - consumer-group-in-use: Consumer group is currently active on destination - cannot sync offsets while consumers are active
  - remote-link-not-found: Remote link not found (for bidirectional links) - remote side of bidirectional link is missing
  - security-disabled: Remote cluster has no authorizer configured - remote cluster lacks security configuration
  - acl-limit-exceeded: ACL limit reached on destination cluster - ACL quota has been exceeded
  - invalid-request: Invalid request error - malformed or invalid API request
  - topic-exists: Topic already exists on destination - cannot create mirror topic because it already exists
  - policy-violation: Policy violation error - mirror topic creation violates cluster policies
  - invalid-topic: Invalid topic error - topic name or configuration is invalid
  - unknown-topic-or-partition: Topic or partition not found - source topic or partition doesn’t exist
- kafka.server:type=cluster-link-metrics,name=mirror-transition-in-error,link-name={linkName},state={state},reason={reason},mode={mode},connection-mode={connection_mode}
Monitor mirror topic state transition errors. For example, this metric reports errors that a mirror topic encounters during the promotion process; that is, while its state is pending_stopped and it is being transitioned to stopped.
Available tags:
state: The mirror topic transition state where an error occurred. All possible values:
  - pending_stopped: Mirror topic is in the process of being stopped (e.g., during promotion or failover)
  - pending_mirror: Topic is in the process of being converted to a mirror topic
  - pending_synchronization: Mirror topic is being reversed or swapped (bidirectional operations)
  - pending_restore: Mirror topic is in the process of being prepared for restore operations
  - pending_setup_for_restore: Mirror topic is being set up for restore operations
  - failed: Mirror topic repair or recovery operations in progress
reason: Reason code for the transition failure. Uses the same error codes as link task metrics (see above for complete list).
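To show how the link-task-count tags might drive an alert, the sketch below parses a JMX object name into its tags and builds a message when a task is reported with state=in-error. It is a simplified illustration: it assumes object names contain no quoted values or embedded commas, it assumes a value greater than 0 means at least one task is in that state, and the grouping of reason codes into an "auth" category is a choice made for this example.

```python
from typing import Dict, Optional, Tuple

# Reason codes from the list above that relate to credentials or permissions.
AUTH_REASONS = {
    "authentication", "broker-authentication",
    "authorization", "broker-authorization",
}


def parse_object_name(object_name: str) -> Tuple[str, Dict[str, str]]:
    """Split 'domain:key=value,key=value' into the domain and a tag dictionary."""
    domain, _, props = object_name.partition(":")
    tags = dict(prop.split("=", 1) for prop in props.split(","))
    return domain, tags


def task_alert(object_name: str, value: float) -> Optional[str]:
    """Return an alert message for an in-error link task, or None."""
    _, tags = parse_object_name(object_name)
    if tags.get("name") != "link-task-count":
        return None
    if tags.get("state") != "in-error" or value <= 0:
        return None
    reason = tags.get("reason", "unknown")
    kind = "auth" if reason in AUTH_REASONS else "other"
    return (f"[{kind}] link={tags.get('link-name')} "
            f"task={tags.get('task-name')} reason={reason}")


example = ("kafka.server:type=cluster-link-metrics,name=link-task-count,"
           "link-name=my-link,task-name=consumer-offset-sync,state=in-error,"
           "reason=authorization,mode=destination,connection-mode=outbound")
print(task_alert(example, 1))
```

The same parsing approach applies to mirror-transition-in-error, since it uses the same reason codes.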
Metrics on the destination cluster
The following broker metrics, specific to Cluster Linking, are available on the brokers in the destination cluster.
- kafka.server:type=cluster-link-metrics,name=mirror-partition-count,link-name={linkName}
Number of actively mirrored partitions for which the broker is leader
- kafka.server.link:type=ClusterLinkFetcherManager,name=FailedPartitionsCount,clientId=ClusterLink,link-name={linkName}
Number of failed mirror partitions for which the broker is leader. A partition may be in a failed state if the source partition’s epoch has gone backwards; for example, if the topic was recreated.
- kafka.server:type=ReplicaManager,name=UnderMinIsrMirrorPartitionCount
Number of mirrored partitions that are under min ISR.
- kafka.server:type=ReplicaManager,name=UnderReplicatedMirrorPartitions
Number of mirrored partitions that are under replicated.
- kafka.server:type=BrokerTopicMetrics,name=MirrorBytesInPerSec,topic={topic}
Rate of bytes per second written to mirror topics through Cluster Linking. This metric is available both per-topic (with topic tag) and as an aggregate across all mirror topics (without topic tag). Available since Confluent Platform 7.6.1.
This metric specifically tracks data written to the destination cluster’s mirror topics, which complements the existing FetcherStats metrics that track overall cluster link throughput. Use this metric for monitoring mirror topic write performance and comparing it with source topic throughput.
Available tags:
topic: The mirror topic name (present for per-topic metrics, absent for aggregate metrics)
- kafka.server:type=cluster-link-metrics,name=(linked-leader-epoch-change-rate, linked-leader-epoch-change-total),link-name={linkName}
Rate per second and total number of times leader election was triggered on this broker due to source leader changes.
Frequent triggers for leader election might indicate issues on the source cluster. This can be a useful metric during on-premises to Confluent Cloud migrations to identify if there are issues on the source cluster.
- kafka.server:type=cluster-link-metrics,name=(linked-topic-partition-addition-rate, linked-topic-partition-addition-total),link-name={linkName}
Rate per second and total number of times the partition count was updated due to source changes.
A high volume or rate of partition count updates might indicate issues on the source cluster.
- kafka.server:type=cluster-link-metrics,state={state},link-name={link_name},name=mirror-topic-count
Total number of mirror topics in a specific state. The state tag indicates the current operational state of mirror topics.
Available mirror topic states (link_mirror_topic_state tag values):
  - Mirror: Active, healthy mirror topics that are successfully replicating from source
  - PausedMirror: Mirror topics for which mirroring from the source has been explicitly paused
  - PendingStoppedMirror: Mirror topics whose mirroring is being stopped (transitional state, e.g., during promotion)
  - FailedMirror: Failed mirror topics where mirroring cannot proceed due to errors
  - SOURCE_UNAVAILABLE: Mirror topics where the source cluster or topic is unavailable
Note
Known issue: The Mirror state metric includes topics that are in the SOURCE_UNAVAILABLE state. As a temporary workaround, use kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},state=active,link-name={linkName} and kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},state=unavailable,link-name={linkName} to alert when the source is unavailable.
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},state={state},link-name={linkName},reason={reason}
Total number of cluster links in a specific state. You can filter or group by direction (source or destination) and connection mode (inbound or outbound).
Available link states (link_state tag values):
  - active: Link is operational and successfully connected to the source cluster
  - paused: Link has been explicitly paused and is not actively replicating
  - unavailable: Link cannot connect to the source cluster or is experiencing connectivity issues. The reason tag provides specific details about why the link is unavailable (uses same reason codes as task metrics documented above)
- kafka.server:type=cluster-link-metrics,state={state},link-name={link_name},mode={mode},connection-mode={connection_mode},reason={reason},name=link-count
Total number of links in a specific state. Use the state tag to filter by link state (see available link states above).
Additional common tags:
mode: The direction of the cluster link. Possible values:
  - destination: Links where this cluster acts as the destination
  - source: Links where this cluster acts as the source
  - bidirectional: Links that can operate in both directions
connection-mode: The connection mode for the cluster link. Possible values:
  - inbound: Connections coming into this cluster
  - outbound: Connections going out from this cluster
reason: (Only for unavailable links) The specific reason code why the link is unavailable. Uses the same reason codes as task metrics (see above for complete list).
- kafka.server:type=cluster-link-metrics,link-name={link_name},name=broker-failed-link-count
Total number of links in a failed state. This metric is reported per broker, because some or all brokers could have a link that is in a failed state and unable to connect to the source.
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},remote-link-connection-mode={remote_connection_mode},link-name={link_name},name=controller-reverse-connection-count
Number of persistent reverse connections for this link between the local broker and the remote link coordinator.
Available tags:
remote-link-connection-mode: The connection mode on the remote side. Possible values:
  - inbound: Remote cluster accepts connections from this cluster
  - outbound: Remote cluster initiates connections to this cluster
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=reverse-connection-count
Total number of reverse connections for this link between local and remote brokers.
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=(reverse-connection-created-total, reverse-connection-created-rate)
Total count and rate per second of reverse connections created for this cluster link.
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=(reverse-connection-closed-total, reverse-connection-closed-rate)
Total count and rate per second of reverse connections closed for this cluster link.
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=(reverse-connection-failed-total, reverse-connection-failed-rate)
Total count and rate per second of reverse connections that failed for this cluster link. Only available for outbound connections.
- kafka.server:type=cluster-link-metrics,link-name={link_name},name=link-fetcher-count
Number of link fetchers currently active for this cluster link.
- kafka.server:type=cluster-link-metrics,pool={pool},link-name={link_name},name=link-target-fetcher-count
Number of cluster link target fetchers per source broker.
Available tags:
pool: The fetcher pool type. Possible values include different fetcher pool configurations.
- kafka.server:type=cluster-link-metrics,link-name={link_name},name=link-fetcher-throttled-partition-count
Number of throttled partitions for this cluster link.
- kafka.server:type=cluster-link-metrics,link-name={link_name},name=link-fetcher-unassigned-partition-count
Number of unassigned partitions for this cluster link.
- kafka.server:type=cluster-link-metrics,reason={reason},link-name={link_name},name=link-fetcher-degraded-partition-count
Number of partitions in degraded state due to specific reasons.
Available tags:
reason: The reason for partition degradation. Specific degradation reasons include various operational issues.
- kafka.server:type=cluster-link-metrics,link-name={link_name},name=remote-admin-queue-size
Backlog queue size for the remote admin client for this cluster link.
Metrics on the source cluster
A source cluster running Kafka, or Confluent Server earlier than 6.1.0, is unaware of cluster links, but you can monitor cluster link related usage by allocating link-specific credentials. Quota metrics for each credential are available in both Kafka and Confluent Platform when user quotas are enabled.
- kafka.server:type=Fetch,user={linkPrincipal},client-id={optionalLinkClientId}
byte-rate
throttle-time
- kafka.server:type=Request,user={linkPrincipal},client-id={optionalLinkClientId}
request-time
throttle-time
If the source cluster is Confluent Server with a version of 6.1.0 or higher, then you can also monitor cluster links on the source cluster with the following metrics.
- kafka.server:type=cluster-link-source-metrics,request={request},link-id={linkUUID}
request-byte-total
request-total
response-byte-total
response-time-ns-avg
Setting quotas on resources and bandwidth usage
You can set various types of quotas to place caps on resources and bandwidth used for Cluster Linking on the source and destination clusters.
Destination cluster quotas
You can limit total usage for cluster links on each broker in the destination cluster by setting the broker config confluent.cluster.link.io.max.bytes.per.second.
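The broker config takes a value in bytes per second, so a small helper can make the conversion from a more readable unit explicit. The sketch below renders the property line for a broker properties file; the 50 MiB/s figure is an arbitrary example, not a recommendation, and the function name is hypothetical.

```python
def link_io_quota_property(mebibytes_per_second: float) -> str:
    """Render the per-broker Cluster Linking IO quota as a properties line."""
    bytes_per_second = int(mebibytes_per_second * 1024 * 1024)
    return f"confluent.cluster.link.io.max.bytes.per.second={bytes_per_second}"


# Example: cap cluster link usage at 50 MiB/s on each destination broker.
print(link_io_quota_property(50))
```

Because the limit applies on each broker, the effective ceiling for the whole destination cluster depends on how many brokers host mirror partitions.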
Source cluster quotas
The source cluster is unaware of cluster links, but can limit usage for cluster links by allocating link-specific credentials and assigning quotas for the link principal.
- Fetch byte-rate (replication throughput) for each cluster link principal
- Request rate quota (CPU/thread usage) for each cluster link principal. This also includes CPU usage for metadata and configuration sync as well as ACL and consumer offset migration.
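One way to assign these quotas is with the kafka-configs tool, using the standard Kafka user quota settings consumer_byte_rate (fetch byte-rate) and request_percentage (request rate). The sketch below only assembles the command line; the bootstrap address, principal name, and quota values are placeholders, and a secured source cluster would typically also need a --command-config file, which is omitted here.

```python
from typing import List


def link_quota_command(bootstrap_server: str, link_principal: str,
                       fetch_bytes_per_sec: int,
                       request_percentage: int) -> List[str]:
    """Build a kafka-configs invocation that sets user quotas for a link principal."""
    add_config = (f"consumer_byte_rate={fetch_bytes_per_sec},"
                  f"request_percentage={request_percentage}")
    return [
        "kafka-configs",
        "--bootstrap-server", bootstrap_server,
        "--alter",
        "--add-config", add_config,
        "--entity-type", "users",
        "--entity-name", link_principal,
    ]


# Placeholder values: 10 MiB/s fetch quota and 50% request quota.
command = link_quota_command("source-broker:9092", "link-user", 10 * 1024 * 1024, 50)
print(" ".join(command))
```

Returning the arguments as a list keeps the sketch free of shell quoting concerns; the list can be passed to subprocess.run on a host where the Kafka CLI tools are installed.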
Monitoring Use Cases and Alert Recommendations
Here are common monitoring scenarios and recommended metrics for cluster link operations:
Health Monitoring
- Link Status: kafka.server:type=cluster-link-metrics,name=link-count,state=active - Should be 1 for healthy links
- Mirror Topic Health: kafka.server:type=cluster-link-metrics,name=mirror-topic-count,state=Mirror - Track active mirror topics
- Failed Links: kafka.server:type=cluster-link-metrics,name=broker-failed-link-count - Alert when > 0
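The health checks above can be combined into a single evaluation over per-broker readings. The following sketch assumes you have already collected, for one link, the link-count value with state=active and the broker-failed-link-count value from each broker; the dictionaries, broker IDs, and message wording are hypothetical.

```python
from typing import Dict, List


def evaluate_link_health(active_by_broker: Dict[int, int],
                         failed_by_broker: Dict[int, int]) -> List[str]:
    """Return alert messages for one cluster link; an empty list means healthy."""
    alerts = []
    # Only the link coordinator reports the active link, so sum across brokers.
    if sum(active_by_broker.values()) < 1:
        alerts.append("no broker reports the link as active")
    # broker-failed-link-count is per broker, so report each affected broker.
    for broker, failed in sorted(failed_by_broker.items()):
        if failed > 0:
            alerts.append(f"broker {broker} reports {failed} failed link(s)")
    return alerts


print(evaluate_link_health({1: 0, 2: 1, 3: 0}, {1: 0, 2: 0, 3: 1}))
```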
Performance Monitoring
- Replication Lag: kafka.server.link:type=ClusterLinkFetcherManager,name=MaxLag - Monitor cross-cluster lag
- Throughput: kafka.server:type=FetcherStats,name=BytesPerSec - Track data transfer rates
- Mirror Topic Throughput: kafka.server:type=BrokerTopicMetrics,name=MirrorBytesInPerSec - Track mirror topic write rates (CP 7.6.1+)
- Throttling: kafka.server:type=cluster-link,name=fetch-throttle-time-avg - Monitor throttling impact
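Lag and throttle time are most useful when read together, since throttling is one possible explanation for growing lag. The sketch below flags sustained growth in a series of MaxLag samples and notes whether throttling was observed at the same time. The window of five samples and the "greater than zero" throttle test are arbitrary assumptions for illustration and would need tuning for a real deployment.

```python
from typing import Sequence


def lag_is_growing(max_lag_samples: Sequence[float], window: int = 5) -> bool:
    """True if the last `window` MaxLag samples are strictly increasing."""
    if len(max_lag_samples) < window:
        return False
    recent = list(max_lag_samples[-window:])
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def diagnose(max_lag_samples: Sequence[float], throttle_time_avg_ms: float) -> str:
    """Summarize lag trend and whether fetch throttling was observed."""
    if not lag_is_growing(max_lag_samples):
        return "ok"
    if throttle_time_avg_ms > 0:
        return "lag growing, throttling observed"
    return "lag growing, no throttling observed"


print(diagnose([100, 250, 400, 900, 1500], throttle_time_avg_ms=12.0))
```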
Error Monitoring
- Task Failures: kafka.server:type=cluster-link-metrics,name=link-task-count,state=in-error - Alert on task errors
- Connection Issues: kafka.server:type=cluster-link-metrics,name=reverse-connection-failed-total - Monitor connection failures
- Authentication Problems: Use reason=authentication or reason=authorization tags to identify auth issues
Capacity Planning
- Connection Count: kafka.server:type=cluster-link-metrics,name=reverse-connection-count - Track connection usage
- Queue Sizes: kafka.server:type=cluster-link-metrics,name=remote-admin-queue-size - Monitor queue backlogs
- Thread Usage: kafka.server:type=cluster-link-metrics,name=background-thread-usage - Track resource utilization