Monitor Cluster Metrics and Optimize Links for Cluster Linking on Confluent Platform

Confluent Platform exposes several metrics through Java Management Extensions (JMX) that are useful for monitoring Cluster Linking. Some of these are derived from extending existing interfaces, and others are new for Cluster Linking.

To monitor Cluster Linking, set the JMX_PORT environment variable before starting the cluster, then collect the reported metrics using your usual monitoring tools. JMXTrans, Graphite, and Grafana are a popular combination for collecting and reporting JMX metrics from Kafka. Datadog is another popular monitoring solution.

Quick Start for Administrators

For basic health monitoring, focus on link-count (state=active) and mirror-topic-count metrics
For performance monitoring, track MaxLag, BytesPerSec, and throttle time metrics
For troubleshooting, monitor link-task-count (state=in-error) and connection failure metrics
See the complete reference section for all available metrics
For comprehensive lists of all possible state and reason code values, see the Enumerated Tag Values Reference section

Also, you can set quotas to limit resources and bandwidth used by Cluster Linking to help optimize the system.

Monitoring brokers

Keep in mind that on KRaft these metrics apply only to brokers (not to the KRaft controller quorum). The broker that is the link coordinator will properly reflect the state of the link.

For example, the JMX MBean for Cluster Linking status is kafka.server:type=cluster-link-metrics,mode={mode},state={state},link-name={linkName}. The state can be active, paused, or unavailable. This MBean represents the total number of links in that specific state.

For the state=active MBean:

Link coordinator broker: Returns 1 if the cluster link is active and managed by this broker
Non-coordinator brokers: Return 0

Example: kafka.server:type=cluster-link-metrics,mode=source,state=active,link-name=my-link returns 1 on the link coordinator, 0 on other brokers.

In previous Confluent Platform releases that supported ZooKeeper (pre-8.0), the ZooKeeper “controller” broker acted as the link coordinator, so only that broker held the state of cluster links. The JMX MBean for Cluster Linking status against the “controller” would return 1 if the cluster link was active, but would return 0 on any other broker. ZooKeeper is no longer available on Confluent Platform 8.0 and later. To learn more, see KRaft and ZooKeeper in “What’s supported”.

Cluster link fetcher metrics

kafka.server.link:type=ClusterLinkFetcherManager,name=MaxLag,clientId=ClusterLink,link-name={linkName}

Maximum lag in messages between the replicas on the destination cluster and the leader replica on the source cluster. Provides an indication of overall cluster link health and a relative measure of whether the destination cluster is lagging behind.

kafka.server.link:type=ClusterLinkFetcherManager,name=FetchTimeMax,clientId=ClusterLink,link-name={linkName},state={state}

This metric shows the longest time gap between successful data fetches for a partition. The clock starts when a partition is ready to be used and stops when a successful fetch occurs (even if no new data comes in). If the time since the last fetch is longer than the delay from the previous fetch, that longer period is used. Partitions with unavailable, failed, or degraded links are not counted. Normally, the fetch interval is low. If you notice a consistently high interval, it could signal an underlying issue.

Available tags: - state: The fetch state (fetch for active fetching, other values for different states)

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},topic={topic},partition={partition},link-name={linkName}

Drill-down of MaxLag, which provides lag in number of messages per replica on the destination cluster from the leader replica on the source cluster. Useful to debug any specific lagging topic partitions/replicas on the destination cluster.

kafka.server.link:type=ClusterLinkFetcherManager,name=DeadThreadCount,clientId=ClusterLink,link-name={linkName}

Number of dead threads of type ClusterLinkThread, where a dead thread is indicative of an issue with fetching from the source cluster for a partition.

kafka.server.link:type=ClusterLinkFetcherManager,name=DegradedPartitionCount,clientId=ClusterLink,link-name={linkName}

Partitions that have been moved to degraded state, typically due to a non-critical error after retry timeout. These include user errors, such as authorization failure and includes link availability failures, indicated by the reason tag. A sustained value greater than (>) 0 beyond multiple metadata intervals and availability timeout may indicate an issue, depicted by the reason tag internal.

kafka.server:type=FetcherStats,name=BytesPerSec,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},brokerHost={host},brokerPort={port},link-name={linkName}

Rate at which data is fetched from the source cluster. Indicates amount of throughput in bytes per second on the cluster link.

kafka.server:type=FetcherStats,name=RequestsPerSec,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId}, brokerHost={host},brokerPort={port},link-name={linkName}

Rate at which requests are issued to the source cluster over the cluster link.

Throttle time metrics for cluster link fetchers

kafka.server:type=cluster-link,link-name={linkName}: Throttle time metrics may indicate that a cluster link connection is being throttled, which is useful for understanding why lag may be increasing over a cluster link.

Name	Description
`fetch-throttle-time-avg`, `fetch-throttle-time-max`	Gives throttle times for the Cluster Linking fetchers. May indicate increases in lag on the cluster link due to throttling/quotas being enforced.

Cluster link network client metrics

kafka.server:type=cluster-link-metadata-metrics,link-name={linkName}: Metrics pertaining to the metadata refresh client in the cluster link.
kafka.server:type=cluster-link-fetcher-metrics,link-name={linkName},broker-id={id},fetcher-id={id}: Metrics pertaining to data fetcher requests in the cluster link.

Metrics seen through cluster-link-fetcher-metrics are shown in the following table.

Name	Description
`connection-count`	Number of connections for a cluster link
`connection-creation-rate`, `connection-creation-total`	Rate per second and total for connections created for the cluster link. If the rate is high, the source cluster may be overloaded with connections.
`connection-close-rate`, `connection-close-total`	Rate per second and total for connections closed for the cluster link. Can be compared to connection creation to understand the balance of creating/closing connections.
`incoming-byte-rate`, `incoming-byte-total`	Number of bytes per second and total received from the source cluster.
`outgoing-byte-rate`, `outgoing-byte-total`	Number of bytes per second and total sent to the source cluster.
`network-io-rate`, `network-io-total`	Indicates rate and total network input/output (IO).
`request-rate`, `request-total`	Rate per second and total requests issued to the source cluster over the cluster link.
`response-rate`, `response-total`	Rate per second and total number of responses received from the source cluster over the cluster link.
`request-size-avg`, `request-size-max`	Average and maximum size in bytes of requests issued to the source cluster over the cluster link.
`io-ratio`, `io-wait-time-ns-avg`, `io-waittime-total`, `iotime-total`	Statistics of the destination cluster’s network IO for the requests over the cluster link.
`successful-authentication-rate`, `successful-authentication-total`	Rate per second and total number of cluster link clients authenticating to the source cluster over the cluster link.
`successful-reauthentication-rate`, `successful-reauthentication-total`	Rate per second and total re-authentication to the source cluster over the cluster link.
`failed-reauthentication-rate`, `failed-reauthentication-total`	Rate per second and total failed re-authentications to the source cluster over the cluster link. If failures are present, it could indicate misconfigured or stale cluster link credentials.
`reauthentication-latency-avg`	Average re-authentication latency. Helps to assess whether clients are taking too long to authenticate to the clusters.

Request metrics for cluster link APIs

kafka.network:type=RequestMetrics,name={ LocalTimeMs| RemoteTimeMs | RequestBytes | RequestQueueTimeMs| ResponseQueueTimeMs | ResponseSendIoTimeMs | ResponseSendTimeMs | ThrottleTimeMs| TotalTimeMs },request={CreateClusterLinks| DeleteClusterLinks| ListClusterLinks}: Depending on the request name, provides statistics on requests on the cluster link, including requests and response times, time requests wait in the queue, size of requests (bytes), and so forth.
kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=ClusterLink: Provides metrics on delayed operations on the cluster link.

Task status metrics

kafka.server:type=cluster-link-metrics,name=link-task-count,link-name={linkName},task-name={taskName},state={state},reason={reason},mode={mode},connection-mode={connection_mode}: Monitor the state of link level tasks. For example, monitor if consumer offset syncing is working. If the task is in error, a reason code is provided. You can set up alerts to trigger if errors occur.

Available tags:

task-name: The specific task being monitored. Possible values:

consumer-offset-sync: Consumer offset synchronization task
acl-sync: ACL synchronization task
auto-create-mirror: Automatic mirror topic creation task
topic-configs-sync: Topic configuration synchronization task
clear-mirror-start-offsets: Clear mirror start offsets task
pause-mirror-topics: Pause mirror topics task
check-availability: Availability check task
state-aggregator: State aggregation task
retry-task: Retry task for failed operations
periodic-partition-scheduler: Periodic partition scheduler task
degraded-partition-monitor: Degraded partition monitor task

state: The current state of the task. All possible values:

active: Task is currently running without errors
in-error: Task has encountered an error and cannot proceed

reason: Reason code indicating why a task is in error state. All possible values:

no-error: No error present (used when state is active)
authentication: Authentication failure with link credentials - link cannot authenticate to remote cluster
broker-authentication: Authentication failure with broker credentials - authentication issues between link and destination broker
authorization: Authorization failure with link credentials - link lacks required permissions on remote cluster
broker-authorization: Authorization failure with broker credentials - authorization issues between link and destination broker
misconfiguration: Configuration error - invalid or missing link configuration
internal: Internal or unexpected system error
suppressed-errors: Errors that are being suppressed and handled gracefully
consumer-group-in-use: Consumer group is currently active on destination - cannot sync offsets while consumers are active
remote-link-not-found: Remote link not found (for bidirectional links) - remote side of bidirectional link is missing
security-disabled: Remote cluster has no authorizer configured - remote cluster lacks security configuration
acl-limit-exceeded: ACL limit reached on destination cluster - ACL quota has been exceeded
invalid-request: Invalid request error - malformed or invalid API request
topic-exists: Topic already exists on destination - cannot create mirror topic because it already exists
policy-violation: Policy violation error - mirror topic creation violates cluster policies
invalid-topic: Invalid topic error - topic name or configuration is invalid
unknown-topic-or-partition: Topic or partition not found - source topic or partition doesn’t exist

kafka.server:type=cluster-link-metrics,name=mirror-transition-in-error,link-name={linkName},state={state},reason={reason},mode={mode},connection-mode={connection_mode}: Monitor mirror topic state transition errors. For example, if a mirror topic encounters errors during the promotion process; that is, while its state is pending_stopped and it is being transitioned to stopped.

Available tags:

state: The mirror topic transition state where an error occurred. All possible values:

pending_stopped: Mirror topic is in the process of being stopped (e.g., during promotion or failover)
pending_mirror: Topic is in the process of being converted to a mirror topic
pending_synchronization: Mirror topic is being reversed or swapped (bidirectional operations)
pending_restore: Mirror topic is in the process of being prepared for restore operations
pending_setup_for_restore: Mirror topic is being set up for restore operations
failed: Mirror topic repair or recovery operations in progress

reason: Reason code for the transition failure. Uses the same error codes as link task metrics (see above for complete list).

Metrics on the destination cluster

The following broker metrics, specific to Cluster Linking, are available on the brokers in the destination cluster.

kafka.server:type=cluster-link-metrics,name=mirror-partition-count,link-name={linkName}

Number of actively mirrored partitions for which the broker is leader

kafka.server.link:type=ClusterLinkFetcherManager,name=FailedPartitionsCount,clientId=ClusterLink,link-name={linkName}

Number of failed mirror partitions for which the broker is leader. A partition may be in a failed state if the source partition’s epoch has gone backwards; for example, if the topic was recreated.

kafka.server:type=ReplicaManager,name=UnderMinIsrMirrorPartitionCount

Number of mirrored partitions that are under min ISR.

kafka.server:type=ReplicaManager,name=UnderReplicatedMirrorPartitions

Number of mirrored partitions that are under replicated.

kafka.server:type=BrokerTopicMetrics,name=MirrorBytesInPerSec,topic={topic}

Rate of bytes per second written to mirror topics through Cluster Linking. This metric is available both per-topic (with topic tag) and as an aggregate across all mirror topics (without topic tag). Available since Confluent Platform 7.6.1.

This metric specifically tracks data written to the destination cluster’s mirror topics, which complements the existing FetcherStats metrics that track overall cluster link throughput. Use this metric for monitoring mirror topic write performance and comparing it with source topic throughput.

Available tags:

topic: The mirror topic name (present for per-topic metrics, absent for aggregate metrics)

kafka.server:type=cluster-link-metrics,name=(linked-leader-epoch-change-rate, linked-leader-epoch-change-total),link-name={linkName}

Rate per second and total number of times leader election was triggered on this broker due to source leader changes.

Frequent triggers for leader election might indicate issues on the source cluster. This can be a useful metric during on-premises to Confluent Cloud migrations to identify if there are issues on the source cluster.

kafka.server:type=cluster-link-metrics,name=(linked-topic-partition-addition-rate, linked-topic-partition-addition-total),link-name={linkName}

Rate per second and total number of times at which the partition count was updated due to source changes.

A high volume or rate of partition count updates might indicate issues on the source cluster.

kafka.server:type=cluster-link-metrics,state={state},link-name={link_name},name=mirror-topic-count

Total number of mirror topics in a specific state. The state tag indicates the current operational state of mirror topics.

Available mirror topic states (link_mirror_topic_state tag values):

Mirror: Active, healthy mirror topics that are successfully replicating from source
PausedMirror: Mirror topics for which mirroring from the source has been explicitly paused
PendingStoppedMirror: Mirror topics whose mirroring is being stopped (transitional state, e.g., during promotion)
FailedMirror: Failed mirror topics where mirroring cannot proceed due to errors
SOURCE_UNAVAILABLE: Mirror topics where the source cluster or topic is unavailable

Note

Note: Known issue: The Mirror state metric will include topics that are in the SOURCE_UNAVAILABLE state. As a temporary workaround, use kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},state=active,link-name={linkName} and kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},state=unavailable,link-name={linkName} to alert on the source absence.

kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},state={state},link-name={linkName},reason={reason}: Total number of cluster links in a specific state. You can filter or group by direction (source or destination) and connection mode (inbound or outbound).

Available link states (link_state tag values):

active: Link is operational and successfully connected to the source cluster
paused: Link has been explicitly paused and is not actively replicating
unavailable: Link cannot connect to the source cluster or is experiencing connectivity issues. The reason tag provides specific details about why the link is unavailable (uses same reason codes as task metrics documented above)

kafka.server:type=cluster-link-metrics,state={state},link-name={link_name},mode={mode},connection-mode={connection_mode},reason={reason},name=link-count: Total number of links in a specific state. Use the state tag to filter by link state (see available link states above).

Additional common tags:

mode: The direction of the cluster link. Possible values:

destination: Links where this cluster acts as the destination
source: Links where this cluster acts as the source
bidirectional: Links that can operate in both directions

connection-mode: The connection mode for the cluster link. Possible values:

inbound: Connections coming into this cluster
outbound: Connections going out from this cluster

reason: (Only for unavailable links) The specific reason code why the link is unavailable. Uses the same reason codes as task metrics (see above for complete list).

kafka.server:type=cluster-link-metrics,link-name={link_name},name=broker-failed-link-count: Total number of links in a failed state. This metric is reported on a per broker state, as some or all brokers could have a link in a failed state and unable to connect to the source.
kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},remote-link-connection-mode={remote_connection_mode},link-name={link_name},name=controller-reverse-connection-count: Number of persistent reverse connections for this link between the local broker and the remote link coordinator.

Available tags:

remote-link-connection-mode: The connection mode on the remote side. Possible values:

inbound: Remote cluster accepts connections from this cluster
outbound: Remote cluster initiates connections to this cluster

kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=reverse-connection-count: Total number of reverse connections for this link between local and remote brokers.
kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=reverse-connection-created-total,reverse-connection-created-rate: Total count and rate per second of reverse connections created for this cluster link.
kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=reverse-connection-closed-total,reverse-connection-closed-rate: Total count and rate per second of reverse connections closed for this cluster link.
kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=reverse-connection-failed-total,reverse-connection-failed-rate: Total count and rate per second of reverse connections that failed for this cluster link. Only available for outbound connections.
kafka.server:type=cluster-link-metrics,link-name={link_name},name=link-fetcher-count: Number of link fetchers currently active for this cluster link.
kafka.server:type=cluster-link-metrics,pool={pool},link-name={link_name},name=link-target-fetcher-count: Number of cluster link target fetchers per source broker.

Available tags:

pool: The fetcher pool type. Possible values include different fetcher pool configurations.

kafka.server:type=cluster-link-metrics,link-name={link_name},name=link-fetcher-throttled-partition-count: Number of throttled partitions for this cluster link.
kafka.server:type=cluster-link-metrics,link-name={link_name},name=link-fetcher-unassigned-partition-count: Number of unassigned partitions for this cluster link.
kafka.server:type=cluster-link-metrics,reason={reason},link-name={link_name},name=link-fetcher-degraded-partition-count: Number of partitions in degraded state due to specific reasons.

Available tags:

reason: The reason for partition degradation. Specific degradation reasons include various operational issues.

kafka.server:type=cluster-link-metrics,link-name={link_name},name=remote-admin-queue-size: Backlog queue size for the remote admin client for this cluster link.

Metrics on the source cluster

The source cluster (with Kafka or Confluent Server previous to 6.1.0) is unaware of cluster links, but can monitor cluster link related usage by allocating link-specific credentials. Quota metrics for each credential is available in both Kafka and Confluent Platform when user quotas are enabled.

kafka.server:type=Fetch,user={linkPrincipal},client-id={optionalLinkClientId}

byte-rate
throttle-time

kafka.server:type=Request,user={linkPrincipal},client-id={optionalLinkClientId}

request-time
throttle-time

If the source cluster is Confluent Server with a version of 6.1.0 or higher, then you can also monitor cluster links on the source cluster with the following metrics.

kafka.server:type=cluster-link-source-metrics,request={request},link-id={linkUUID}

request-byte-total
request-total
response-byte-total
response-time-ns-avg

Setting quotas on resources and bandwidth usage

You can set various types of quotas to place caps on resources and bandwidth used for Cluster Linking on the source and destination clusters.

Destination cluster quotas

You can limit total usage for cluster links on each broker in the destination cluster by setting the broker config confluent.cluster.link.io.max.bytes.per.second.

Source cluster quotas

The source cluster is unaware of cluster links, but can limit usage for cluster links by allocating link-specific credentials and assigning quotas for the link principal.

Fetch byte-rate (replication throughput) for each cluster link principal
Request rate quota (CPU/thread usage) for each cluster link principal. This also includes CPU usage for metadata and configuration sync as well as ACL and consumer offset migration.

Monitoring Use Cases and Alert Recommendations

Here are common monitoring scenarios and recommended metrics for cluster link operations:

Health Monitoring

Link Status: kafka.server:type=cluster-link-metrics,name=link-count,state=active - Should be 1 for healthy links
Mirror Topic Health: kafka.server:type=cluster-link-metrics,name=mirror-topic-count,state=Mirror - Track active mirror topics
Failed Links: kafka.server:type=cluster-link-metrics,name=broker-failed-link-count - Alert when > 0

Performance Monitoring

Replication Lag: kafka.server.link:type=ClusterLinkFetcherManager,name=MaxLag - Monitor cross-cluster lag
Throughput: kafka.server:type=FetcherStats,name=BytesPerSec - Track data transfer rates
Mirror Topic Throughput: kafka.server:type=BrokerTopicMetrics,name=MirrorBytesInPerSec - Track mirror topic write rates (CP 7.6.1+)
Throttling: kafka.server:type=cluster-link,name=fetch-throttle-time-avg - Monitor throttling impact

Error Monitoring

Task Failures: kafka.server:type=cluster-link-metrics,name=link-task-count,state=in-error - Alert on task errors
Connection Issues: kafka.server:type=cluster-link-metrics,name=reverse-connection-failed-total - Monitor connection failures
Authentication Problems: Use reason=authentication or reason=authorization tags to identify auth issues

Capacity Planning

Connection Count: kafka.server:type=cluster-link-metrics,name=reverse-connection-count - Track connection usage
Queue Sizes: kafka.server:type=cluster-link-metrics,name=remote-admin-queue-size - Monitor queue backlogs
Thread Usage: kafka.server:type=cluster-link-metrics,name=background-thread-usage - Track resource utilization

Complete Metrics and Tags Reference

This section provides a comprehensive reference for all Cluster Linking metrics and their available tags.

Common Tags Used Across Multiple Metrics

Core Tags:

link-name={linkName}: The name of the cluster link
mode={mode}: Link direction (destination, source, bidirectional)
connection-mode={connectionMode}: Connection mode (inbound, outbound)
state={state}: Metric-specific state values
reason={reason}: Error codes or specific reasons

Specialized Tags:

remote-link-connection-mode={remoteConnectionMode}: Remote side connection mode
task-name={taskName}: Specific task being monitored
pool={pool}: Fetcher pool type
thread={threadId}: Background thread identifier

Link Status and Count Metrics

Metric Name	Tags	Description
`link-count`	`link-name`, `state`, `mode`, `connection-mode`, `reason` (unavailable only)	Number of links in specified state
`broker-failed-link-count`	`link-name`	Links in failed state per broker
`remote-link-state-count`	`link-name`, `remote-link-state`, `mode`, `connection-mode`	Count by remote link state
`unavailable-link-count`	`link-name`, `reason`, `mode`, `connection-mode`	Unavailable links with reason
`intranet-connectivity-link-count`	`link-name`, `mode`, `connection-mode`	Links using intranet connectivity
`prefixed-destination-link-count`	None	Links with prefix enabled

Mirror Topic and Partition Metrics

Metric Name	Tags	Description
`mirror-partition-count`	`link-name`	Active mirror partitions
`mirror-topic-count`	`link-name`, `state`	Mirror topics by state
`failed-mirror-topic-count`	`link-name`, `reason`, `mode`, `connection-mode`	Failed mirror topics with reason
`MirrorBytesInPerSec`	`topic` (optional)	Bytes per second written to mirror topics (CP 7.6.1+)

Connection and Network Metrics

Metric Name	Tags	Description
`controller-reverse-connection-count`	`link-name`, `mode`, `connection-mode`, `remote-link-connection-mode`	Persistent controller connections
`reverse-connection-count`	`link-name`, `mode`, `connection-mode`	Total reverse connections
`reverse-connection-created-total\|rate`	`link-name`, `mode`, `connection-mode`	Reverse connections created
`reverse-connection-closed-total\|rate`	`link-name`, `mode`, `connection-mode`	Reverse connections closed
`reverse-connection-failed-total\|rate`	`link-name`, `mode`, `connection-mode`	Failed reverse connections

Fetcher and Performance Metrics

Metric Name	Tags	Description
`link-fetcher-count`	`link-name`	Active link fetchers
`link-target-fetcher-count`	`link-name`, `pool`	Target fetchers per pool
`link-fetcher-throttled-partition-count`	`link-name`	Throttled partitions
`link-fetcher-unassigned-partition-count`	`link-name`	Unassigned partitions
`link-fetcher-degraded-partition-count`	`link-name`, `reason`	Degraded partitions with reason
`fetch-throttle-time-avg\|max`	`link-name`	Fetch throttle time statistics

Task Status Metrics

Metric Name	Tags	Description
`link-task-count`	`link-name`, `task-name`, `state`, `reason`, `mode`, `connection-mode`	Task status monitoring
`mirror-transition-in-error`	`link-name`, `state`, `reason`, `mode`, `connection-mode`	Mirror transition errors

Administrative and Queue Metrics

Metric Name	Tags	Description
`remote-admin-queue-size`	`mode`, `connection-mode`	Remote admin queue backlog
`remote-admin-queue-time-ms-avg\|max`	`link-name`	Admin queue wait time
`remote-admin-request-time-ms-avg\|max`	`link-name`	Admin request processing time
`local-admin-queue-size`	`thread`	Local admin queue per thread
`local-admin-queue-time-ms-avg\|max`	`link-name`	Local admin queue wait time
`local-admin-request-time-ms-avg\|max`	`link-name`	Local admin request processing time

Data Sync and Topic Management Metrics

Metric Name	Tags	Description
`linked-topic-partition-addition-total\|rate`	`link-name`	Topic partition additions
`linked-leader-epoch-change-total\|rate`	`link-name`	Leader epoch changes
`acls-added-total\|rate`	`link-name`	ACLs successfully added
`acls-add-failed-total\|rate`	`link-name`	Failed ACL additions
`acls-deleted-total\|rate`	`link-name`	ACLs successfully deleted
`acls-delete-failed-total\|rate`	`link-name`	Failed ACL deletions
`describe-acls-from-source-failed-total\|rate`	`link-name`	Failed ACL descriptions
`consumer-offset-committed-total\|rate`	`link-name`	Consumer offsets synced
`consumer-offset-commit-failed-total\|rate`	`link-name`	Failed offset commits
`topic-config-update-total\|rate`	`link-name`	Topic config updates
`topic-config-update-failed-total\|rate`	`link-name`	Failed config updates

Auto-Mirror and Automation Metrics

Metric Name	Tags	Description
`auto-mirror-created-total\|rate`	`link-name`	Auto-created mirror topics
`auto-mirror-create-failed-total\|rate`	`link-name`	Failed auto-mirror creation
`auto-mirror-list-topics-from-source-failed-total\|rate`	`link-name`	Failed source topic listing
`auto-mirror-list-topics-from-destination-failed-total\|rate`	`link-name`	Failed destination topic listing
`auto-mirror-list-mirrors-from-source-failed-total\|rate`	`link-name`	Failed mirror listing from source
`list-consumer-group-offsets-from-source-failed-total\|rate`	`link-name`	Failed offset listing from source
`list-consumer-group-offsets-from-destination-failed-total\|rate`	`link-name`	Failed offset listing from destination
`list-consumer-groups-from-source-failed-total\|rate`	`link-name`	Failed consumer group listing
`prefixed-auto-mirror-created-total\|rate`	`link-name`	Prefixed auto-created mirrors
`prefixed-auto-mirror-create-failed-total\|rate`	`link-name`	Failed prefixed auto-creation
`prefixed-auto-mirror-topic-filtered-count`	`link-name`	Topics filtered out from mirroring

Throttle and Performance Timing Metrics

Metric Name	Tags	Description
`destination-lag-link-fetcher-throttle-total\|rate`	`link-name`	Throttle due to destination lag
`link-fetcher-produce-throttle-total\|rate`	`link-name`	Produce quota throttling
`link-fetcher-request-throttle-total\|rate`	`link-name`	Request quota throttling
`link-fetcher-fetch-time-avg\|max`	`link-name`, `state`	Fetch timing (state=fetch or wait)
`time-to-stop-mirror-topic-failover-ms-avg\|max`	`link-name`	Time to failover mirror topic
`time-to-stop-mirror-topic-promote-ms-avg\|max`	`link-name`	Time to promote mirror topic
`link-unavailable-total\|rate`	`link-name`, `mode`, `connection-mode`	Link unavailability incidents

Background Thread and Resource Metrics

Metric Name	Tags	Description
`background-thread-usage`	`thread`	Background thread active usage
`background-thread-tenants`	`thread`	Tenants per background thread
`background-thread-link-coordinators`	`thread`	Link coordinators per thread
`background-thread-max-task-wait-time-ms`	`thread`	Max task wait time per thread
`fetch-size-allocated-total`	`pool`	Total fetch size allocated
`tenant-fetch-size-allocated-min`	`pool`	Min tenant fetch allocation
`tenant-fetch-size-allocated-max`	`pool`	Max tenant fetch allocation

Multi-Tenant Specific Metrics

Metric Name	Tags	Description
`metadata-topic-owned-partitions-count`	None	Metadata topic partitions owned
`link-count-with-owned-link-coordinator`	None	Links with owned coordinator
`max-link-count-per-metadata-topic-partition`	None	Max links per metadata partition

Enumerated Tag Values Reference

This section provides all possible values for key metric tags used across cluster linking metrics. These values appear in various metrics throughout the document. For detailed explanations of each value in context, see the Task status metrics section.

link_task_state Values

active: Task is currently running without errors
in-error: Task has encountered an error and cannot proceed

link_state Values

active: Link is operational and successfully connected to the source cluster
paused: Link has been explicitly paused and is not actively replicating
unavailable: Link cannot connect to the source cluster or is experiencing connectivity issues

link_mirror_topic_state Values

Mirror: Active, healthy mirror topics that are successfully replicating from source
PausedMirror: Mirror topics for which mirroring from the source has been explicitly paused
PendingStoppedMirror: Mirror topics whose mirroring is being stopped (transitional state)
FailedMirror: Failed mirror topics where mirroring cannot proceed due to errors
SOURCE_UNAVAILABLE: Mirror topics where the source cluster or topic is unavailable

Mirror Topic Transition State Values

These states appear in the mirror-transition-in-error metric:

pending_stopped: Mirror topic is in the process of being stopped (e.g., during promotion or failover)
pending_mirror: Topic is in the process of being converted to a mirror topic
pending_synchronization: Mirror topic is being reversed or swapped (bidirectional operations)
pending_restore: Mirror topic is in the process of being prepared for restore operations
pending_setup_for_restore: Mirror topic is being set up for restore operations
failed: Mirror topic repair or recovery operations in progress

Reason Code Values (link_task_reason, link_mirror_topic_reason)

No Error:

no-error: No error present (used when state is active)

Authentication/Authorization:

authentication: Authentication failure with link credentials - link cannot authenticate to remote cluster
broker-authentication: Authentication failure with broker credentials - authentication issues between link and destination broker
authorization: Authorization failure with link credentials - link lacks required permissions on remote cluster
broker-authorization: Authorization failure with broker credentials - authorization issues between link and destination broker
security-disabled: Remote cluster has no authorizer configured - remote cluster lacks security configuration

Configuration Issues:

misconfiguration: Configuration error - invalid or missing link configuration
policy-violation: Policy violation error - mirror topic creation violates cluster policies
invalid-request: Invalid request error - malformed or invalid API request

Resource Issues:

topic-exists: Topic already exists on destination - cannot create mirror topic because it already exists
invalid-topic: Invalid topic error - topic name or configuration is invalid
unknown-topic-or-partition: Topic or partition not found - source topic or partition doesn’t exist
acl-limit-exceeded: ACL limit reached on destination cluster - ACL quota has been exceeded
consumer-group-in-use: Consumer group is currently active on destination - cannot sync offsets while consumers are active

Operational Issues:

internal: Internal or unexpected system error
suppressed-errors: Errors that are being suppressed and handled gracefully
remote-link-not-found: Remote link not found (for bidirectional links) - remote side of bidirectional link is missing
remote-mirror-not-found: Remote mirror topic not found
link-not-found: Link not found

State Transition Issues:

unexpected-mirror-state: Mirror topic is in an unexpected state
failed-to-update-metadata: Failed to update metadata for mirror topic
failover-timeout: Timeout occurred during failover operation

Monitor Cluster Metrics and Optimize Links for Cluster Linking on Confluent Platform

Quick Start for Administrators

Monitoring brokers

Cluster link fetcher metrics

Throttle time metrics for cluster link fetchers

Cluster link network client metrics

Request metrics for cluster link APIs

Task status metrics

Metrics on the destination cluster

Metrics on the source cluster

Setting quotas on resources and bandwidth usage

Destination cluster quotas

Source cluster quotas

Monitoring Use Cases and Alert Recommendations

Health Monitoring

Performance Monitoring

Error Monitoring

Capacity Planning

Complete Metrics and Tags Reference

Common Tags Used Across Multiple Metrics

Link Status and Count Metrics

Mirror Topic and Partition Metrics

Connection and Network Metrics

Fetcher and Performance Metrics

Task Status Metrics

Administrative and Queue Metrics

Data Sync and Topic Management Metrics

Auto-Mirror and Automation Metrics

Throttle and Performance Timing Metrics

Background Thread and Resource Metrics

Multi-Tenant Specific Metrics

Task Names for task-name Tag

Enumerated Tag Values Reference

link_task_state Values

link_state Values

link_mirror_topic_state Values

Mirror Topic Transition State Values

Reason Code Values (link_task_reason, link_mirror_topic_reason)

Related content