Monitor Cluster Metrics and Optimize Links for Cluster Linking on Confluent Platform¶
Confluent Platform exposes several metrics through Java Management Extensions (JMX) that are useful for monitoring Cluster Linking. Some of these are derived from extending existing interfaces, and others are new for Cluster Linking.
To monitor Cluster Linking, set the JMX_PORT environment variable before starting the cluster, then collect the reported metrics using your usual monitoring tools. JMXTrans, Graphite, and Grafana are a popular combination for collecting and reporting JMX metrics from Kafka. Datadog is another popular monitoring solution.
Quick Start for Administrators¶
- For basic health monitoring, focus on the link-count (state=active) and mirror-topic-count metrics.
- For performance monitoring, track MaxLag, BytesPerSec, and throttle time metrics.
- For troubleshooting, monitor link-task-count (state=in-error) and connection failure metrics.
- See the complete reference section for all available metrics.
Also, you can set quotas to limit resources and bandwidth used by Cluster Linking to help optimize the system.
Monitoring brokers¶
Keep in mind that on KRaft these metrics apply only to brokers (not to the KRaft controller quorum). The broker that is the link coordinator will properly reflect the state of the link.
For example, the JMX MBean for Cluster Linking status is kafka.server:type=cluster-link-metrics,mode={mode},state={state},link-name={linkName}. The state can be active, paused, or unavailable. This MBean represents the total number of links in that specific state.
For the state=active MBean:
- Link coordinator broker: Returns 1 if the cluster link is active and managed by this broker.
- Non-coordinator brokers: Return 0.
Example: kafka.server:type=cluster-link-metrics,mode=source,state=active,link-name=my-link returns 1 on the link coordinator and 0 on other brokers.
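Because only the link coordinator reports 1, a reading of 0 from a single broker does not by itself mean the link is down. The following Python sketch shows one way to collapse per-broker readings into a single verdict. The function names and the input shape (a mapping of broker ID to the value of the state=active MBean) are assumptions made for illustration; how the values are scraped depends on your monitoring stack.

```python
from typing import List, Mapping


def coordinator_brokers(active_by_broker: Mapping[int, int]) -> List[int]:
    """Return the IDs of brokers that report the link as active (value >= 1)."""
    return sorted(b for b, value in active_by_broker.items() if value >= 1)


def link_is_active(active_by_broker: Mapping[int, int]) -> bool:
    """Treat the link as active if any broker (the coordinator) reports it."""
    return len(coordinator_brokers(active_by_broker)) > 0


# Hypothetical readings for link "my-link" scraped from three brokers.
readings = {1: 0, 2: 1, 3: 0}
print(link_is_active(readings), coordinator_brokers(readings))
```

Aggregating before alerting avoids false alarms from the non-coordinator brokers, which always report 0.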
In previous Confluent Platform releases that supported ZooKeeper (pre-8.0), the ZooKeeper “controller” broker acted as the link coordinator, so only that broker held the state of cluster links. The JMX MBean for Cluster Linking status against the “controller” would return 1 if the cluster link was active, but would return 0 on any other broker.
ZooKeeper is no longer available on Confluent Platform 8.0 and later. To learn more, see KRaft and ZooKeeper in “What’s supported”.
Cluster link fetcher metrics¶
- kafka.server.link:type=ClusterLinkFetcherManager,name=MaxLag,clientId=ClusterLink,link-name={linkName}
- Maximum lag in messages between the replicas on the destination cluster and the leader replica on the source cluster. Provides an indication of overall cluster link health and a relative measure of whether the destination cluster is lagging behind.
- kafka.server.link:type=ClusterLinkFetcherManager,name=FetchTimeMax,clientId=ClusterLink,link-name={linkName},state={state}
This metric shows the longest time gap between successful data fetches for a partition. The clock starts when a partition is ready to be used and stops when a successful fetch occurs (even if no new data comes in). If the time since the last fetch is longer than the delay from the previous fetch, that longer period is used. Partitions with unavailable, failed, or degraded links are not counted. Normally, the fetch interval is low. A consistently high interval could signal an underlying issue.
Available tags:
  - state: The fetch state (fetch for active fetching, other values for different states)
- kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},topic={topic},partition={partition},link-name={linkName}
- Drill-down of MaxLag, which provides the lag in number of messages per replica on the destination cluster from the leader replica on the source cluster. Useful for debugging specific lagging topic partitions or replicas on the destination cluster.
- kafka.server.link:type=ClusterLinkFetcherManager,name=DeadThreadCount,clientId=ClusterLink,link-name={linkName}
- Number of dead threads of type ClusterLinkThread, where a dead thread is indicative of an issue with fetching from the source cluster for a partition.
- kafka.server.link:type=ClusterLinkFetcherManager,name=DegradedPartitionCount,clientId=ClusterLink,link-name={linkName}
- Partitions that have been moved to a degraded state, typically due to a non-critical error after a retry timeout. These include user errors, such as authorization failures, and link availability failures, as indicated by the reason tag. A sustained value greater than 0 beyond multiple metadata intervals and the availability timeout may indicate an issue, depicted by the reason tag internal.
- kafka.server:type=FetcherStats,name=BytesPerSec,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},brokerHost={host},brokerPort={port},link-name={linkName}
- Rate at which data is fetched from the source cluster. Indicates amount of throughput in bytes per second on the cluster link.
- kafka.server:type=FetcherStats,name=RequestsPerSec,clientId=ClusterLinkFetcherThread-{destBrokerId}-{linkName}-{sourceBrokerId},brokerHost={host},brokerPort={port},link-name={linkName}
- Rate at which requests are issued to the source cluster over the cluster link.
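The per-partition ConsumerLag MBeans carry the topic and partition in the object name, so finding the partitions behind a high MaxLag is mostly a matter of parsing names. The following Python sketch illustrates this under some assumptions: the input is a dictionary of full MBean name to lag value that you have already scraped, the tag values are unquoted and contain no commas, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_object_name(name: str) -> Tuple[str, Dict[str, str]]:
    """Split 'domain:k1=v1,k2=v2' into the domain and a tag dictionary."""
    domain, _, props = name.partition(":")
    tags = dict(prop.split("=", 1) for prop in props.split(","))
    return domain, tags


def worst_partitions(lag_by_mbean: Dict[str, float], top: int = 3) -> List[Tuple[str, int, float]]:
    """Return (topic, partition, lag) for the most lagging ConsumerLag MBeans."""
    rows = []
    for name, lag in lag_by_mbean.items():
        _, tags = parse_object_name(name)
        # Ignore any other MBeans that happen to be in the same scrape.
        if tags.get("name") == "ConsumerLag":
            rows.append((tags["topic"], int(tags["partition"]), lag))
    return sorted(rows, key=lambda row: row[2], reverse=True)[:top]
```

Sorting by lag and keeping only the top entries keeps dashboards readable on links that mirror many partitions.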
Throttle time metrics for cluster link fetchers¶
- kafka.server:type=cluster-link,link-name={linkName}
- Throttle time metrics may indicate that a cluster link connection is being throttled, which is useful for understanding why lag may be increasing over a cluster link.
Name | Description |
---|---|
fetch-throttle-time-avg, fetch-throttle-time-max | Gives throttle times for the Cluster Linking fetchers. May indicate increases in lag on the cluster link due to throttling/quotas being enforced. |
Cluster link network client metrics¶
- kafka.server:type=cluster-link-metadata-metrics,link-name={linkName}
- Metrics pertaining to the metadata refresh client in the cluster link.
- kafka.server:type=cluster-link-fetcher-metrics,link-name={linkName},broker-id={id},fetcher-id={id}
- Metrics pertaining to data fetcher requests in the cluster link.
Metrics seen through cluster-link-fetcher-metrics are shown in the following table.
Name | Description |
---|---|
connection-count | Number of connections for a cluster link |
connection-creation-rate, connection-creation-total | Rate per second and total for connections created for the cluster link. If the rate is high, the source cluster may be overloaded with connections. |
connection-close-rate, connection-close-total | Rate per second and total for connections closed for the cluster link. Can be compared to connection creation to understand the balance of creating/closing connections. |
incoming-byte-rate, incoming-byte-total | Number of bytes per second and total received from the source cluster. |
outgoing-byte-rate, outgoing-byte-total | Number of bytes per second and total sent to the source cluster. |
network-io-rate, network-io-total | Indicates rate and total network input/output (IO). |
request-rate, request-total | Rate per second and total requests issued to the source cluster over the cluster link. |
response-rate, response-total | Rate per second and total number of responses received from the source cluster over the cluster link. |
request-size-avg, request-size-max | Average and maximum size in bytes of requests issued to the source cluster over the cluster link. |
io-ratio, io-wait-time-ns-avg, io-waittime-total, iotime-total | Statistics of the destination cluster’s network IO for the requests over the cluster link. |
successful-authentication-rate, successful-authentication-total | Rate per second and total number of cluster link clients authenticating to the source cluster over the cluster link. |
successful-reauthentication-rate, successful-reauthentication-total | Rate per second and total re-authentication to the source cluster over the cluster link. |
failed-reauthentication-rate, failed-reauthentication-total | Rate per second and total failed re-authentications to the source cluster over the cluster link. If failures are present, it could indicate misconfigured or stale cluster link credentials. |
reauthentication-latency-avg | Average re-authentication latency. Helps to assess whether clients are taking too long to authenticate to the clusters. |
Request metrics for cluster link APIs¶
- kafka.network:type=RequestMetrics,name={LocalTimeMs|RemoteTimeMs|RequestBytes|RequestQueueTimeMs|ResponseQueueTimeMs|ResponseSendIoTimeMs|ResponseSendTimeMs|ThrottleTimeMs|TotalTimeMs},request={CreateClusterLinks|DeleteClusterLinks|ListClusterLinks}
- Depending on the request name, provides statistics on requests on the cluster link, including request and response times, the time requests wait in the queue, the size of requests (bytes), and so forth.
- kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=ClusterLink
- Provides metrics on delayed operations on the cluster link.
Task status metrics¶
- kafka.server:type=cluster-link-metrics,name=link-task-count,link-name={linkName},task-name={taskName},state={state},reason={reason},mode={mode},connection-mode={connection_mode}
- Monitor the state of link level tasks. For example, monitor if consumer offset syncing is working. If the task is in error, a reason code is provided. You can set up alerts to trigger if errors occur.
Available tags:
  - task-name: The specific task being monitored. Possible values:
    - consumer-offset-sync: Consumer offset synchronization task
    - acl-sync: ACL synchronization task
    - auto-create-mirror: Automatic mirror topic creation task
    - topic-configs-sync: Topic configuration synchronization task
    - clear-mirror-start-offsets: Clear mirror start offsets task
    - pause-mirror-topics: Pause mirror topics task
    - check-availability: Availability check task
    - state-aggregator: State aggregation task
    - retry-task: Retry task for failed operations
    - periodic-partition-scheduler: Periodic partition scheduler task
    - degraded-partition-monitor: Degraded partition monitor task
  - state: The current state of the task. Possible values:
    - active: Task is currently running
    - in-error: Task has encountered an error
  - reason: Error code when the task is in error state. Common values include:
    - no-error: No error (when state is active)
    - authentication: Authentication errors with link credentials [Link cannot authenticate to remote cluster]
    - broker-authentication: Authentication errors with broker credentials [Authentication issues between link and destination broker]
    - authorization: Authorization errors with link credentials [Link lacks permissions on remote cluster]
    - broker-authorization: Authorization errors with broker credentials [Authorization issues between link and destination broker]
    - misconfiguration: Configuration errors [Invalid or missing link configuration]
    - internal: Internal/unexpected errors [Unexpected system errors]
    - suppressed-errors: Errors that are being suppressed [Errors being handled gracefully]
    - consumer-group-in-use: Consumer group is active on destination [Cannot sync offsets while consumers are active]
    - remote-link-not-found: Remote link not found (bidirectional links) [Remote side of bidirectional link missing]
    - security-disabled: Remote cluster has no authorizer configured [Remote cluster lacks security configuration]
    - acl-limit-exceeded: ACL limit reached on destination cluster [ACL quota exceeded on destination]
    - invalid-request: Invalid request error [Malformed or invalid API request]
    - topic-exists: Topic already exists on destination [Cannot create mirror topic - already exists]
    - policy-violation: Policy violation error [Mirror topic creation violates policies]
    - invalid-topic: Invalid topic error [Topic name or configuration invalid]
    - unknown-topic-or-partition: Topic or partition not found [Source topic/partition doesn’t exist]
- kafka.server:type=cluster-link-metrics,name=mirror-transition-in-error,link-name={linkName},state={state},reason={reason},mode={mode},connection-mode={connection_mode}
- Monitor mirror topic state transition errors. For example, a mirror topic might encounter errors during the promotion process; that is, while its state is pending_stopped and it is being transitioned to stopped.
Available tags:
  - state: The mirror topic transition state. Common values include:
    - pending_stopped: Mirror topic is being stopped
    - pending_mirror: Topic is being converted to mirror
    - pending_synchronization: Mirror topic is being reversed/swapped
    - pending_restore: Mirror topic is being prepared for restore
    - pending_setup_for_restore: Mirror topic is being prepared for restore setup
    - failed: Mirror topic repair operations
  - reason: Error code for the transition failure. Uses the same error codes as task metrics (see above).
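As a rough illustration of alerting on task status, the following Python sketch filters scraped link-task-count series down to the tasks that are in error. The sample shape (a tag dictionary paired with a value) and the function name are assumptions for this example and are not part of Confluent Platform.

```python
from typing import Dict, Iterable, List, Tuple

# One scraped link-task-count series: (tags, value).
Sample = Tuple[Dict[str, str], float]


def tasks_in_error(samples: Iterable[Sample]) -> List[Tuple[str, str, str]]:
    """Return sorted (link-name, task-name, reason) for tasks reporting errors."""
    failing = set()
    for tags, value in samples:
        # A series only matters when it is the in-error state and is non-zero.
        if tags.get("state") == "in-error" and value > 0:
            failing.add((tags["link-name"], tags["task-name"], tags.get("reason", "unknown")))
    return sorted(failing)
```

Returning the reason alongside the task name lets an alert message say what failed and why without a second lookup.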
Metrics on the destination cluster¶
Starting with Confluent Platform 7.0.0, the following broker metrics, specific to Cluster Linking, are available on the brokers in the destination cluster.
- kafka.server:type=cluster-link-metrics,name=mirror-partition-count,link-name={linkName}
- Number of actively mirrored partitions for which the broker is leader
- kafka.server.link:type=ClusterLinkFetcherManager,name=FailedPartitionsCount,clientId=ClusterLink,link-name={linkName}
- Number of failed mirror partitions for which the broker is leader. A partition may be in a failed state if the source partition’s epoch has gone backwards; for example, if the topic was recreated.
- kafka.server:type=ReplicaManager,name=UnderMinIsrMirrorPartitionCount
- Number of mirrored partitions that are under min ISR.
- kafka.server:type=ReplicaManager,name=UnderReplicatedMirrorPartitions
- Number of mirrored partitions that are under replicated.
- kafka.server:type=BrokerTopicMetrics,name=MirrorBytesInPerSec,topic={topic}
Rate of bytes per second written to mirror topics through Cluster Linking. This metric is available both per-topic (with topic tag) and as an aggregate across all mirror topics (without topic tag). Available since Confluent Platform 7.6.1.
This metric specifically tracks data written to the destination cluster’s mirror topics, which complements the existing FetcherStats metrics that track overall cluster link throughput. Use this metric for monitoring mirror topic write performance and comparing it with source topic throughput.
Available tags:
  - topic: The mirror topic name (present for per-topic metrics, absent for aggregate metrics)
- kafka.server:type=cluster-link-metrics,name=(linked-leader-epoch-change-rate, linked-leader-epoch-change-total),link-name={linkName}
Rate per second and total number of times leader election was triggered on this broker due to source leader changes.
- Frequent triggers for leader election might indicate issues on the source cluster. This can be a useful metric during on-premises to Confluent Cloud migrations to identify if there are issues on the source cluster.
- kafka.server:type=cluster-link-metrics,name=(linked-topic-partition-addition-rate, linked-topic-partition-addition-total),link-name={linkName}
Rate per second and total number of times at which the partition count was updated due to source changes.
- A high volume or rate of partition count updates might indicate issues on the source cluster.
- kafka.server:type=cluster-link-metrics,state=Mirror,link-name={link_name},name=mirror-topic-count
Total number of active (healthy) mirror topics.
Note
Known issue: This metric includes topics that are in the SOURCE_UNAVAILABLE state. As a temporary workaround, use kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},state=active,link-name={linkName} and kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},state=unavailable,link-name={linkName} to alert when the source is unavailable.
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},state=active,link-name={linkName}
- Total number of active cluster links connected to the cluster. You can filter or group by direction (source or destination) and connection mode (inbound or outbound).
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},state=paused,link-name={linkName}
- Total number of paused cluster links connected to the cluster. You can filter or group by direction (source or destination) and connection mode (inbound or outbound).
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},state=unavailable,link-name={linkName},reason={reason}
- Total number of unavailable cluster links connected to the cluster. You can filter or group by direction (source or destination) and connection mode (inbound or outbound). The reason tag provides specific details about why the link is unavailable.
- kafka.server:type=cluster-link-metrics,state=PausedMirror,link-name={link_name},name=mirror-topic-count
- Total number of mirror topics for which mirroring from the source has been paused.
- kafka.server:type=cluster-link-metrics,state=PendingStoppedMirror,link-name={link_name},name=mirror-topic-count
- Total number of mirror topics whose mirroring has been temporarily stopped.
- kafka.server:type=cluster-link-metrics,state=FailedMirror,link-name={link_name},name=mirror-topic-count
- Total number of failed mirror topics; that is, mirroring cannot proceed.
- kafka.server:type=cluster-link-metrics,state=active,link-name={link_name},mode={mode},connection-mode={connection_mode},name=link-count
- Total number of links in an active state.
Available tags:
  - mode: The direction of the cluster link. Possible values:
    - destination: Links where this cluster acts as the destination
    - source: Links where this cluster acts as the source
    - bidirectional: Links that can operate in both directions
  - connection-mode: The connection mode for the cluster link. Possible values:
    - inbound: Connections coming into this cluster
    - outbound: Connections going out from this cluster
- kafka.server:type=cluster-link-metrics,state=paused,link-name={link_name},mode={mode},connection-mode={connection_mode},name=link-count
- Total number of links in a paused state.
- kafka.server:type=cluster-link-metrics,state=unavailable,link-name={link_name},mode={mode},connection-mode={connection_mode},reason={reason},name=link-count
- Total number of links in an unavailable state, where the source may be unavailable or the link is having trouble fetching.
Additional tags for unavailable links:
  - reason: The specific reason why the link is unavailable. Common values include authentication failures, authorization issues, network connectivity problems, and internal errors.
- kafka.server:type=cluster-link-metrics,link-name={link_name},name=broker-failed-link-count
- Total number of links in a failed state. This metric is reported per broker, because some or all brokers could have a link that is in a failed state and unable to connect to the source.
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},remote-link-connection-mode={remote_connection_mode},link-name={link_name},name=controller-reverse-connection-count
- Number of persistent reverse connections for this link between the local broker and the remote link coordinator.
Available tags:
  - remote-link-connection-mode: The connection mode on the remote side. Possible values:
    - inbound: Remote cluster accepts connections from this cluster
    - outbound: Remote cluster initiates connections to this cluster
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=reverse-connection-count
- Total number of reverse connections for this link between local and remote brokers.
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=reverse-connection-created-total,reverse-connection-created-rate
- Total count and rate per second of reverse connections created for this cluster link.
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=reverse-connection-closed-total,reverse-connection-closed-rate
- Total count and rate per second of reverse connections closed for this cluster link.
- kafka.server:type=cluster-link-metrics,mode={mode},connection-mode={connection_mode},link-name={link_name},name=reverse-connection-failed-total,reverse-connection-failed-rate
- Total count and rate per second of reverse connections that failed for this cluster link. Only available for outbound connections.
- kafka.server:type=cluster-link-metrics,link-name={link_name},name=link-fetcher-count
- Number of link fetchers currently active for this cluster link.
- kafka.server:type=cluster-link-metrics,pool={pool},link-name={link_name},name=link-target-fetcher-count
- Number of cluster link target fetchers per source broker.
Available tags:
  - pool: The fetcher pool type. Possible values include different fetcher pool configurations.
- kafka.server:type=cluster-link-metrics,link-name={link_name},name=link-fetcher-throttled-partition-count
- Number of throttled partitions for this cluster link.
- kafka.server:type=cluster-link-metrics,link-name={link_name},name=link-fetcher-unassigned-partition-count
- Number of unassigned partitions for this cluster link.
- kafka.server:type=cluster-link-metrics,reason={reason},link-name={link_name},name=link-fetcher-degraded-partition-count
- Number of partitions in degraded state due to specific reasons.
Available tags:
  - reason: The reason for partition degradation. Specific degradation reasons include various operational issues.
- kafka.server:type=cluster-link-metrics,link-name={link_name},name=remote-admin-queue-size
- Backlog queue size for the remote admin client for this cluster link.
Metrics on the source cluster¶
The source cluster (with Kafka or Confluent Server earlier than 6.1.0) is unaware of cluster links, but can monitor cluster link related usage by allocating link-specific credentials. Quota metrics for each credential are available in both Kafka and Confluent Platform when user quotas are enabled.
- kafka.server:type=Fetch,user={linkPrincipal},client-id={optionalLinkClientId}
  - byte-rate
  - throttle-time
- kafka.server:type=Request,user={linkPrincipal},client-id={optionalLinkClientId}
  - request-time
  - throttle-time
If the source cluster is Confluent Server with a version of 6.1.0 or higher, then you can also monitor cluster links on the source cluster with the following metrics.
- kafka.server:type=cluster-link-source-metrics,request={request},link-id={linkUUID}
  - request-byte-total
  - request-total
  - response-byte-total
  - response-time-ns-avg
Setting quotas on resources and bandwidth usage¶
You can set various types of quotas to place caps on resources and bandwidth used for Cluster Linking on the source and destination clusters.
Destination cluster quotas¶
You can limit total usage for cluster links on each broker in the destination cluster by setting the broker config confluent.cluster.link.io.max.bytes.per.second.
Source cluster quotas¶
The source cluster is unaware of cluster links, but can limit usage for cluster links by allocating link-specific credentials and assigning quotas for the link principal.
- Fetch byte-rate (replication throughput) for each cluster link principal
- Request rate quota (CPU/thread usage) for each cluster link principal. This also includes CPU usage for metadata and configuration sync as well as ACL and consumer offset migration.
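Since the destination limit is applied on each broker, a cluster-wide bandwidth budget has to be divided before it is set. The following Python sketch shows the arithmetic under the assumption that mirror partition leaders are spread evenly across the destination brokers; the function names and the example budget are hypothetical, and uneven leadership would call for a more conservative value.

```python
def per_broker_link_quota(cluster_budget_bytes_per_sec: int, broker_count: int) -> int:
    """Split a cluster-wide byte-rate budget evenly across destination brokers."""
    if broker_count <= 0:
        raise ValueError("broker_count must be positive")
    return cluster_budget_bytes_per_sec // broker_count


def broker_property_line(limit_bytes_per_sec: int) -> str:
    """Render the broker config entry for the computed limit."""
    return f"confluent.cluster.link.io.max.bytes.per.second={limit_bytes_per_sec}"


# Hypothetical target: about 300 MiB/s in total across a 6-broker destination cluster.
limit = per_broker_link_quota(300 * 1024 * 1024, 6)
print(broker_property_line(limit))
```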
Monitoring Use Cases and Alert Recommendations¶
Here are common monitoring scenarios and recommended metrics for cluster link operations:
Health Monitoring¶
- Link Status: kafka.server:type=cluster-link-metrics,name=link-count,state=active - Should be 1 on the link coordinator broker for healthy links
- Mirror Topic Health: kafka.server:type=cluster-link-metrics,name=mirror-topic-count,state=Mirror - Track active mirror topics
- Failed Links: kafka.server:type=cluster-link-metrics,name=broker-failed-link-count - Alert when > 0
Performance Monitoring¶
- Replication Lag: kafka.server.link:type=ClusterLinkFetcherManager,name=MaxLag - Monitor cross-cluster lag
- Throughput: kafka.server:type=FetcherStats,name=BytesPerSec - Track data transfer rates
- Mirror Topic Throughput: kafka.server:type=BrokerTopicMetrics,name=MirrorBytesInPerSec - Track mirror topic write rates (CP 7.6.1+)
- Throttling: kafka.server:type=cluster-link,name=fetch-throttle-time-avg - Monitor throttling impact
Error Monitoring¶
- Task Failures: kafka.server:type=cluster-link-metrics,name=link-task-count,state=in-error - Alert on task errors
- Connection Issues: kafka.server:type=cluster-link-metrics,name=reverse-connection-failed-total - Monitor connection failures
- Authentication Problems: Use the reason=authentication or reason=authorization tags to identify auth issues
Capacity Planning¶
- Connection Count: kafka.server:type=cluster-link-metrics,name=reverse-connection-count - Track connection usage
- Queue Sizes: kafka.server:type=cluster-link-metrics,name=remote-admin-queue-size - Monitor queue backlogs
- Thread Usage: kafka.server:type=cluster-link-metrics,name=background-thread-usage - Track resource utilization
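The recommendations above can be expressed as a small rule table. The following Python sketch is one possible shape, not a prescribed implementation: the snapshot keys are hypothetical labels for values you have already aggregated across brokers (for example, the sum of link-count with state=active), and only the three conditions with an explicit expectation in this section are encoded.

```python
from typing import Callable, Dict, List, Tuple

# (snapshot key, predicate that is True when the alert should fire, message)
Rule = Tuple[str, Callable[[float], bool], str]

RULES: List[Rule] = [
    ("link-count{state=active}", lambda v: v < 1, "no broker reports the link as active"),
    ("broker-failed-link-count", lambda v: v > 0, "at least one broker has a failed link"),
    ("link-task-count{state=in-error}", lambda v: v > 0, "a link task is in error"),
]


def evaluate(snapshot: Dict[str, float], rules: List[Rule] = RULES) -> List[str]:
    """Return alert messages for every rule whose metric is present and failing."""
    return [msg for key, fires, msg in rules if key in snapshot and fires(snapshot[key])]
```

Thresholds for lag and throughput are workload-specific, so they are left out of the rule table rather than guessed.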
Complete Metrics and Tags Reference¶
This section provides a comprehensive reference for all Cluster Linking metrics and their available tags.
Common Tags Used Across Multiple Metrics¶
Core Tags:
- link-name={linkName}: The name of the cluster link
- mode={mode}: Link direction (destination, source, bidirectional)
- connection-mode={connectionMode}: Connection mode (inbound, outbound)
- state={state}: Metric-specific state values
- reason={reason}: Error codes or specific reasons
Specialized Tags:
- remote-link-connection-mode={remoteConnectionMode}: Remote side connection mode
- task-name={taskName}: Specific task being monitored
- pool={pool}: Fetcher pool type
- thread={threadId}: Background thread identifier
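When querying these MBeans, it is often convenient to fix a few tags and leave the rest open. JMX object name patterns allow a trailing wildcard (*) in the property list for this purpose. The following Python sketch builds such a pattern from a tag dictionary; the helper name is hypothetical and the code assumes tag values need no quoting.

```python
from typing import Mapping


def object_name_pattern(domain: str, tags: Mapping[str, str]) -> str:
    """Build a JMX ObjectName pattern that fixes the given tags and wildcards the rest."""
    fixed = ",".join(f"{key}={value}" for key, value in tags.items())
    return f"{domain}:{fixed},*" if fixed else f"{domain}:*"


# Match every link-count series for one link, regardless of state, mode, or connection mode.
pattern = object_name_pattern(
    "kafka.server",
    {"type": "cluster-link-metrics", "name": "link-count", "link-name": "my-link"},
)
print(pattern)
```

A tag dictionary is used instead of keyword arguments because tag names such as link-name contain hyphens.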
Link Status and Count Metrics¶
Metric Name | Tags | Description |
---|---|---|
link-count | link-name, state, mode, connection-mode, reason (unavailable only) | Number of links in specified state |
broker-failed-link-count | link-name | Links in failed state per broker |
remote-link-state-count | link-name, remote-link-state, mode, connection-mode | Count by remote link state |
unavailable-link-count | link-name, reason, mode, connection-mode | Unavailable links with reason |
intranet-connectivity-link-count | link-name, mode, connection-mode | Links using intranet connectivity |
prefixed-destination-link-count | None | Links with prefix enabled |
Mirror Topic and Partition Metrics¶
Metric Name | Tags | Description |
---|---|---|
mirror-partition-count | link-name | Active mirror partitions |
mirror-topic-count | link-name, state | Mirror topics by state |
failed-mirror-topic-count | link-name, reason, mode, connection-mode | Failed mirror topics with reason |
MirrorBytesInPerSec | topic (optional) | Bytes per second written to mirror topics (CP 7.6.1+) |
Connection and Network Metrics¶
Metric Name | Tags | Description |
---|---|---|
controller-reverse-connection-count | link-name, mode, connection-mode, remote-link-connection-mode | Persistent controller connections |
reverse-connection-count | link-name, mode, connection-mode | Total reverse connections |
reverse-connection-created-total, reverse-connection-created-rate | link-name, mode, connection-mode | Reverse connections created |
reverse-connection-closed-total, reverse-connection-closed-rate | link-name, mode, connection-mode | Reverse connections closed |
reverse-connection-failed-total, reverse-connection-failed-rate | link-name, mode, connection-mode | Failed reverse connections |
Fetcher and Performance Metrics¶
Metric Name | Tags | Description |
---|---|---|
link-fetcher-count | link-name | Active link fetchers |
link-target-fetcher-count | link-name, pool | Target fetchers per pool |
link-fetcher-throttled-partition-count | link-name | Throttled partitions |
link-fetcher-unassigned-partition-count | link-name | Unassigned partitions |
link-fetcher-degraded-partition-count | link-name, reason | Degraded partitions with reason |
fetch-throttle-time-avg, fetch-throttle-time-max | link-name | Fetch throttle time statistics |
Task Status Metrics¶
Metric Name | Tags | Description |
---|---|---|
link-task-count | link-name, task-name, state, reason, mode, connection-mode | Task status monitoring |
mirror-transition-in-error | link-name, state, reason, mode, connection-mode | Mirror transition errors |
Administrative and Queue Metrics¶
Metric Name | Tags | Description |
---|---|---|
remote-admin-queue-size | mode, connection-mode | Remote admin queue backlog |
remote-admin-queue-time-ms-avg, remote-admin-queue-time-ms-max | link-name | Admin queue wait time |
remote-admin-request-time-ms-avg, remote-admin-request-time-ms-max | link-name | Admin request processing time |
local-admin-queue-size | thread | Local admin queue per thread |
local-admin-queue-time-ms-avg, local-admin-queue-time-ms-max | link-name | Local admin queue wait time |
local-admin-request-time-ms-avg, local-admin-request-time-ms-max | link-name | Local admin request processing time |
Data Sync and Topic Management Metrics¶
Metric Name | Tags | Description |
---|---|---|
linked-topic-partition-addition-total, linked-topic-partition-addition-rate | link-name | Topic partition additions |
linked-leader-epoch-change-total, linked-leader-epoch-change-rate | link-name | Leader epoch changes |
acls-added-total, acls-added-rate | link-name | ACLs successfully added |
acls-add-failed-total, acls-add-failed-rate | link-name | Failed ACL additions |
acls-deleted-total, acls-deleted-rate | link-name | ACLs successfully deleted |
acls-delete-failed-total, acls-delete-failed-rate | link-name | Failed ACL deletions |
describe-acls-from-source-failed-total, describe-acls-from-source-failed-rate | link-name | Failed ACL descriptions |
consumer-offset-committed-total, consumer-offset-committed-rate | link-name | Consumer offsets synced |
consumer-offset-commit-failed-total, consumer-offset-commit-failed-rate | link-name | Failed offset commits |
topic-config-update-total, topic-config-update-rate | link-name | Topic config updates |
topic-config-update-failed-total, topic-config-update-failed-rate | link-name | Failed config updates |
Auto-Mirror and Automation Metrics¶
Metric Name | Tags | Description |
---|---|---|
auto-mirror-created-total, auto-mirror-created-rate | link-name | Auto-created mirror topics |
auto-mirror-create-failed-total, auto-mirror-create-failed-rate | link-name | Failed auto-mirror creation |
auto-mirror-list-topics-from-source-failed-total, auto-mirror-list-topics-from-source-failed-rate | link-name | Failed source topic listing |
auto-mirror-list-topics-from-destination-failed-total, auto-mirror-list-topics-from-destination-failed-rate | link-name | Failed destination topic listing |
auto-mirror-list-mirrors-from-source-failed-total, auto-mirror-list-mirrors-from-source-failed-rate | link-name | Failed mirror listing from source |
list-consumer-group-offsets-from-source-failed-total, list-consumer-group-offsets-from-source-failed-rate | link-name | Failed offset listing from source |
list-consumer-group-offsets-from-destination-failed-total, list-consumer-group-offsets-from-destination-failed-rate | link-name | Failed offset listing from destination |
list-consumer-groups-from-source-failed-total, list-consumer-groups-from-source-failed-rate | link-name | Failed consumer group listing |
prefixed-auto-mirror-created-total, prefixed-auto-mirror-created-rate | link-name | Prefixed auto-created mirrors |
prefixed-auto-mirror-create-failed-total, prefixed-auto-mirror-create-failed-rate | link-name | Failed prefixed auto-creation |
prefixed-auto-mirror-topic-filtered-count | link-name | Topics filtered out from mirroring |
Throttle and Performance Timing Metrics¶
Metric Name | Tags | Description |
---|---|---|
destination-lag-link-fetcher-throttle-total, destination-lag-link-fetcher-throttle-rate | link-name | Throttle due to destination lag |
link-fetcher-produce-throttle-total, link-fetcher-produce-throttle-rate | link-name | Produce quota throttling |
link-fetcher-request-throttle-total, link-fetcher-request-throttle-rate | link-name | Request quota throttling |
link-fetcher-fetch-time-avg, link-fetcher-fetch-time-max | link-name, state | Fetch timing (state=fetch or wait) |
time-to-stop-mirror-topic-failover-ms-avg, time-to-stop-mirror-topic-failover-ms-max | link-name | Time to failover mirror topic |
time-to-stop-mirror-topic-promote-ms-avg, time-to-stop-mirror-topic-promote-ms-max | link-name | Time to promote mirror topic |
link-unavailable-total, link-unavailable-rate | link-name, mode, connection-mode | Link unavailability incidents |
Background Thread and Resource Metrics¶
Metric Name | Tags | Description |
---|---|---|
background-thread-usage | thread | Background thread active usage |
background-thread-tenants | thread | Tenants per background thread |
background-thread-link-coordinators | thread | Link coordinators per thread |
background-thread-max-task-wait-time-ms | thread | Max task wait time per thread |
fetch-size-allocated-total | pool | Total fetch size allocated |
tenant-fetch-size-allocated-min | pool | Min tenant fetch allocation |
tenant-fetch-size-allocated-max | pool | Max tenant fetch allocation |
Multi-Tenant Specific Metrics¶
Metric Name | Tags | Description |
---|---|---|
metadata-topic-owned-partitions-count | None | Metadata topic partitions owned |
link-count-with-owned-link-coordinator | None | Links with owned coordinator |
max-link-count-per-metadata-topic-partition | None | Max links per metadata partition |
Task Names for task-name Tag¶
- consumer-offset-sync: Consumer offset synchronization
- acl-sync: ACL synchronization
- auto-create-mirror: Automatic mirror creation
- topic-configs-sync: Topic configuration sync
- clear-mirror-start-offsets: Clear mirror start offsets
- pause-mirror-topics: Pause mirror topics
- check-availability: Availability checks
- state-aggregator: State aggregation
- retry-task: Retry failed operations
- periodic-partition-scheduler: Partition scheduler
- degraded-partition-monitor: Degraded partition monitor
Common Error Codes for reason Tag¶
Authentication/Authorization:
authentication, broker-authentication
authorization, broker-authorization
security-disabled
Configuration Issues:
misconfiguration
policy-violation
invalid-request
Resource Issues:
topic-exists
invalid-topic
unknown-topic-or-partition
acl-limit-exceeded
consumer-group-in-use
Operational Issues:
internal
suppressed-errors
remote-link-not-found
remote-mirror-not-found
link-not-found
State Transition Issues:
unexpected-mirror-state
failed-to-update-metadata
failover-timeout
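The groupings above can be turned into a lookup so that alerts are routed by category rather than by individual code. The following Python sketch copies the categories from this section; the constant and function names are hypothetical, and codes not listed here (such as no-error) fall through to a default label.

```python
REASON_CATEGORIES = {
    "Authentication/Authorization": {
        "authentication", "broker-authentication",
        "authorization", "broker-authorization", "security-disabled",
    },
    "Configuration Issues": {"misconfiguration", "policy-violation", "invalid-request"},
    "Resource Issues": {
        "topic-exists", "invalid-topic", "unknown-topic-or-partition",
        "acl-limit-exceeded", "consumer-group-in-use",
    },
    "Operational Issues": {
        "internal", "suppressed-errors", "remote-link-not-found",
        "remote-mirror-not-found", "link-not-found",
    },
    "State Transition Issues": {
        "unexpected-mirror-state", "failed-to-update-metadata", "failover-timeout",
    },
}


def categorize_reason(reason: str) -> str:
    """Map a reason tag value to its category, or 'Uncategorized' if unknown."""
    for category, reasons in REASON_CATEGORIES.items():
        if reason in reasons:
            return category
    return "Uncategorized"
```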