Broker and Controller Metrics

This topic describes JMX metrics for Kafka brokers, KRaft mode, and controllers. These metrics are useful for monitoring the health and performance of your Kafka cluster.

For information about how to configure JMX, see Configure JMX for Monitoring.

Broker metrics

There are many metrics reported at the broker and controller level that can be monitored and used to troubleshoot issues with your cluster. At minimum, you should monitor and set alerts on ActiveControllerCount, OfflinePartitionsCount, and UncleanLeaderElectionsPerSec.

AtMinIsr

MBean: kafka.cluster:type=Partition,topic={topic},name=AtMinIsr,partition={partition}: The number of partitions whose in-sync replicas count is equal to the minIsr value.

Bandwidth quota

MBean: kafka.server:type={Produce|Fetch},user={userName},client-id={clientId}

Use the attributes of this metric to measure the bandwidth quota. This metric has the following attributes:

- throttle-time: the amount of time, in milliseconds, that the client was throttled. Ideally, this is 0.
- byte-rate: the data produce/consume rate of the client, in bytes per second.

The tags to specify in the MBean name depend on the type of quota applied:

- For (user, client-id) quotas, specify both user and client-id.
- If a per-client-id quota is applied to the client, do not specify user.
- If a per-user quota is applied, do not specify client-id.
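
The tagging rules above can be sketched in a few lines of Python. This is only an illustration: the helper name quota_mbean_name is hypothetical and not part of any Kafka tooling, and it does nothing more than assemble the MBean name pattern shown in this section.

```python
def quota_mbean_name(kind, user=None, client_id=None):
    """Build a bandwidth quota MBean name (hypothetical helper).

    kind is 'Produce' or 'Fetch'. Pass user, client_id, or both,
    depending on the type of quota applied to the client.
    """
    if kind not in ("Produce", "Fetch"):
        raise ValueError("kind must be 'Produce' or 'Fetch'")
    if user is None and client_id is None:
        raise ValueError("specify user, client_id, or both")
    parts = ["type=" + kind]
    if user is not None:
        parts.append("user=" + user)
    if client_id is not None:
        parts.append("client-id=" + client_id)
    return "kafka.server:" + ",".join(parts)
```

For example, a per-user quota on the produce path would use quota_mbean_name("Produce", user="alice"), which omits the client-id tag.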

BrokerRegistrationState

MBean: kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X

A per-broker metric that displays the following state values:

10 indicates that the broker is fenced
20 indicates that the broker is in controlled shutdown
30 indicates that the broker is active

This metric is added when a broker registers and is removed when it unregisters. This state is derived from the registration records contained in the metadata log.
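
When this gauge is exported to a dashboard, the numeric codes are easier to read as labels. The following Python sketch maps the three values listed above to names; the function name and the "unknown" fallback are assumptions made for illustration.

```python
# State codes as listed for BrokerRegistrationState.
BROKER_REGISTRATION_STATES = {
    10: "fenced",
    20: "controlled shutdown",
    30: "active",
}


def describe_registration_state(value):
    """Translate a BrokerRegistrationState value into a readable label."""
    code = int(value)
    # Any code not listed above is reported as unknown rather than guessed.
    return BROKER_REGISTRATION_STATES.get(code, "unknown ({})".format(code))
```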

BytesInPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic={topicName}: The incoming byte rate from clients, per topic. Omitting ‘topic={…}’ will yield the all-topic rate.
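
As a small illustration of the optional topic tag, the Python sketch below builds either the per-topic or the all-topic MBean name for a BrokerTopicMetrics metric. The helper name broker_topic_mbean is hypothetical.

```python
def broker_topic_mbean(name, topic=None):
    """Return a BrokerTopicMetrics MBean name (hypothetical helper).

    With topic=None the topic tag is omitted, which selects the
    all-topic rate.
    """
    base = "kafka.server:type=BrokerTopicMetrics,name=" + name
    if topic is None:
        return base
    return base + ",topic=" + topic
```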

BytesOutPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic={topicName}: The outgoing byte rate to clients per topic. Omitting ‘topic={…}’ will yield the all-topic rate.

BytesRejectedPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic={topicName}: The rejected byte rate per topic, due to the record batch size being greater than max.message.bytes configuration. Omitting ‘topic={…}’ will yield the all-topic rate.

clientSoftwareName/clientSoftwareVersion

MBean:

kafka.server:clientSoftwareName=(name),clientSoftwareVersion=(version),listener=(listener),networkProcessor=(processor-index),type=(type)

The name and version of the client software connected to the broker. For example, the Kafka 2.4 Java client produces the following MBean on the broker:

kafka.server:clientSoftwareName=apache-kafka-java,clientSoftwareVersion=2.4.0,listener=PLAINTEXT,networkProcessor=1,type=socket-server-metrics
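
To pull the client name and version out of an MBean name like the one above, a simple split is usually enough. The following Python sketch is a minimal parser under one assumption: property values are unquoted and contain no commas or equals signs, as in the example. The function name is hypothetical.

```python
def parse_mbean_name(mbean):
    """Split an MBean name into (domain, properties dict).

    Assumes plain key=value properties without quoting.
    """
    domain, _, props = mbean.partition(":")
    pairs = (item.split("=", 1) for item in props.split(","))
    return domain, {key: value for key, value in pairs}


example = (
    "kafka.server:clientSoftwareName=apache-kafka-java,"
    "clientSoftwareVersion=2.4.0,listener=PLAINTEXT,"
    "networkProcessor=1,type=socket-server-metrics"
)
domain, props = parse_mbean_name(example)
print(props["clientSoftwareName"], props["clientSoftwareVersion"])
```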

connection-count

MBean: kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-count

The number of currently open connections to the broker.

connection-creation-rate

MBean: kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-creation-rate

The number of new connections established per second.

consumer-lag-offsets

MBean:

kafka.server:type=tenant-metrics,member={mbrId},topic={tpcName},consumer-group={gpName},partition={Id},client-id={cliId},group-protocol={grpProtocol}

Attribute: consumer-lag-offsets

This metric is the difference between the last offset stored by the broker and the last committed offset for a specific consumer group name, client ID, member ID, partition ID, topic name, and group protocol. The group protocol specifies the rebalance protocol used by the consumer group, currently either classic or consumer. For more information about the rebalance protocols, see Consumer Rebalance Protocols. This metric provides the consumer lag in offsets only and does not report latency. In addition, it is not reported for any groups that are not alive or are empty.

To enable this metric, you must set the following server properties.

confluent.consumer.lag.emitter.enabled=true # default is false
confluent.consumer.lag.emitter.interval.ms=60000 # default is 60000

For more information about this metric, see Monitor Consumer Lag in Confluent Platform.
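
The lag calculation described in this section is a plain difference of two offsets. The Python sketch below restates it; the parameter names are assumptions, and the group total is a convenience aggregation added for illustration, not something the broker reports.

```python
def consumer_lag_offsets(last_stored_offset, last_committed_offset):
    """Lag in offsets for one partition of one consumer group."""
    return last_stored_offset - last_committed_offset


def total_group_lag(partition_offsets):
    """Sum the lag over (last_stored, last_committed) pairs, one per partition."""
    return sum(
        consumer_lag_offsets(stored, committed)
        for stored, committed in partition_offsets
    )
```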

ConsumerLag

MBean: kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic={topicName},partition=([0-9]+)

The lag in number of messages per follower replica. This is useful to know if the replica is slow or has stopped replicating from the leader and if the associated brokers need to be removed from the In-Sync Replicas list.

ControlledShutdownBrokerCount

MBean: kafka.controller:type=KafkaController,name=ControlledShutdownBrokerCount: The number of brokers currently in controlled shutdown.

CurrentControllerId

MBean: kafka.server:type=MetadataLoader,name=CurrentControllerId: Outputs the ID of the current controller, or -1 if none is known. Reports the current controller ID on broker and controller nodes.

DelayQueueSize

MBean: kafka.server:type=Produce,name=DelayQueueSize: The number of producer clients currently being throttled. The value can be any number greater than or equal to 0.
Important
For monitoring quota applications and throttled clients, use the Bandwidth quota and Request quota metrics.

ElectionFromEligibleLeaderReplicasPerSec

MBean: kafka.controller:type=ControllerStats,name=ElectionFromEligibleLeaderReplicasPerSec: The Eligible Leader Replicas (ELR) election rate. It increments by one when a leader is elected.

ElectionRateAndTimeMs

MBean: kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs: The broker leader election rate and latency in milliseconds. This is non-zero when there are broker failures.

FailedFetchRequestsPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec: The fetch request rate for requests that failed.

FailedProduceRequestsPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec: The produce request rate for requests that failed.

InSyncReplicasCount

MBean: kafka.cluster:type=Partition,topic={topic},name=InSyncReplicasCount,partition={partition}: A gauge metric that indicates the in-sync replica count per topic partition leader.

InvalidMagicNumberRecordsPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=InvalidMagicNumberRecordsPerSec: The message validation failure rate due to an invalid magic number. This should be 0.

InvalidMessageCrcRecordsPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=InvalidMessageCrcRecordsPerSec: The message validation failure rate due to an incorrect CRC checksum.

InvalidOffsetOrSequenceRecordsPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=InvalidOffsetOrSequenceRecordsPerSec: The message validation failure rate due to non-continuous offset or sequence number in batch. Normally this should be 0.

IsrExpandsPerSec

MBean: kafka.server:type=ReplicaManager,name=IsrExpandsPerSec: Measures the expansion of in-sync replicas per second. When a broker is brought up after a failure, it starts catching up by reading from the leader. Once it is caught up, it gets added back to the ISR.

IsrShrinksPerSec

MBean: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec: Measures the reduction of in-sync replicas per second. If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0.

LeaderCount

MBean: kafka.server:type=ReplicaManager,name=LeaderCount: The number of leaders on this broker. This should be mostly even across all brokers. If not, set auto.leader.rebalance.enable to true on all brokers in the cluster.
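
One way to judge whether leadership is "mostly even" is to compare the spread of LeaderCount values to their mean. The Python sketch below does this for values you have already collected from each broker; the function name and the choice of (max - min) / mean as the measure are assumptions, and any alert threshold is left to you.

```python
def leader_imbalance(leader_counts):
    """Return (max - min) / mean of per-broker LeaderCount values.

    leader_counts maps a broker ID to its LeaderCount value.
    A result of 0.0 means leadership is perfectly even.
    """
    counts = list(leader_counts.values())
    if not counts:
        return 0.0
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    return (max(counts) - min(counts)) / mean
```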

linux-disk-read-bytes

MBean: kafka.server:type=KafkaServer,name=linux-disk-read-bytes: The total number of bytes read by the broker process, including reads from all disks. The total doesn’t include reads from page cache. Available only on Linux-based systems.

linux-disk-write-bytes

MBean: kafka.server:type=KafkaServer,name=linux-disk-write-bytes: The total number of bytes written by the broker process, including writes from all disks. Available only on Linux-based systems.

MessageConversionsPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name={Produce|Fetch}MessageConversionsPerSec,topic=([-.\w]+)

The message format conversion rate, for Produce or Fetch requests, per topic. Omitting ‘topic={…}’ will yield the all-topic rate.

MessagesInPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic={topicName}: The incoming message rate per topic. Omitting ‘topic={…}’ will yield the all-topic rate.

NoKeyCompactedTopicRecordsPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=NoKeyCompactedTopicRecordsPerSec: The message validation failure rate due to no key specified for compacted topic. This should be 0.

PartitionCount

MBean: kafka.server:type=ReplicaManager,name=PartitionCount: The number of partitions on this broker. This should be mostly even across all brokers.

PartitionsWithLateTransactionsCount

MBean: kafka.server:type=ReplicaManager,name=PartitionsWithLateTransactionsCount: The number of partitions that have open transactions with durations exceeding the transaction.max.timeout.ms property value set on the broker.

PurgatorySize (fetch)

MBean: kafka.server:type=DelayedOperationPurgatory,delayedOperation=Fetch,name=PurgatorySize: The number of requests waiting in the fetch purgatory. This is high if consumers use a large value for fetch.wait.max.ms.

PurgatorySize (produce)

MBean: kafka.server:type=DelayedOperationPurgatory,delayedOperation=Produce,name=PurgatorySize: The number of requests waiting in the producer purgatory. This should be non-zero when acks=all is used on the producer.

ReassigningPartitions

MBean: kafka.server:type=ReplicaManager,name=ReassigningPartitions: The number of reassigning partitions.

ReassignmentBytesInPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesInPerSec: The incoming byte rate of reassignment traffic.

ReplicasCount

MBean: kafka.cluster:type=Partition,topic={topic},name=ReplicasCount,partition={partition}: A gauge metric that indicates the replica count per topic partition leader.

ReplicationBytesInPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec,topic={topicName}: The incoming byte rate from other brokers per topic. Omitting ‘topic={…}’ will yield the all-topic rate.

ReplicationBytesOutPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec: The byte-out rate to other brokers.

Request quota

MBean: kafka.server:type=Request,user={userName},client-id={clientId}

Use the attributes of this metric to measure request quota. This metric has the following attributes:

- throttle-time: the amount of time, in milliseconds, that the client was throttled. Ideally, this is 0.
- request-time: the percentage of time spent in broker network and I/O threads to process requests from the client group.

The tags to specify in the MBean name depend on the type of quota applied:

- For (user, client-id) quotas, specify both user and client-id.
- If a per-client-id quota is applied to the client, do not specify user.
- If a per-user quota is applied, do not specify client-id.

RequestHandlerAvgIdlePercent

MBean: kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent: The average fraction of time the request handler threads are idle. Values range from 0 (all resources are used) to 1 (all resources are available).
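
Because this metric reports an idle fraction, it can help to invert it into a utilization figure for dashboards. The Python sketch below shows the conversion; the function name is hypothetical.

```python
def handler_utilization_percent(avg_idle_fraction):
    """Convert an idle fraction in [0, 1] to a utilization percentage."""
    if not 0.0 <= avg_idle_fraction <= 1.0:
        raise ValueError("idle fraction must be between 0 and 1")
    return (1.0 - avg_idle_fraction) * 100.0
```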

TimeSinceLastHeartbeatReceivedMs

MBean: kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs,broker=X

A per-broker metric that reports the time in milliseconds since the last heartbeat received by the controller.

The maximum value is the heartbeat session timeout limit. This is because the map whose values are being exposed removes the broker’s contact time when the broker gets fenced.
The default session timeout is nine seconds.
This metric is only reported in the active controller because its soft state is contained in the BrokerHeartbeatTracker.

This metric is added when a broker registers and removed when it unregisters.
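
Since the value is capped by the session timeout, a useful derived figure is how much time a broker has left before it would be fenced. The Python sketch below computes that headroom, assuming the nine-second default noted above unless you pass a different timeout. The names are hypothetical.

```python
# Nine-second default session timeout, expressed in milliseconds.
DEFAULT_SESSION_TIMEOUT_MS = 9000


def heartbeat_headroom_ms(time_since_last_heartbeat_ms,
                          session_timeout_ms=DEFAULT_SESSION_TIMEOUT_MS):
    """Milliseconds remaining before the session timeout is reached."""
    return max(session_timeout_ms - time_since_last_heartbeat_ms, 0)
```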

TotalFetchRequestsPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec: The fetch request rate per second.

TotalProduceRequestsPerSec

MBean: kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec: The produce request rate per second.

UncleanLeaderElectionsPerSec

MBean: kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec: The unclean broker leader election rate. This should be 0.

UnderMinIsr

MBean: kafka.cluster:type=Partition,topic={topic},name=UnderMinIsr,partition={partition}: The number of partitions whose in-sync replicas count is less than minIsr. These partitions will be unavailable to producers who use acks=all.

UnderMinIsrPartitionCount

MBean: kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount: The number of partitions whose in-sync replicas count is less than minIsr.

UnderReplicated

MBean: kafka.cluster:type=Partition,topic={topic},name=UnderReplicated,partition={partition}: The number of partitions that are under-replicated, meaning the number of in-sync replicas is less than the replica count.

UnderReplicatedPartitions

MBean: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions: The number of under-replicated partitions (|ISR| < |current replicas|). Replicas that are added as part of a reassignment will not count toward this value. Alert if the value is greater than 0.
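
The ISR-related conditions in this section (AtMinIsr, UnderMinIsr, and UnderReplicated) are all comparisons on the same three numbers. The Python sketch below spells them out for a single partition. It is a simplification: the function name is hypothetical, and it does not account for replicas added as part of a reassignment.

```python
def classify_partition(isr_count, replica_count, min_isr):
    """Evaluate the ISR conditions for one partition.

    isr_count:     number of in-sync replicas
    replica_count: number of assigned replicas
    min_isr:       the minIsr value for the topic
    """
    return {
        "AtMinIsr": isr_count == min_isr,
        "UnderMinIsr": isr_count < min_isr,
        "UnderReplicated": isr_count < replica_count,
    }
```

For example, a partition with three replicas, two of them in sync, and a minIsr of 2 is both at minimum ISR and under-replicated, but not under minimum ISR.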

KRaft broker metrics

These metrics are only produced for a broker when running in KRaft mode.

last-applied-record-lag-ms

MBean: kafka.server:type=broker-metadata-metrics,name=last-applied-record-lag-ms: The difference between now and the timestamp of the last record from the cluster metadata partition that was applied by the broker.
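
The relationship between this lag and the last applied record timestamp can be written as a one-line subtraction. The Python sketch below assumes the timestamp is expressed in milliseconds since the Unix epoch; the function name is hypothetical.

```python
import time


def last_applied_record_lag_ms(last_applied_timestamp_ms, now_ms=None):
    """Return now minus the last applied record timestamp, in milliseconds.

    now_ms defaults to the current wall-clock time.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - last_applied_timestamp_ms
```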

last-applied-record-offset

MBean: kafka.server:type=broker-metadata-metrics,name=last-applied-record-offset: The offset of the last record from the cluster metadata partition that was applied by the broker.

last-applied-record-timestamp

MBean: kafka.server:type=broker-metadata-metrics,name=last-applied-record-timestamp: The timestamp of the last record from the cluster metadata partition that was applied by the broker.

metadata-apply-error-count

MBean: kafka.server:type=broker-metadata-metrics,name=metadata-apply-error-count: The number of errors encountered by the BrokerMetadataPublisher while applying a new MetadataImage based on the latest MetadataDelta.

metadata-load-error-count

MBean: kafka.server:type=broker-metadata-metrics,name=metadata-load-error-count: The number of errors encountered by the BrokerMetadataListener while loading the metadata log and generating a new MetadataDelta based on it.

KRaft quorum metrics

These metrics allow you to monitor the KRaft quorum and metadata log. They are reported on both controllers and brokers in a KRaft cluster.

CurrentMetadataVersion

MBean: kafka.server:type=MetadataLoader,name=CurrentMetadataVersion: Outputs the feature level of the current effective metadata version.

HandleLoadSnapshotCount

MBean: kafka.server:type=MetadataLoader,name=HandleLoadSnapshotCount: The total number of times we have loaded a KRaft snapshot since the process was started.

The following are attributes of the kafka.server:type=raft-metrics MBean:

append-records-rate

MBean: kafka.server:type=raft-metrics Attribute: append-records-rate: The average number of records appended per sec by the leader of the raft quorum.

commit-latency-avg

MBean: kafka.server:type=raft-metrics Attribute: commit-latency-avg: The average time in milliseconds to commit an entry in the raft log.

commit-latency-max

MBean: kafka.server:type=raft-metrics Attribute: commit-latency-max: The maximum time in milliseconds to commit an entry in the raft log.

current-epoch

MBean: kafka.server:type=raft-metrics Attribute: current-epoch: The current quorum epoch.

current-leader

MBean: kafka.server:type=raft-metrics Attribute: current-leader: The current quorum leader’s id; -1 indicates unknown.

current-state

MBean: kafka.server:type=raft-metrics Attribute: current-state: The current state of this member; possible values are leader, candidate, voted, follower, unattached, observer.

current-vote

MBean: kafka.server:type=raft-metrics Attribute: current-vote: The current voted leader’s id; -1 indicates not voted for anyone.
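
The current-leader, current-state, and current-vote attributes use -1 and a fixed set of state names, which are easy to normalize once read. The Python sketch below takes a dictionary of attribute values that you have already fetched and returns a cleaned-up summary; the function name and the use of None for "unknown" are assumptions.

```python
# State names as listed for the current-state attribute.
RAFT_STATES = {"leader", "candidate", "voted", "follower", "unattached", "observer"}


def summarize_quorum(attrs):
    """Normalize raft-metrics attributes, mapping -1 to None."""
    state = attrs["current-state"]
    if state not in RAFT_STATES:
        raise ValueError("unexpected current-state: {!r}".format(state))
    leader = attrs["current-leader"]
    vote = attrs["current-vote"]
    return {
        "state": state,
        "leader": None if leader == -1 else leader,
        "voted_for": None if vote == -1 else vote,
    }
```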

election-latency-avg

MBean: kafka.server:type=raft-metrics Attribute: election-latency-avg: The average time in milliseconds spent on electing a new leader.

election-latency-max

MBean: kafka.server:type=raft-metrics Attribute: election-latency-max: The maximum time in milliseconds spent on electing a new leader.

fetch-records-rate

MBean: kafka.server:type=raft-metrics Attribute: fetch-records-rate: The average number of records fetched from the leader of the raft quorum.

high-watermark

MBean: kafka.server:type=raft-metrics Attribute: high-watermark: The high watermark maintained on this member; -1 if it is unknown.

log-end-epoch

MBean: kafka.server:type=raft-metrics Attribute: log-end-epoch: The current raft log end epoch.

log-end-offset

MBean: kafka.server:type=raft-metrics Attribute: log-end-offset: The current raft log end offset.

number-unknown-voter-connections

MBean: kafka.server:type=raft-metrics Attribute: number-unknown-voter-connections: The number of unknown voters whose connection information is not cached. The value of this metric is always 0.

poll-idle-ratio-avg

MBean: kafka.server:type=raft-metrics Attribute: poll-idle-ratio-avg: The average fraction of time the client’s poll() is idle as opposed to waiting for the user code to process records.

LatestSnapshotGeneratedAgeMs

MBean: kafka.server:type=SnapshotEmitter,name=LatestSnapshotGeneratedAgeMs: The interval in milliseconds since the latest snapshot that the node has generated. If no snapshot has been generated yet, this is the approximate time delta since the process was started.

LatestSnapshotGeneratedBytes

MBean: kafka.server:type=SnapshotEmitter,name=LatestSnapshotGeneratedBytes: The total size in bytes of the latest snapshot that the node has generated. If a snapshot has not been generated yet, this is the size of the latest snapshot that was loaded. If no snapshots have been generated or loaded, this is 0.

Controller metrics

The following metrics are exposed by a controller. For more about monitoring KRaft, see Monitor KRaft.

ActiveBrokerCount

MBean: kafka.controller:type=KafkaController,name=ActiveBrokerCount: The number of active brokers as observed by this controller.

ActiveControllerCount

MBean: kafka.controller:type=KafkaController,name=ActiveControllerCount: The number of active controllers in the cluster. Valid values are ‘0’ or ‘1’. Alert if the aggregated sum across all brokers in the cluster is anything other than 1 because there should be exactly one controller per cluster.
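
The alert condition described above amounts to summing the gauge across nodes and comparing the result to 1. The Python sketch below shows that check for values you have already collected; the function name and the input shape (a mapping of node ID to the sampled value) are assumptions.

```python
def active_controller_alert(counts_by_node):
    """Return True if the summed ActiveControllerCount is not exactly 1.

    counts_by_node maps a node ID to its sampled value (0 or 1).
    """
    total = sum(counts_by_node.values())
    return total != 1
```

A sum of 0 indicates that no node currently reports itself as the active controller, and a sum greater than 1 indicates that more than one does.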

EventQueueOperationsStartedCount

MBean: kafka.controller:type=KafkaController,name=EventQueueOperationsStartedCount: For KRaft mode, the total number of controller event queue operations that were started. This includes deferred operations.

EventQueueOperationsTimedOutCount

MBean: kafka.controller:type=KafkaController,name=EventQueueOperationsTimedOutCount: For KRaft mode, the total number of controller event queue operations that timed out before they could be performed.

EventQueueProcessingTimeMs

MBean: kafka.controller:type=ControllerEventManager,name=EventQueueProcessingTimeMs: A histogram of the time, in milliseconds, that requests spent being processed in the controller event queue.

EventQueueSize

MBean: kafka.controller:type=ControllerEventManager,name=EventQueueSize: The size of the controller’s event queue.

EventQueueTimeMs

MBean: kafka.controller:type=ControllerEventManager,name=EventQueueTimeMs: Time that an event (except the Idle event) waits, in milliseconds, in the controller event queue before being processed.

FencedBrokerCount

MBean: kafka.controller:type=KafkaController,name=FencedBrokerCount: In KRaft mode, the number of fenced but registered brokers as observed by this controller.

GlobalPartitionCount

MBean: kafka.controller:type=KafkaController,name=GlobalPartitionCount: The number of partitions across all topics in the cluster.

GlobalTopicCount

MBean: kafka.controller:type=KafkaController,name=GlobalTopicCount: The number of global topics as observed by this Controller.

LastAppliedRecordLagMs

MBean: kafka.controller:type=KafkaController,name=LastAppliedRecordLagMs: The difference, in milliseconds, between now and the timestamp of the last record from the cluster metadata partition that was applied by the controller. For active controllers the value of this lag is always zero.

LastAppliedRecordOffset

MBean: kafka.controller:type=KafkaController,name=LastAppliedRecordOffset: The offset of the last record from the cluster metadata partition that was applied by the Controller.

LastAppliedRecordTimestamp

MBean: kafka.controller:type=KafkaController,name=LastAppliedRecordTimestamp: The timestamp of the last record from the cluster metadata partition that was applied by the controller.

LastCommittedRecordOffset

MBean: kafka.controller:type=KafkaController,name=LastCommittedRecordOffset: The offset of the last record committed to this Controller.

MetadataErrorCount

MBean: kafka.controller:type=KafkaController,name=MetadataErrorCount: The number of times this controller node has encountered an error during metadata log processing.

NewActiveControllersCount

MBean: kafka.controller:type=KafkaController,name=NewActiveControllersCount: For KRaft mode, counts the number of times this node has seen a new controller elected. A transition to the “no leader” state is not counted here. If the same controller as before becomes active, that still counts.

OfflinePartitionsCount

MBean: kafka.controller:type=KafkaController,name=OfflinePartitionsCount: The number of partitions that don’t have an active leader and are therefore not writable or readable. Alert if the value is greater than 0.

PreferredReplicaImbalanceCount

MBean: kafka.controller:type=KafkaController,name=PreferredReplicaImbalanceCount: The count of topic partitions for which the leader is not the preferred leader.

ReplicasIneligibleToDeleteCount

MBean: kafka.controller:type=KafkaController,name=ReplicasIneligibleToDeleteCount: The number of ineligible pending replica deletes.

ReplicasToDeleteCount

MBean: kafka.controller:type=KafkaController,name=ReplicasToDeleteCount: The number of pending replica deletes.

TimedOutBrokerHeartbeatCount

MBean: kafka.controller:type=KafkaController,name=TimedOutBrokerHeartbeatCount: For KRaft mode, the number of broker heartbeats that timed out on this controller since the process was started. Note that only active controllers handle heartbeats, so only they will see increases in this metric.

TopicsIneligibleToDeleteCount

MBean: kafka.controller:type=KafkaController,name=TopicsIneligibleToDeleteCount: The number of ineligible pending topic deletes.

TopicsToDeleteCount

MBean: kafka.controller:type=KafkaController,name=TopicsToDeleteCount: The number of pending topic deletes.