Monitor Intelligent Replication in Confluent Private Cloud

Confluent Private Cloud exposes JMX metrics to help you measure cluster and application performance. In addition to the standard Kafka metrics, Intelligent Replication-specific metrics are available in the `kafka.server:type=IntelligentReplication` MBean.

Leader-side metrics

The following metrics are available on broker leaders for monitoring push replication behavior:

Metric Name	Type	Description
`PushManagerMemoryBytesUsed`	Gauge	Memory, in bytes, used by MemoryRecords that are enqueued or in flight in the push replication mechanism.
`PushSessionEndCount`	Counter	Incremented whenever a push replication session ends for any reason; a reason tag indicates the cause. When a session ends because of a failure, the broker automatically transitions to pull replication.
`StoppingPushSessionsCount`	Gauge	Number of push sessions that are currently being stopped, that is, sessions where the leader has initiated the stop process but has not yet received acknowledgment from the follower.
`PushEventQueueProcessingTimeMs`	Histogram	Processing time, in milliseconds, of enqueued events in the push replication mechanism.
`PushEventProcessingFailure`	Counter	Incremented when an error occurs while processing events in the push replication event queue.
`FollowersAwaitingPushTransition`	Counter	Cumulative number of times follower replicas have satisfied every condition for transitioning to intelligent replication mode except being fully caught up to the leader’s log end offset (LEO).

Follower-side metrics

The following metrics are available on broker followers for monitoring push replication behavior:

Metric Name	Type	Description
`PushPartitionsCount`	Gauge	Number of follower partitions that have transitioned to push replication mode and have therefore paused fetching from the leader.
`PullTransitionsCount`	Counter	Incremented on the follower when a partition successfully transitions from push to pull mode.

Monitor key metrics

Performance Indicators

PushPartitionsCount: Higher values indicate more partitions using optimized push replication. In a healthy cluster, you should see this number increase as partitions transition from pull to push mode.
PushEventQueueProcessingTimeMs: Lower values indicate efficient event processing. High values may indicate resource constraints or bottlenecks in the push replication mechanism.

Health Indicators

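PushSessionEndCount: Monitor the reason tag to identify issues. Some session ends are part of normal operation, such as those caused by leadership changes, but frequent session ends with non-retriable errors indicate a problem that needs investigation.
PushEventProcessingFailure: Should remain at or near zero. Because this is a counter, watch how quickly it grows: a steadily increasing value indicates a system health issue that may require investigation.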
StoppingPushSessionsCount: Should typically be low. Persistently high values suggest followers are not properly acknowledging session end notifications.
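
One way to act on the PushSessionEndCount reason tag is to track what share of session ends fall outside the reasons you consider normal. The following Python snippet is a minimal sketch of that idea. It assumes you have already collected the per-reason increase of the counter over some interval into a dictionary. The reason names used here (`leadership_change`, `non_retriable_error`) are hypothetical placeholders, not the actual tag values, so replace them with the tags your brokers emit.

```python
from typing import Dict, Set

# Hypothetical reason tag values treated as normal operation.
# Replace these with the reason tags your brokers actually report.
EXPECTED_REASONS: Set[str] = {"leadership_change"}


def unexpected_share(ends_by_reason: Dict[str, int]) -> float:
    """Return the fraction of session ends whose reason is not expected."""
    total = sum(ends_by_reason.values())
    if total == 0:
        return 0.0  # no sessions ended in this interval
    unexpected = sum(
        count
        for reason, count in ends_by_reason.items()
        if reason not in EXPECTED_REASONS
    )
    return unexpected / total


# Synthetic per-reason increases of PushSessionEndCount over one interval.
interval = {"leadership_change": 8, "non_retriable_error": 2}
print(f"unexpected session ends: {unexpected_share(interval):.0%}")
```

A share that stays near zero suggests sessions are ending mostly for routine reasons. A growing share is a cue to look at which reason tags are contributing.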

Capacity Indicators

PushManagerMemoryBytesUsed: Monitor to prevent memory pressure. Consistently high values indicate that produce traffic may be overwhelming replication capacity.
FollowersAwaitingPushTransition: Because this is a cumulative counter, watch its rate of increase. A rapidly increasing value may indicate that followers are struggling to catch up because continuous incoming records prevent them from reaching the leader’s log end offset.
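
Because FollowersAwaitingPushTransition is a cumulative counter, its raw value is less informative than how fast it grows. The following Python snippet is a minimal sketch of converting periodic samples into per-second rates. It assumes you already collect `(timestamp, value)` pairs from your monitoring system, and it treats any drop in the value as a counter reset, for example after a broker restart. The sample data is synthetic.

```python
from typing import List, Tuple


def counter_rates(samples: List[Tuple[float, float]]) -> List[float]:
    """Convert (timestamp_seconds, cumulative_value) samples to per-second rates.

    A drop in the cumulative value is treated as a counter reset, so the
    new value itself is used as the increase for that interval.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        increase = v1 - v0 if v1 >= v0 else v1
        rates.append(increase / elapsed)
    return rates


# Synthetic samples taken every 60 seconds; the drop to 3 simulates a reset.
samples = [(0, 100), (60, 106), (120, 130), (180, 3), (240, 3)]
print(counter_rates(samples))
```

Plotting or alerting on these rates, rather than on the raw counter, makes a sudden burst of followers that cannot catch up easier to spot.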

Follow monitoring best practices

Set Up Alerts

Configure alerts for metrics that indicate system health issues:

PushEventProcessingFailure increasing rapidly
PushManagerMemoryBytesUsed consistently high
StoppingPushSessionsCount persistently high
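
The three conditions above can be sketched as simple rules over a window of recent samples. The following Python snippet is an illustration only: the threshold values and the `evaluate_alerts` helper are hypothetical, and in practice you would express these rules in your own monitoring or alerting system. It assumes each metric's recent samples are available as a list ordered from oldest to newest.

```python
from typing import Dict, List

# Hypothetical thresholds; tune them for your own cluster.
FAILURE_INCREASE_LIMIT = 10           # failures tolerated across the window
MEMORY_BYTES_LIMIT = 512 * 1024 ** 2  # memory level treated as "high"
STOPPING_SESSIONS_LIMIT = 5           # session count treated as "high"


def sustained_above(values: List[float], limit: float) -> bool:
    """True when every sample in a non-empty window exceeds the limit."""
    return bool(values) and all(v > limit for v in values)


def counter_increase(values: List[float]) -> float:
    """Increase of a cumulative counter across the window.

    Assumes the counter did not reset within the window.
    """
    return values[-1] - values[0] if len(values) >= 2 else 0.0


def evaluate_alerts(window: Dict[str, List[float]]) -> List[str]:
    """Return the alert messages triggered by one window of samples."""
    alerts = []
    failures = window.get("PushEventProcessingFailure", [])
    if counter_increase(failures) > FAILURE_INCREASE_LIMIT:
        alerts.append("PushEventProcessingFailure increasing rapidly")
    memory = window.get("PushManagerMemoryBytesUsed", [])
    if sustained_above(memory, MEMORY_BYTES_LIMIT):
        alerts.append("PushManagerMemoryBytesUsed consistently high")
    stopping = window.get("StoppingPushSessionsCount", [])
    if sustained_above(stopping, STOPPING_SESSIONS_LIMIT):
        alerts.append("StoppingPushSessionsCount persistently high")
    return alerts


# Synthetic window: failures jump, memory stays high, stopping sessions spike once.
window = {
    "PushEventProcessingFailure": [2, 2, 40],
    "PushManagerMemoryBytesUsed": [600 * 1024 ** 2, 650 * 1024 ** 2, 700 * 1024 ** 2],
    "StoppingPushSessionsCount": [0, 7, 1],
}
for message in evaluate_alerts(window):
    print(message)
```

Requiring every sample in the window to exceed the limit keeps a single spike, such as the one in StoppingPushSessionsCount here, from raising a "persistently high" alert.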

Track Performance

Monitor performance improvements:

Compare end-to-end latency before and after enabling Intelligent Replication
Track PushPartitionsCount to see adoption across partitions
Monitor CPU usage reduction on brokers
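
The comparisons above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the end-to-end latency samples come from your own client-side measurements, the total number of follower partitions comes from your existing cluster metrics, and the helper names are hypothetical. All numbers shown are synthetic.

```python
import statistics
from typing import Dict, List


def latency_summary(latencies_ms: List[float]) -> Dict[str, float]:
    """Return the median and 99th percentile of a list of latencies."""
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": statistics.median(latencies_ms), "p99": cuts[98]}


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after; negative means a reduction."""
    return (after - before) / before * 100.0


def push_adoption(push_partitions: int, follower_partitions: int) -> float:
    """Fraction of follower partitions currently in push replication mode."""
    if follower_partitions == 0:
        return 0.0
    return push_partitions / follower_partitions


# Synthetic latency samples in milliseconds, before and after enabling the feature.
before = latency_summary([12.0, 14.0, 15.0, 18.0, 25.0, 40.0])
after = latency_summary([9.0, 10.0, 11.0, 13.0, 17.0, 28.0])
for name in ("p50", "p99"):
    change = percent_change(before[name], after[name])
    print(f"{name}: {before[name]:.1f} ms -> {after[name]:.1f} ms ({change:+.1f}%)")

# Synthetic PushPartitionsCount and follower partition count for one broker.
print(f"push adoption: {push_adoption(30, 40):.0%}")
```

Comparing percentiles rather than averages helps show whether tail latency, and not just typical latency, changed after enabling Intelligent Replication.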

Capacity Planning

Use metrics for capacity planning:

PushManagerMemoryBytesUsed for memory allocation
PushEventQueueProcessingTimeMs for processing capacity