Monitor Intelligent Replication in Confluent Private Cloud
Confluent Private Cloud provides comprehensive JMX metrics to help you measure cluster and
application performance. In addition to the standard Kafka metrics, several metrics specific
to Intelligent Replication are available in the kafka.server:type=IntelligentReplication MBean.
Leader side metrics
The following metrics are available on broker leaders for monitoring push replication behavior:
Metric Name | Type | Description |
---|---|---|
PushManagerMemoryBytesUsed | Gauge | Memory, in bytes, used by MemoryRecords that are enqueued or in flight in the push replication mechanism. |
PushSessionEndCount | Counter | Incremented when a push replication session ends for any reason, with a reason tag indicating the cause. When a session ends because of a failure, the broker automatically transitions to pull replication. |
StoppingPushSessionsCount | Gauge | Number of push sessions that are currently being stopped, that is, sessions where the leader has initiated the stop process but has not yet received an acknowledgment from the follower. |
PushEventQueueProcessingTimeMs | Histogram | Processing time, in milliseconds, of enqueued events in the push replication mechanism. |
PushEventProcessingFailure | Counter | Incremented when an error occurs while processing events in the push replication event queue. |
FollowersAwaitingPushTransition | Counter | Cumulative number of times follower replicas have satisfied all conditions for transitioning to Intelligent Replication mode except being fully caught up to the leader’s log end offset (LEO). |
Follower side metrics
The following metrics are available on broker followers for monitoring push replication behavior:
Metric Name | Type | Description |
---|---|---|
PushPartitionsCount | Gauge | Number of follower partitions that have transitioned to push replication mode and have therefore paused fetching from the leader. |
PullTransitionsCount | Counter | Incremented on the follower when a partition successfully transitions from PUSH to PULL. |
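Counter-type metrics such as PullTransitionsCount are cumulative, so the raw reading is usually less informative than how much it grew between two readings. The following is a minimal sketch of that conversion, assuming you already collect periodic readings with your own JMX tooling. The function name and sample values are hypothetical, and the reset handling assumes the counter restarts from zero when the broker process restarts.

```python
from typing import List


def counter_increases(samples: List[int]) -> List[int]:
    """Convert cumulative counter readings into per-interval increases.

    A reading lower than the previous one is treated as a counter reset
    (for example, after a broker restart), so the new reading itself is
    taken as the increase for that interval.
    """
    increases = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increases.append(current - previous)
        else:
            # Counter went backwards: assume it restarted from zero.
            increases.append(current)
    return increases


# Hypothetical PullTransitionsCount readings taken at a fixed interval.
readings = [10, 12, 12, 3, 7]
print(counter_increases(readings))  # [2, 0, 3, 4]
```

Gauge-type metrics such as PushPartitionsCount need no such conversion because each reading already describes the current state.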
Monitor key metrics
- Performance Indicators
  - PushPartitionsCount: Higher values indicate more partitions using optimized push replication. In a healthy cluster, you should see this number increase as partitions transition from pull to push mode.
  - PushEventQueueProcessingTimeMs: Lower values indicate efficient event processing. High values may indicate resource constraints or bottlenecks in the push replication mechanism.
- Health Indicators
  - PushSessionEndCount: Monitor reason tags to identify issues. Normal operational reasons include leadership changes, but frequent non-retriable errors indicate problems.
  - PushEventProcessingFailure: Should remain low or zero. Increasing values indicate system health issues that may require investigation.
  - StoppingPushSessionsCount: Should typically be low. Persistently high values suggest followers are not properly acknowledging session end notifications.
- Capacity Indicators
  - PushManagerMemoryBytesUsed: Monitor to prevent memory pressure. Consistently high values indicate that produce traffic may be overwhelming replication capacity.
  - FollowersAwaitingPushTransition: High values may indicate followers struggling to catch up due to continuous incoming records preventing them from reaching the leader’s log end offset.
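As one way to act on the PushSessionEndCount guidance, the sketch below separates expected session-end reasons from unexpected ones. It assumes you have already collected the number of session ends per reason tag for some time window. The function name and the reason tag values shown are hypothetical, so check the tags your brokers actually report.

```python
from typing import Dict, Set


def unexpected_session_ends(
    ends_by_reason: Dict[str, int], expected_reasons: Set[str]
) -> Dict[str, int]:
    """Return only the non-zero session-end counts whose reason is not expected."""
    return {
        reason: count
        for reason, count in ends_by_reason.items()
        if reason not in expected_reasons and count > 0
    }


# Hypothetical reason tag values, for illustration only.
expected = {"leadership_change"}
window = {"leadership_change": 4, "non_retriable_error": 2, "timeout": 0}
print(unexpected_session_ends(window, expected))  # {'non_retriable_error': 2}
```

An empty result means every session end in the window had a reason you consider part of normal operation.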
Follow monitoring best practices
- Set Up Alerts
  Configure alerts for metrics that indicate system health issues:
  - PushEventProcessingFailure increasing rapidly
  - PushManagerMemoryBytesUsed consistently high
  - StoppingPushSessionsCount persistently high
- Track Performance
  Monitor performance improvements:
  - Compare end-to-end latency before and after enabling Intelligent Replication
  - Track PushPartitionsCount to see adoption across partitions
  - Monitor CPU usage reduction on brokers
- Capacity Planning
  Use metrics for capacity planning:
  - PushManagerMemoryBytesUsed for memory allocation
  - PushEventQueueProcessingTimeMs for processing capacity
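The alert conditions listed under Set Up Alerts can be sketched as simple checks over a window of recent samples. This is an illustration rather than a recommended configuration: the function names, thresholds, and sample values are all hypothetical, and it assumes the samples were collected at a fixed interval by your own monitoring tooling.

```python
from typing import Sequence


def persistently_high(samples: Sequence[float], threshold: float) -> bool:
    """True when the window is non-empty and every sample exceeds the threshold."""
    return len(samples) > 0 and all(value > threshold for value in samples)


def increasing_rapidly(counter_samples: Sequence[int], max_increase: int) -> bool:
    """True when a cumulative counter grew by more than max_increase over the window."""
    if len(counter_samples) < 2:
        return False
    return counter_samples[-1] - counter_samples[0] > max_increase


# Hypothetical samples and thresholds, for illustration only.
memory_samples = [900e6, 950e6, 920e6]  # PushManagerMemoryBytesUsed
failure_samples = [3, 3, 40]            # PushEventProcessingFailure
stopping_samples = [0, 5, 0]            # StoppingPushSessionsCount

alerts = {
    "PushManagerMemoryBytesUsed": persistently_high(memory_samples, 800e6),
    "PushEventProcessingFailure": increasing_rapidly(failure_samples, 10),
    "StoppingPushSessionsCount": persistently_high(stopping_samples, 2),
}
print(alerts)
```

Requiring every sample in the window to exceed the threshold keeps a single spike from firing a "persistently high" alert. The counter check compares only the first and last samples, so it does not account for a counter that resets within the window.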