Monitor Intelligent Replication in Confluent Private Cloud

Confluent Private Cloud exposes JMX metrics that help you measure cluster and application performance. In addition to the standard Kafka metrics, metrics specific to Intelligent Replication are available in the kafka.server:type=IntelligentReplication MBean.
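As a minimal sketch of how these metrics could be discovered over JMX, the following Java snippet lists every MBean that matches the type named above. The class and method names are hypothetical. The trailing ",*" in the pattern is an assumption that individual metrics may carry extra key properties (for example name=...), as standard Kafka broker metrics do; the pattern also matches the bare type. The sketch queries the in-process MBean server only so that it runs anywhere; against a broker you would obtain the connection through JMXConnectorFactory with that broker's JMX service URL.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

final class IntelligentReplicationMBeans {
    // Built from the MBean type named in the text. The ",*" suffix is an
    // assumption: it also matches names that carry extra key properties.
    static final String PATTERN = "kafka.server:type=IntelligentReplication,*";

    // Returns the sorted canonical names of all MBeans matching the pattern.
    static Set<String> matchingNames(MBeanServerConnection conn, String pattern) {
        try {
            Set<String> names = new TreeSet<>();
            for (ObjectName name : conn.queryNames(new ObjectName(pattern), null)) {
                names.add(name.getCanonicalName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("JMX query failed for " + pattern, e);
        }
    }

    public static void main(String[] args) {
        // In-process server, used here only to keep the sketch self-contained.
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        matchingNames(conn, PATTERN).forEach(System.out::println);
    }
}
```

Run inside a JVM that does not host a broker, this prints nothing, because no matching MBeans are registered there.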

Leader-side metrics

The following metrics are available on broker leaders for monitoring push replication behavior:

Metric Name | Type | Description
PushManagerMemoryBytesUsed | Gauge | Memory, in bytes, used by MemoryRecords that are enqueued or in flight in the push replication mechanism.
PushSessionEndCount | Counter | Incremented when a push replication session ends for any reason, with a reason tag indicating the cause. When a session ends because of a failure, the broker automatically transitions to pull replication.
StoppingPushSessionsCount | Gauge | Number of push sessions currently being stopped, that is, sessions where the leader has initiated the stop process but has not yet received acknowledgment from the follower.
PushEventQueueProcessingTimeMs | Histogram | Processing time, in milliseconds, of enqueued events in the push replication mechanism.
PushEventProcessingFailure | Counter | Incremented when an error occurs while processing events in the push replication event queue.
FollowersAwaitingPushTransition | Counter | Cumulative number of times a follower replica has satisfied every condition for transitioning to Intelligent Replication mode except being fully caught up to the leader’s log end offset (LEO).
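Because PushSessionEndCount carries a reason tag, it is often more useful to look at it per reason than as a single total. The sketch below sums counter readings by reason. It assumes the tag surfaces as a `reason` key property on the JMX ObjectName, which the text does not confirm; the class name, the reason values, and the sample counts are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

final class SessionEndReasons {
    // Sums counter readings per "reason" key property (assumed tag name).
    // Names that lack the property are grouped under "unknown".
    static Map<String, Long> countByReason(Map<String, Long> countsByObjectName) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> entry : countsByObjectName.entrySet()) {
            String reason;
            try {
                reason = new ObjectName(entry.getKey()).getKeyProperty("reason");
            } catch (MalformedObjectNameException e) {
                throw new IllegalArgumentException("Invalid ObjectName: " + entry.getKey(), e);
            }
            totals.merge(reason == null ? "unknown" : reason, entry.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String base = "kafka.server:type=IntelligentReplication,name=PushSessionEndCount";
        // Hypothetical reason values and counts, for illustration only.
        Map<String, Long> sample = Map.of(
            base + ",reason=LEADER_CHANGE", 4L,
            base + ",reason=FOLLOWER_TIMEOUT", 1L);
        System.out.println(countByReason(sample));
    }
}
```

Grouping this way makes it easier to separate expected causes, such as leadership changes, from errors that deserve investigation.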

Follower-side metrics

The following metrics are available on broker followers for monitoring push replication behavior:

Metric Name | Type | Description
PushPartitionsCount | Gauge | Number of follower partitions that have transitioned to push replication mode and have therefore paused fetching from the leader.
PullTransitionsCount | Counter | Incremented on the follower when a partition is successfully transitioned from PUSH to PULL.

Monitor key metrics

Performance Indicators

  • PushPartitionsCount: Higher values indicate that more partitions are using optimized push replication. In a healthy cluster, this number increases as partitions transition from pull to push mode.
  • PushEventQueueProcessingTimeMs: Lower values indicate efficient event processing. High values may indicate resource constraints or bottlenecks in the push replication mechanism.

Health Indicators

  • PushSessionEndCount: Monitor the reason tags to identify issues. Leadership changes are a normal operational reason, but frequent non-retriable errors indicate problems.
  • PushEventProcessingFailure: Should stay at or near zero. A rising count indicates system health issues that may require investigation.
  • StoppingPushSessionsCount: Should typically be low. Persistently high values suggest that followers are not properly acknowledging session end notifications.

Capacity Indicators

  • PushManagerMemoryBytesUsed: Monitor to prevent memory pressure. Consistently high values indicate that produce traffic may be overwhelming replication capacity.
  • FollowersAwaitingPushTransition: A rapidly rising count may indicate followers that struggle to catch up because continuous incoming records keep them from reaching the leader’s log end offset.

Follow monitoring best practices

Set Up Alerts

Configure alerts for metrics that indicate system health issues:

  • PushEventProcessingFailure increasing rapidly
  • PushManagerMemoryBytesUsed consistently high
  • StoppingPushSessionsCount persistently high
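The conditions above mix two kinds of checks: a counter that is "increasing rapidly" and gauges that stay "consistently" or "persistently" high. One possible way to express both is sketched below, assuming you already collect periodic samples of each metric. The class name, method names, thresholds, and sample values are hypothetical, and reading "persistently high" as "every sample in the window exceeds the threshold" is only one interpretation.

```java
import java.util.List;

final class ReplicationAlerts {
    // True when every gauge sample in the window exceeds the threshold.
    // An empty window never alerts.
    static boolean sustainedAbove(List<Long> samples, long threshold) {
        if (samples.isEmpty()) {
            return false;
        }
        for (long value : samples) {
            if (value <= threshold) {
                return false;
            }
        }
        return true;
    }

    // Per-second growth of a cumulative counter between two readings.
    // If the later reading is smaller, the counter is assumed to have been
    // reset (for example by a broker restart) and its value is used as the delta.
    static double ratePerSecond(long earlier, long later, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        long delta = later >= earlier ? later - earlier : later;
        return delta * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        long memoryThresholdBytes = 256L * 1024 * 1024; // hypothetical threshold
        List<Long> memorySamples = List.of(300_000_000L, 310_000_000L, 295_000_000L);
        System.out.println("memory alert: " + sustainedAbove(memorySamples, memoryThresholdBytes));
        System.out.println("failures/sec: " + ratePerSecond(100, 160, 60_000));
    }
}
```

Requiring the whole window to exceed the threshold avoids alerting on short spikes; the window length and thresholds should be tuned to your own workload.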

Track Performance

Monitor performance improvements:

  • Compare end-to-end latency before and after enabling Intelligent Replication
  • Track PushPartitionsCount to see adoption across partitions
  • Monitor CPU usage reduction on brokers

Capacity Planning

Use metrics for capacity planning:

  • PushManagerMemoryBytesUsed for memory allocation
  • PushEventQueueProcessingTimeMs for processing capacity