Troubleshoot Intelligent Replication in Confluent Private Cloud

This topic describes common issues you might encounter with Intelligent Replication and how to resolve them.

Resolve high memory buffer usage in push replication

Symptoms
The PushManagerMemoryBytesUsed metric is consistently high.
Cause
This indicates that produce traffic is exceeding replication capacity or that followers are slow. Either condition can lead to memory pressure and degraded performance.
Solution
  1. Investigate high ingress rates, slow followers, or network/storage issues.
  2. Consider temporarily reducing produce rates or investigating follower performance.
  3. Monitor the metric over time to ensure it returns to normal levels.
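
One way to make "consistently high" concrete is to check what fraction of recent samples exceed a threshold, rather than reacting to a single spike. The following is a minimal sketch that assumes you have already collected readings of PushManagerMemoryBytesUsed from your monitoring system. The function name, the sample values, and the 800 MB threshold are all hypothetical and are not documented limits.

```python
from typing import Sequence


def is_consistently_high(
    samples: Sequence[float], threshold: float, min_fraction: float = 0.9
) -> bool:
    """Return True when at least min_fraction of the samples exceed threshold."""
    if not samples:
        return False
    above = sum(1 for value in samples if value > threshold)
    return above / len(samples) >= min_fraction


# Hypothetical PushManagerMemoryBytesUsed readings in bytes, one per minute.
readings = [900e6, 950e6, 920e6, 40e6, 980e6, 990e6, 970e6, 960e6, 940e6, 930e6]

# 800e6 is an illustrative threshold, not a documented limit.
print(is_consistently_high(readings, threshold=800e6))
```

Choose the threshold and window from the baseline you observe on your own brokers. The same check can be reused for the other gauge-style metrics in this topic.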

Resolve push replication sessions ending frequently

Symptoms
The PushSessionEndCount metric shows frequent session terminations.
Cause
This indicates push replication instability. The broker automatically transitions to use pull replication during such failures.
Solution
  1. Monitor the reason tags on the metric. Frequent REQUEST_NON_RETRIABLE_ERROR or LEADER_REPLICATION_ERROR reasons may indicate issues that require disabling push replication.
  2. Normal operational reasons like NEW_LEADER_EPOCH are expected during rolling restarts.
  3. If error rates are high, consider disabling Intelligent Replication until the underlying issues are resolved.
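
To separate expected session ends from error-driven ones, you can tally the reason tags. The sketch below assumes you have exported the reason tag of each session end as a list of strings. It classifies only the three reasons named in this topic; other reason tags may exist and are counted as unclassified. The function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Reason tags named in this topic. Other tags may exist.
ERROR_REASONS = {"REQUEST_NON_RETRIABLE_ERROR", "LEADER_REPLICATION_ERROR"}
EXPECTED_REASONS = {"NEW_LEADER_EPOCH"}


def summarize_session_ends(reasons: Iterable[str]) -> Dict[str, float]:
    """Count session ends by category and compute the share caused by errors."""
    counts = Counter(reasons)
    total = sum(counts.values())
    errors = sum(n for reason, n in counts.items() if reason in ERROR_REASONS)
    expected = sum(n for reason, n in counts.items() if reason in EXPECTED_REASONS)
    return {
        "total": total,
        "errors": errors,
        "expected": expected,
        "unclassified": total - errors - expected,
        "error_ratio": errors / total if total else 0.0,
    }


# Hypothetical reason tags observed over one monitoring window.
observed = (
    ["NEW_LEADER_EPOCH"] * 6
    + ["REQUEST_NON_RETRIABLE_ERROR"] * 3
    + ["LEADER_REPLICATION_ERROR"]
)
print(summarize_session_ends(observed))
```

A high error ratio outside of a rolling restart is the signal to investigate further. What counts as "high" depends on your cluster.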

Resolve push sessions stuck in stopping state

Symptoms
The StoppingPushSessionsCount metric is persistently high.
Cause
Followers may not be receiving session end notifications properly, potentially leading to Under-Replicated Partitions (URPs).
Solution
  1. Check network connectivity between leaders and followers, because followers may be missing session end notifications.
  2. Watch for URPs on the affected partitions while the metric remains high.
  3. If the issue persists, consider manual leadership changes or disabling Intelligent Replication.
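
A basic connectivity check can be sketched as a TCP connection attempt to each follower's listener. This is a minimal sketch, not a full diagnostic: it only confirms that a TCP connection can be opened from the machine where the script runs, so run it from the leader's host or pod to test the leader-to-follower path. The function names are hypothetical, and you must supply the host names and ports your brokers actually use for inter-broker traffic.

```python
import socket
from typing import Dict, Iterable, Tuple


def can_connect(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_brokers(
    brokers: Iterable[Tuple[str, int]], timeout_seconds: float = 3.0
) -> Dict[Tuple[str, int], bool]:
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {
        (host, port): can_connect(host, port, timeout_seconds)
        for host, port in brokers
    }
```

For example, calling check_brokers with a list such as [("follower-1.example.internal", 9092)] (a placeholder address) returns a dictionary that marks each follower as reachable or not. A successful connection does not prove that replication traffic is healthy; it only rules out basic reachability problems.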

Resolve high push replication event processing latency

Symptoms
The PushEventQueueProcessingTimeMs metric shows consistently high processing times.
Cause
This indicates performance bottlenecks in the push replication mechanism.
Solution
  1. Investigate broker resource constraints (CPU, memory).
  2. Check for network issues.
  3. Consider tuning push replication configurations:
    • confluent.intelligent.replication.push.max.threads
    • confluent.intelligent.replication.push.threads.per.remote.broker

Resolve push replication event processing failures

Symptoms
The PushEventProcessingFailure counter is frequently increasing.
Cause
This indicates system health issues in push replication event processing.
Solution
  1. Investigate broker logs for specific error details.
  2. Common causes include:
    • Resource exhaustion
    • Configuration issues
    • Internal state inconsistencies
  3. Consider temporarily disabling Intelligent Replication if failure rates are high and impacting system stability.
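
To judge whether a counter is "frequently increasing", it helps to turn raw counter readings into a rate. The sketch below assumes you have (timestamp in seconds, counter value) samples of PushEventProcessingFailure, and it assumes the counter restarts from zero when a broker restarts, which this topic does not state. The function names and sample values are hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def counter_increase(samples: Sequence[Sample]) -> float:
    """Sum the increases between consecutive samples.

    A drop in value is treated as a counter reset (an assumption), so the
    value after the drop is counted as new increase.
    """
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total


def increase_per_minute(samples: Sequence[Sample]) -> float:
    """Average counter increase per minute across the sampled window."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) * 60.0 / elapsed


# Hypothetical readings taken every 60 seconds, with a reset before the 4th.
failure_samples = [(0, 10), (60, 12), (120, 15), (180, 2), (240, 6)]
print(increase_per_minute(failure_samples))
```

The same calculation applies to other counters in this topic, such as PullTransitionsCount. Compare the rate against a baseline taken while the cluster is healthy.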

Resolve followers stuck in pull mode

Symptoms
The FollowersAwaitingPushTransition metric is persistently high.
Cause
This indicates that continuous incoming records may be preventing followers from ever fully catching up to transition to push mode.
Solution
  1. Investigate whether high ingress rates are constantly moving the log end offset (LEO) that followers must reach.
  2. You may need to temporarily reduce produce rates to allow followers to catch up.
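
The reasoning behind reducing produce rates can be shown with simple arithmetic: a follower closes its lag only at the difference between its replication rate and the incoming produce rate. The sketch below uses a deliberately simplified model with constant rates. The function name and all numbers are hypothetical, and real catch-up behavior also depends on batching, network, and disk performance.

```python
import math


def estimated_catch_up_seconds(
    lag_records: float, produce_rate: float, replication_rate: float
) -> float:
    """Estimate seconds until a follower reaches the leader's LEO.

    Rates are in records per second. Returns math.inf when the follower
    replicates no faster than new records arrive, because the lag never shrinks.
    """
    if lag_records <= 0:
        return 0.0
    closing_rate = replication_rate - produce_rate
    if closing_rate <= 0:
        return math.inf
    return lag_records / closing_rate


# Hypothetical follower that is 1,000,000 records behind and replicates
# 60,000 records per second.
for produce_rate in (60_000, 50_000, 20_000):
    seconds = estimated_catch_up_seconds(1_000_000, produce_rate, 60_000)
    print(produce_rate, seconds)
```

In this model, a follower that replicates at the same rate as producers write never catches up, while lowering the produce rate shortens the catch-up time sharply.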

Resolve low number of partitions using push replication

Symptoms
The PushPartitionsCount metric is unexpectedly low after enabling push replication.
Cause
This indicates that partitions aren’t transitioning to push mode as expected.
Solution
  1. Use this metric to confirm push replication is enabled and functioning.
  2. If the value is unexpectedly low, investigate why partitions aren’t transitioning to push mode. Possible causes include ISR issues or configuration problems.
  3. Check that Intelligent Replication is properly enabled and that cluster conditions allow for push replication transitions.

Resolve frequent transitions from push to pull replication

Symptoms
The PullTransitionsCount metric is frequently increasing.
Cause
This indicates push replication instability: followers are repeatedly falling back to pull replication.
Solution
  1. Investigate underlying causes:
    • Network instability
    • Follower performance issues
    • ISR membership problems
  2. High transition rates may indicate push replication should be disabled until issues are resolved.
  3. Monitor network connectivity and follower broker health.

Understand replication sessions

Intelligent Replication uses replication session IDs to coordinate transitions between push and pull modes. Understanding these sessions can help with troubleshooting:

Session Lifecycle
  • Each partition replica maintains a replication session ID.
  • Sessions start in pull mode and can transition to push mode.
  • When issues occur, sessions transition back to pull mode with a new session ID.
Session State Indicators
  • Use the kafka-replica-status CLI tool to see current replication modes.
  • Monitor PushSessionEndCount for session transition frequency.
  • Look for session-related messages in broker logs.
Common Session Issues
  • Frequent session transitions: May indicate network instability or follower performance issues.
  • Stuck sessions: Check StoppingPushSessionsCount for sessions that aren’t properly ending.
  • Session ID mismatches: Usually resolved automatically but may indicate timing issues.
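
The lifecycle described above can be pictured with a toy state model. This is an illustration only, not the broker's implementation: the class and method names are hypothetical, the integer counter used for session IDs is an assumption (this topic does not describe how IDs are generated), and keeping the same ID when moving from pull to push is also an assumption.

```python
from enum import Enum


class Mode(Enum):
    PULL = "pull"
    PUSH = "push"


class ReplicaSessionModel:
    """Toy model of one partition replica's replication session."""

    def __init__(self) -> None:
        self.session_id = 1    # hypothetical ID scheme: a simple counter
        self.mode = Mode.PULL  # sessions start in pull mode
        self.end_reasons = []  # reasons recorded when a push session ends

    def follower_caught_up(self) -> None:
        """Move from pull to push mode once the follower has caught up."""
        if self.mode is Mode.PULL:
            self.mode = Mode.PUSH

    def push_session_ended(self, reason: str) -> None:
        """Fall back to pull mode under a new session ID."""
        if self.mode is Mode.PUSH:
            self.mode = Mode.PULL
            self.session_id += 1
            self.end_reasons.append(reason)

    def matches(self, message_session_id: int) -> bool:
        """Return True if a message carries the current session ID."""
        return message_session_id == self.session_id


session = ReplicaSessionModel()
session.follower_caught_up()                    # pull -> push
stale_id = session.session_id
session.push_session_ended("NEW_LEADER_EPOCH")  # push -> pull, new session ID
print(session.mode, session.session_id, session.matches(stale_id))
```

In this model, a message that still carries the previous session ID no longer matches after the fallback. This is one way to picture a session ID mismatch that clears once both sides agree on the new session.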

Follow general troubleshooting steps

Check Broker Logs
Examine broker logs for specific error messages related to Intelligent Replication, particularly messages about session transitions and replication mode changes.
Verify Configuration
Ensure that confluent.intelligent.replication.enable=true is set and that brokers have been restarted after the change.
Monitor System Resources
Check CPU, memory, and network utilization on broker nodes. Push replication should reduce CPU usage compared to pull replication.
Test Network Connectivity
Verify network connectivity between leader and follower brokers. Push replication requires reliable network communication from leaders to followers.
Check ISR Membership
Ensure replicas are properly joining the In-Sync Replica (ISR) set, as this is required for transition to push mode.
Gradual Rollback
If issues persist, consider temporarily disabling Intelligent Replication by setting confluent.intelligent.replication.enable=false and restarting brokers.
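
Both the Verify Configuration and Gradual Rollback steps depend on the value of confluent.intelligent.replication.enable. The sketch below checks that value in the text of a broker properties file. It assumes the broker configuration is available as a Java-style properties file, which may not match how your deployment manages configuration. The parser is simplified (it handles only key=value lines and comments), the function names are hypothetical, and a missing key is treated as not enabled because this topic does not state the default.

```python
from typing import Dict

ENABLE_KEY = "confluent.intelligent.replication.enable"


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and comments."""
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator:
            properties[key.strip()] = value.strip()
    return properties


def intelligent_replication_enabled(properties_text: str) -> bool:
    """Return True only when the enable key is explicitly set to true."""
    value = parse_properties(properties_text).get(ENABLE_KEY, "")
    return value.lower() == "true"


# Hypothetical broker configuration content.
example = """
# Broker settings
broker.id=1
confluent.intelligent.replication.enable = true
"""
print(intelligent_replication_enabled(example))
```

Run the check against the configuration of every broker, because a setting that differs between brokers is easy to miss during a rolling change.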