Troubleshoot Intelligent Replication in Confluent Private Cloud
This topic describes common issues you might encounter with Intelligent Replication and how to resolve them.
Resolve high memory buffer usage in push replication
- Symptoms
- The PushManagerMemoryBytesUsed metric is consistently high.
- Cause
- This indicates that produce traffic is overwhelming replication capacity or followers are slow. This can lead to memory pressure and degraded performance.
- Solution
- Investigate high ingress rates, slow followers, or network/storage issues.
- Consider temporarily reducing produce rates or investigating follower performance.
- Monitor the metric over time to ensure it returns to normal levels.
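One way to tell a sustained problem from a short spike is to check what fraction of recent samples exceed a threshold. The following is a minimal sketch, not a Confluent tool: the function name, the threshold, and the 90% cutoff are all assumptions, and it expects you to have already collected PushManagerMemoryBytesUsed readings from your monitoring system.

```python
from typing import Sequence


def is_consistently_high(samples: Sequence[float], threshold: float,
                         min_fraction: float = 0.9) -> bool:
    """Return True when at least `min_fraction` of samples exceed `threshold`."""
    if not samples:
        return False
    above = sum(1 for value in samples if value > threshold)
    return above / len(samples) >= min_fraction


# Hypothetical readings in bytes, one per minute; the threshold is illustrative.
readings = [900e6, 950e6, 920e6, 980e6, 940e6]
print(is_consistently_high(readings, threshold=800e6))
```

Choose the threshold from your own baseline, because this topic does not define what counts as a normal level.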
Resolve push replication sessions ending frequently
- Symptoms
- The PushSessionEndCount metric shows frequent session terminations.
- Cause
- This indicates push replication instability. The broker automatically transitions to use pull replication during such failures.
- Solution
- Monitor reason tags: frequent REQUEST_NON_RETRIABLE_ERROR or LEADER_REPLICATION_ERROR reasons may indicate issues that require disabling push replication.
- Normal operational reasons like NEW_LEADER_EPOCH are expected during rolls.
- If error rates are high, consider disabling Intelligent Replication until underlying issues are resolved.
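To separate expected session ends from error-driven ones, you can group the reason tags. This sketch assumes you have exported the reason tag of each PushSessionEndCount event as a list of strings. The three reason names come from this topic; the grouping into "error" and "expected" buckets and the function name are assumptions.

```python
from collections import Counter

# Reason names are taken from this topic; the grouping is an assumption.
ERROR_REASONS = {"REQUEST_NON_RETRIABLE_ERROR", "LEADER_REPLICATION_ERROR"}
EXPECTED_REASONS = {"NEW_LEADER_EPOCH"}


def summarize_session_ends(reasons):
    """Bucket session-end reasons into error, expected, and other counts."""
    summary = {"error": 0, "expected": 0, "other": 0}
    for reason, count in Counter(reasons).items():
        if reason in ERROR_REASONS:
            summary["error"] += count
        elif reason in EXPECTED_REASONS:
            summary["expected"] += count
        else:
            summary["other"] += count
    return summary
```

A high "error" count relative to "expected" is the pattern the solution steps above describe as a reason to consider disabling push replication.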
Resolve push sessions stuck in stopping state
- Symptoms
- The StoppingPushSessionsCount metric is persistently high.
- Cause
- Followers may not be receiving session end notifications properly, potentially leading to Under-Replicated Partitions (URPs).
- Solution
- If persistently high, followers may not be receiving session end notifications properly, potentially leading to URPs.
- Consider manual leadership changes or disabling Intelligent Replication if the issue persists.
- Check network connectivity between leaders and followers.
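A basic TCP reachability check can rule out the simplest connectivity failures. The sketch below uses only the Python standard library. The function name is hypothetical, and the host and port in the comment are placeholders for your own broker listeners. Run it from a leader broker host against each follower, because connectivity from your workstation proves nothing about the path between brokers.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call with placeholder values:
# can_connect("follower-broker.example.internal", 9092)
```

A successful connection only shows that the port is reachable. It does not validate TLS, authentication, or sustained throughput.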
Resolve high push replication event processing latency
- Symptoms
- The PushEventQueueProcessingTimeMs metric shows consistently high processing times.
- Cause
- This indicates performance bottlenecks in the push replication mechanism.
- Solution
- If processing times are consistently high, investigate broker resource constraints (CPU, memory).
- Check for network issues.
- Consider tuning push replication configurations:
- confluent.intelligent.replication.push.max.threads
- confluent.intelligent.replication.push.threads.per.remote.broker
Resolve push replication event processing failures
- Symptoms
- The PushEventProcessingFailure counter is frequently increasing.
- Cause
- This indicates system health issues in the push replication event processing.
- Solution
- If this counter is frequently increasing, investigate broker logs for specific error details.
- Common causes include:
- Resource exhaustion
- Configuration issues
- Internal state inconsistencies
- Consider temporarily disabling Intelligent Replication if failure rates are high and impacting system stability.
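Because PushEventProcessingFailure is a counter, its rate of increase is more informative than its absolute value. The following sketch computes an increase-per-minute figure from timestamped samples. The function name and the sample format are assumptions, and treating a drop in the value as a counter reset (for example, after a broker restart) is a common convention rather than documented behavior of this metric.

```python
def counter_rate_per_minute(samples):
    """Compute the counter's increase per minute over the sampled window.

    samples: list of (timestamp_seconds, counter_value), ordered by time.
    A drop in the value is treated as a reset and counted from zero.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase * 60 / elapsed if elapsed > 0 else 0.0
```

For example, samples of 10, 13, and 19 taken at 0, 60, and 120 seconds give an increase of 9 over two minutes, or 4.5 failures per minute.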
Resolve followers stuck in pull mode
- Symptoms
- The FollowersAwaitingPushTransition metric is persistently high.
- Cause
- This indicates that continuous incoming records may be preventing followers from ever fully catching up to transition to push mode.
- Solution
- If this metric is persistently high, investigate whether high ingress rates are constantly moving the log end offset (LEO) target.
- You may need to temporarily reduce produce rates to allow followers to catch up.
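The reasoning behind reducing produce rates can be made concrete with a simplified model: a follower closes its lag at the difference between its fetch rate and the leader's ingress rate. The sketch below is illustrative arithmetic only. The function and parameter names are assumptions, and the broker's actual criteria for transitioning to push mode are not described in this topic.

```python
import math


def catch_up_seconds(lag_bytes: float, ingress_bps: float, fetch_bps: float) -> float:
    """Estimate seconds until a follower reaches the leader's LEO.

    Simplified model: lag shrinks at (fetch_bps - ingress_bps). Returns
    math.inf when the follower fetches no faster than producers write.
    """
    closing_rate = fetch_bps - ingress_bps
    if closing_rate <= 0:
        return math.inf
    return lag_bytes / closing_rate


# Hypothetical numbers: 600 MB of lag, 50 MB/s ingress, 60 MB/s fetch.
print(catch_up_seconds(600e6, 50e6, 60e6))
```

With these hypothetical numbers the follower gains 10 MB/s on the leader and needs 60 seconds to catch up. If ingress matched or exceeded the fetch rate, the estimate would be infinite, which corresponds to a follower that never leaves pull mode.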
Resolve low number of partitions using push replication
- Symptoms
- The PushPartitionsCount metric is unexpectedly low after enabling push replication.
- Cause
- This indicates that partitions aren’t transitioning to push mode as expected.
- Solution
- Use this metric to confirm push replication is enabled and functioning.
- If unexpectedly low after enabling push replication, investigate why partitions aren’t transitioning to push mode (may indicate ISR issues or configuration problems).
- Check that Intelligent Replication is properly enabled and that cluster conditions allow for push replication transitions.
Resolve frequent transitions from push to pull replication
- Symptoms
- The PullTransitionsCount metric is frequently increasing.
- Cause
- This indicates that push replication is unstable and followers are falling back to pull replication.
- Solution
- If frequently increasing, investigate underlying causes:
- Network instability
- Follower performance issues
- ISR membership problems
- High transition rates may indicate push replication should be disabled until issues are resolved.
- Monitor network connectivity and follower broker health.
Understanding replication sessions
Intelligent Replication uses replication session IDs to coordinate transitions between push and pull modes. Understanding these sessions can help with troubleshooting:
- Session Lifecycle
- Each partition replica maintains a replication session ID.
- Sessions start in pull mode and can transition to push mode.
- When issues occur, sessions transition back to pull mode with a new session ID.
- Session State Indicators
- Check the kafka-replica-status CLI tool to see current replication modes.
- Monitor PushSessionEndCount for session transition frequency.
- Look for session-related messages in broker logs.
- Common Session Issues
- Frequent session transitions: May indicate network instability or follower performance issues.
- Stuck sessions: Check StoppingPushSessionsCount for sessions that aren’t properly ending.
- Session ID mismatches: Usually resolved automatically but may indicate timing issues.
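The session lifecycle described above can be pictured as a small state model. This is an illustrative sketch for reasoning about logs and metrics, not the broker's implementation. The class, its method names, and the integer ID source are hypothetical. This topic does not say whether the session ID changes on the pull-to-push transition, so the sketch assumes it stays the same.

```python
import itertools
from dataclasses import dataclass, field

_session_ids = itertools.count(1)  # Hypothetical ID source for the sketch.


@dataclass
class ReplicaSession:
    mode: str = "pull"  # Sessions start in pull mode.
    session_id: int = field(default_factory=lambda: next(_session_ids))

    def promote_to_push(self) -> None:
        """Switch to push mode. Keeping the same ID is an assumption."""
        self.mode = "push"

    def fall_back_to_pull(self) -> None:
        """An issue occurred: return to pull mode under a new session ID."""
        self.mode = "pull"
        self.session_id = next(_session_ids)
```

In this model, a partition replica that flaps between modes receives a new session ID on every fallback, which is why frequent session ID changes in broker logs tend to accompany a rising PushSessionEndCount.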
Follow general troubleshooting steps
- Check Broker Logs
- Examine broker logs for specific error messages related to Intelligent Replication, particularly messages about session transitions and replication mode changes.
- Verify Configuration
- Ensure that confluent.intelligent.replication.enable=true is properly set and that brokers have been restarted.
- Monitor System Resources
- Check CPU, memory, and network utilization on broker nodes. Push replication should reduce CPU usage compared to pull replication.
- Test Network Connectivity
- Verify network connectivity between leader and follower brokers. Push replication requires reliable network communication from leaders to followers.
- Check ISR Membership
- Ensure replicas are properly joining the In-Sync Replica (ISR) set, as this is required for transition to push mode.
- Gradual Rollback
- If issues persist, consider temporarily disabling Intelligent Replication by setting confluent.intelligent.replication.enable=false and restarting brokers.
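For the Verify Configuration step, a quick script can confirm what a broker properties file actually contains before and after a rollback. This is a minimal sketch with hypothetical function names. It handles only simple key=value lines, not the full Java properties syntax such as line continuations. It treats a missing key as "not explicitly enabled", because the default value is not stated in this topic.

```python
def read_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def intelligent_replication_enabled(text: str) -> bool:
    """Return True only if the enable property is explicitly set to true."""
    props = read_properties(text)
    value = props.get("confluent.intelligent.replication.enable", "")
    return value.lower() == "true"
```

Pass the contents of each broker's properties file to intelligent_replication_enabled. The result reflects the file only, so a broker that has not been restarted since the change may still be running with the previous setting.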