Troubleshoot Intelligent Replication in Confluent Private Cloud
This topic describes common issues you might encounter with Intelligent Replication and how to resolve them.
Resolve high memory buffer usage in push replication
- Symptoms
The PushManagerMemoryBytesUsed metric is consistently high.
- Cause
This indicates that produce traffic is overwhelming replication capacity or that followers are slow. Either condition can lead to memory pressure and degraded performance.
- Solution
Investigate high ingress rates, slow followers, or network/storage issues.
Consider temporarily reducing produce rates or investigating follower performance.
Monitor the metric over time to ensure it returns to normal levels.
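What counts as "consistently high" depends on the memory available to your brokers. As a minimal sketch, assuming you already collect PushManagerMemoryBytesUsed samples in a monitoring system, the following check flags the metric only when most samples in a window exceed a threshold. The function name, the threshold, and the sample values are hypothetical and are not product defaults.

```python
from typing import Sequence


def is_consistently_high(
    samples: Sequence[float],
    threshold_bytes: float,
    min_fraction: float = 0.8,
) -> bool:
    """Return True when at least min_fraction of samples exceed threshold_bytes.

    A single spike does not trip the check; a sustained plateau does.
    """
    if not samples:
        return False
    above = sum(1 for value in samples if value > threshold_bytes)
    return above / len(samples) >= min_fraction


# Hypothetical samples of PushManagerMemoryBytesUsed (bytes), one per scrape.
spiky = [10e6, 12e6, 90e6, 11e6, 10e6]
sustained = [85e6, 90e6, 88e6, 92e6, 40e6]
threshold = 64e6  # assumed alert threshold, not a product default

print(is_consistently_high(spiky, threshold))
print(is_consistently_high(sustained, threshold))
```

Requiring a fraction of the window, rather than a single sample, keeps short bursts of produce traffic from being mistaken for sustained pressure.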
Resolve push replication sessions ending frequently
- Symptoms
The PushSessionEndCount metric shows frequent session terminations.
- Cause
This indicates push replication instability. The broker automatically falls back to pull replication during such failures.
- Solution
Monitor the reason tags on the metric. Frequent REQUEST_NON_RETRIABLE_ERROR or LEADER_REPLICATION_ERROR reasons may indicate issues that require disabling push replication.
Normal operational reasons like NEW_LEADER_EPOCH are expected during rolls.
If error rates are high, consider disabling Intelligent Replication until the underlying issues are resolved.
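To separate expected session ends from concerning ones, you can tally the reason tags. The sketch below is illustrative only: it assumes you have already extracted the reason tag of each PushSessionEndCount increment into a list of strings, and it classifies only the three reasons named in this topic. The function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Reason tags named in this topic. Other tags may exist; this sketch reports
# any unlisted reason as "unclassified" rather than guessing its meaning.
EXPECTED_REASONS = {"NEW_LEADER_EPOCH"}
CONCERNING_REASONS = {"REQUEST_NON_RETRIABLE_ERROR", "LEADER_REPLICATION_ERROR"}


def summarize_session_ends(reasons: Iterable[str]) -> Dict[str, int]:
    """Count session ends by category: expected, concerning, unclassified."""
    summary = {"expected": 0, "concerning": 0, "unclassified": 0}
    for reason, count in Counter(reasons).items():
        if reason in EXPECTED_REASONS:
            summary["expected"] += count
        elif reason in CONCERNING_REASONS:
            summary["concerning"] += count
        else:
            summary["unclassified"] += count
    return summary


# Hypothetical observations collected during a rolling restart.
observed = [
    "NEW_LEADER_EPOCH",
    "NEW_LEADER_EPOCH",
    "LEADER_REPLICATION_ERROR",
    "REQUEST_NON_RETRIABLE_ERROR",
    "NEW_LEADER_EPOCH",
]
print(summarize_session_ends(observed))
```

A high "concerning" count relative to "expected" is the signal to investigate further; a burst of expected reasons during a roll is normal.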
Resolve push sessions stuck in stopping state
- Symptoms
The StoppingPushSessionsCount metric is persistently high.
- Cause
Followers may not be receiving session end notifications properly, potentially leading to Under-Replicated Partitions (URPs).
- Solution
Check network connectivity between leaders and followers, because followers that miss session end notifications can become under-replicated.
If the issue persists, consider manual leadership changes or disabling Intelligent Replication.
Resolve high push replication event processing latency
- Symptoms
The PushEventQueueProcessingTimeMs metric shows consistently high processing times.
- Cause
This indicates performance bottlenecks in the push replication mechanism.
- Solution
If processing times are consistently high, investigate broker resource constraints (CPU, memory).
Check for network issues.
Consider tuning push replication configurations:
confluent.intelligent.replication.push.max.threads
confluent.intelligent.replication.push.threads.per.remote.broker
Resolve push replication event processing failures
- Symptoms
The PushEventProcessingFailure counter is frequently increasing.
- Cause
This indicates system health issues in the push replication event processing.
- Solution
If this counter is frequently increasing, investigate broker logs for specific error details.
Common causes include:
Resource exhaustion
Configuration issues
Internal state inconsistencies
Consider temporarily disabling Intelligent Replication if failure rates are high and are impacting system stability.
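To judge whether a counter is "frequently increasing", it helps to convert raw counter samples into a rate. The following is a minimal sketch, assuming you have (timestamp, value) samples of PushEventProcessingFailure from your monitoring system. It treats a drop in the value as a counter reset, for example after a broker restart. The function name and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def increase_per_minute(samples: Sequence[Tuple[float, float]]) -> float:
    """Average counter increase per minute.

    samples holds (timestamp_seconds, counter_value) pairs ordered by time.
    When the value drops, the counter is assumed to have restarted from zero,
    so the post-reset value is counted as new increase.
    """
    if len(samples) < 2:
        return 0.0
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return total * 60.0 / elapsed


# Hypothetical samples taken one minute apart; the drop to 2 is a reset.
failures = [(0, 100), (60, 103), (120, 2), (180, 5)]
steady = [(0, 5), (60, 5), (120, 5)]

print(increase_per_minute(failures))
print(increase_per_minute(steady))
```

The same calculation applies to any monotonically increasing counter, so it can be reused for other counters in this topic.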
Resolve followers stuck in pull mode
- Symptoms
The FollowersAwaitingPushTransition metric is persistently high.
- Cause
This indicates that continuous incoming records may be preventing followers from ever fully catching up to transition to push mode.
- Solution
If this metric is persistently high, investigate whether high ingress rates are constantly moving the log end offset (LEO) that followers must reach before they can transition.
You may need to temporarily reduce produce rates to allow followers to catch up.
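The reasoning above can be illustrated with a simplified model: a follower closes its lag only at the difference between its replication rate and the produce rate. The sketch below assumes constant rates and is not a description of how the broker decides to transition. All names and numbers are hypothetical.

```python
from typing import Optional


def seconds_to_catch_up(
    lag_bytes: float,
    replication_bytes_per_sec: float,
    produce_bytes_per_sec: float,
) -> Optional[float]:
    """Estimate how long a follower needs to reach the log end offset.

    Returns None when the follower never catches up under this model,
    which happens when produce traffic moves the target at least as fast
    as the follower replicates.
    """
    if lag_bytes <= 0:
        return 0.0
    closing_rate = replication_bytes_per_sec - produce_bytes_per_sec
    if closing_rate <= 0:
        return None
    return lag_bytes / closing_rate


lag = 500_000_000  # hypothetical follower lag: 500 MB

# Produce rate equals replication rate: the target moves as fast as the follower.
print(seconds_to_catch_up(lag, 100_000_000, 100_000_000))

# Produce rate reduced to 75 MB/s: the follower closes the gap at 25 MB/s.
print(seconds_to_catch_up(lag, 100_000_000, 75_000_000))
```

In this model, even a modest reduction in produce rate turns an unbounded wait into a finite one, which is why a temporary reduction can unblock the transition.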
Resolve low number of partitions using push replication
- Symptoms
The PushPartitionsCount metric is unexpectedly low after enabling push replication.
- Cause
This indicates that partitions aren’t transitioning to push mode as expected.
- Solution
Use this metric to confirm push replication is enabled and functioning.
If the count is unexpectedly low after you enable push replication, investigate why partitions aren’t transitioning to push mode. This may indicate ISR issues or configuration problems.
Check that Intelligent Replication is properly enabled and that cluster conditions allow for push replication transitions.
Resolve frequent transitions from push to pull replication
- Symptoms
The PullTransitionsCount metric is frequently increasing.
- Cause
This indicates push replication instability: followers are falling back to pull replication.
- Solution
If frequently increasing, investigate underlying causes:
Network instability
Follower performance issues
ISR membership problems
High transition rates may indicate push replication should be disabled until issues are resolved.
Monitor network connectivity and follower broker health.
Understanding replication sessions
Intelligent Replication uses replication session IDs to coordinate transitions between push and pull modes. Understanding these sessions can help with troubleshooting:
- Session Lifecycle
Each partition replica maintains a replication session ID.
Sessions start in pull mode and can transition to push mode.
When issues occur, sessions transition back to pull mode with a new session ID.
- Session State Indicators
Check the kafka-replica-status CLI tool to see current replication modes.
Monitor PushSessionEndCount for session transition frequency.
Look for session-related messages in broker logs.
- Common Session Issues
Frequent session transitions: May indicate network instability or follower performance issues.
Stuck sessions: Check StoppingPushSessionsCount for sessions that aren’t properly ending.
Session ID mismatches: Usually resolved automatically but may indicate timing issues.
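The session lifecycle described in this section can be pictured as a small state machine. The sketch below is a conceptual model only and does not reflect the broker's implementation: the class names, the integer session IDs, and the choice to keep the same ID when moving from pull to push are all assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    PULL = "pull"
    PUSH = "push"


# Hypothetical ID source; real session IDs are assigned by the brokers.
_session_ids = itertools.count(1)


@dataclass
class ReplicaSession:
    """One partition replica's replication session."""

    mode: Mode = Mode.PULL  # sessions start in pull mode
    session_id: int = field(default_factory=lambda: next(_session_ids))

    def promote_to_push(self) -> None:
        """Move from pull to push; assumed here to keep the session ID."""
        if self.mode is Mode.PULL:
            self.mode = Mode.PUSH

    def fall_back_to_pull(self) -> None:
        """Return to pull mode with a new session ID after an issue."""
        if self.mode is Mode.PUSH:
            self.mode = Mode.PULL
            self.session_id = next(_session_ids)


session = ReplicaSession()
first_id = session.session_id
session.promote_to_push()
session.fall_back_to_pull()
print(session.mode, session.session_id != first_id)
```

Seen this way, a frequently changing session ID for the same replica is a direct sign of repeated fallbacks to pull mode.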
Follow general troubleshooting steps
- Check Broker Logs
Examine broker logs for specific error messages related to Intelligent Replication, particularly messages about session transitions and replication mode changes.
- Verify Configuration
Ensure that confluent.intelligent.replication.enable=true is properly set and that brokers have been restarted.
- Monitor System Resources
Check CPU, memory, and network utilization on broker nodes. Push replication should reduce CPU usage compared to pull replication.
- Test Network Connectivity
Verify network connectivity between leader and follower brokers. Push replication requires reliable network communication from leaders to followers.
- Check ISR Membership
Ensure replicas are properly joining the In-Sync Replica (ISR) set, as this is required for transition to push mode.
- Gradual Rollback
If issues persist, consider temporarily disabling Intelligent Replication by setting confluent.intelligent.replication.enable=false and restarting brokers.
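For the configuration check above, a quick script can confirm what a broker's properties file actually contains. This is a minimal sketch that assumes the setting lives in a plain key=value properties file. It handles only simple key=value lines and comments, not the full Java properties syntax, and it returns None when the key is absent so that it makes no claim about the default. The function names and the sample text are hypothetical.

```python
from typing import Dict, Optional

ENABLE_KEY = "confluent.intelligent.replication.enable"


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def configured_state(text: str) -> Optional[bool]:
    """Return True or False if the key is set, or None if it is absent."""
    value = parse_properties(text).get(ENABLE_KEY)
    if value is None:
        return None
    return value.lower() == "true"


# Hypothetical excerpt of a broker properties file.
sample = """
# broker settings
broker.id=1
confluent.intelligent.replication.enable = true
"""

print(configured_state(sample))
```

The script only inspects the file contents. It cannot tell whether the brokers were restarted after the change, so that step still needs to be verified separately.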