Troubleshoot Intelligent Replication in Confluent Private Cloud

This topic describes common issues you might encounter with Intelligent Replication and how to resolve them.

Resolve high memory buffer usage in push replication

Symptoms

The PushManagerMemoryBytesUsed metric is consistently high.

Cause

Sustained high buffer usage indicates that produce traffic is overwhelming replication capacity or that followers are slow. Either condition can lead to memory pressure and degraded performance.

Solution

Investigate high ingress rates, slow followers, or network/storage issues.
Consider temporarily reducing produce rates or investigating follower performance.
Monitor the metric over time to ensure it returns to normal levels.
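As a rough illustration of what "consistently high" and "returns to normal levels" could mean in practice, the following sketch evaluates a series of sampled PushManagerMemoryBytesUsed values. The helper names, the threshold, and the 90% fraction are assumptions chosen for illustration; this topic does not define specific limits, and how you collect the samples depends on your monitoring setup.

```python
from typing import Sequence


def is_consistently_high(
    samples: Sequence[float], threshold: float, min_fraction: float = 0.9
) -> bool:
    """Return True when at least min_fraction of the samples exceed threshold.

    threshold and min_fraction are illustrative assumptions, not documented limits.
    """
    if not samples:
        return False
    above = sum(1 for value in samples if value > threshold)
    return above / len(samples) >= min_fraction


def has_recovered(samples: Sequence[float], threshold: float, tail: int = 3) -> bool:
    """Return True when the last `tail` samples are all at or below threshold."""
    if len(samples) < tail:
        return False
    return all(value <= threshold for value in samples[-tail:])
```

A short spike fails the first check, while a sustained plateau passes it. The second check is a simple way to confirm that the metric has settled after you reduce produce rates or fix a slow follower.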

Resolve push replication sessions ending frequently

Symptoms

The PushSessionEndCount metric shows frequent session terminations.

Cause

Frequent session terminations indicate push replication instability. When a push session fails, the broker automatically falls back to pull replication.

Solution

Monitor the reason tags on the metric. Frequent REQUEST_NON_RETRIABLE_ERROR or LEADER_REPLICATION_ERROR reasons may indicate issues that require disabling push replication.
Normal operational reasons, such as NEW_LEADER_EPOCH, are expected during broker rolls.
If error rates are high, consider disabling Intelligent Replication until the underlying issues are resolved.
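One way to separate expected session ends from error-driven ones is to tally the reason tags you observe. The sketch below assumes you have already collected the reason tag of each session end as a string; the function name and the returned fields are hypothetical. Only the three tags named in this topic are classified, and anything else is counted as unclassified rather than guessed at.

```python
from collections import Counter
from typing import Dict, Iterable

# Reason tags named in this topic. Other tags may exist; they are
# reported as "unclassified" because this topic does not describe them.
ERROR_REASONS = {"REQUEST_NON_RETRIABLE_ERROR", "LEADER_REPLICATION_ERROR"}
EXPECTED_REASONS = {"NEW_LEADER_EPOCH"}


def summarize_session_ends(reasons: Iterable[str]) -> Dict[str, float]:
    """Count session ends by category and compute the share caused by errors."""
    counts = Counter(reasons)
    total = sum(counts.values())
    error = sum(n for tag, n in counts.items() if tag in ERROR_REASONS)
    expected = sum(n for tag, n in counts.items() if tag in EXPECTED_REASONS)
    return {
        "total": total,
        "error": error,
        "expected": expected,
        "unclassified": total - error - expected,
        "error_ratio": error / total if total else 0.0,
    }
```

A high error ratio outside of broker rolls is the signal to investigate further; a high total made up mostly of NEW_LEADER_EPOCH during a roll is expected.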

Resolve push sessions stuck in stopping state

Symptoms

The StoppingPushSessionsCount metric is persistently high.

Cause

Followers may not be receiving session end notifications properly, potentially leading to Under-Replicated Partitions (URPs).

Solution

Check network connectivity between leaders and followers.
If the issue persists, consider manual leadership changes or disabling Intelligent Replication.

Resolve high push replication event processing latency

Symptoms

The PushEventQueueProcessingTimeMs metric shows consistently high processing times.

Cause

This indicates performance bottlenecks in the push replication mechanism.

Solution

If processing times are consistently high, investigate broker resource constraints (CPU, memory).
Check for network issues.
Consider tuning push replication configurations:
- confluent.intelligent.replication.push.max.threads
- confluent.intelligent.replication.push.threads.per.remote.broker

Resolve push replication event processing failures

Symptoms

The PushEventProcessingFailure counter is frequently increasing.

Cause

This indicates system health issues in the push replication event processing.

Solution

If this counter is frequently increasing, investigate broker logs for specific error details.
Common causes include:
- Resource exhaustion
- Configuration issues
- Internal state inconsistencies
Consider temporarily disabling Intelligent Replication if failure rates are high and are impacting system stability.
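To judge whether the counter is "frequently increasing", it can help to convert two or more readings into a rate. The following sketch assumes you have (timestamp, value) samples of PushEventProcessingFailure from your monitoring system; the function name is hypothetical, and what counts as a high rate is left to you because this topic does not give a threshold.

```python
from typing import Sequence, Tuple


def counter_rate_per_minute(samples: Sequence[Tuple[float, float]]) -> float:
    """Average increase per minute between the first and last sample.

    samples: (timestamp_seconds, counter_value) pairs in time order.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return 0.0
    # Assumption: a value that went backwards means the counter was reset
    # (for example, by a broker restart), so it is treated as no increase.
    increase = max(v1 - v0, 0.0)
    return increase * 60.0 / elapsed
```

For example, readings of 10 at time 0 and 16 at 120 seconds give a rate of 3 failures per minute.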

Resolve followers stuck in pull mode

Symptoms

The FollowersAwaitingPushTransition metric is persistently high.

Cause

This indicates that continuous incoming records may be preventing followers from ever fully catching up to transition to push mode.

Solution

If this metric is persistently high, investigate whether high ingress rates are constantly moving the log end offset (LEO) that followers must reach.
You may need to temporarily reduce produce rates to allow followers to catch up.
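The moving-target effect can be illustrated with simple arithmetic. The sketch below is a simplified model and not the broker's actual transition logic: it assumes the leader's LEO advances at a constant produce rate while the follower replicates at a constant rate, both in records per second. The function and parameter names are hypothetical.

```python
import math


def seconds_to_catch_up(
    lag_records: float, produce_rate: float, replicate_rate: float
) -> float:
    """Estimate how long a follower needs to reach the leader's LEO.

    Simplified model: the LEO moves forward at produce_rate while the
    follower moves forward at replicate_rate. Returns math.inf when the
    follower cannot close the gap.
    """
    if lag_records <= 0:
        return 0.0
    closing_rate = replicate_rate - produce_rate
    if closing_rate <= 0:
        return math.inf
    return lag_records / closing_rate
```

Under this model, a follower that is 10,000 records behind and replicates 10,000 records per second needs 10 seconds to catch up when producers write 9,000 records per second, but only 2 seconds when they write 5,000. When the produce rate matches or exceeds the replication rate, the follower never catches up, which is why temporarily reducing produce rates can help.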

Resolve low number of partitions using push replication

Symptoms

The PushPartitionsCount metric is unexpectedly low after enabling push replication.

Cause

This indicates that partitions aren’t transitioning to push mode as expected.

Solution

Use this metric to confirm push replication is enabled and functioning.
If the value is unexpectedly low after enabling push replication, investigate why partitions aren’t transitioning to push mode. This may indicate ISR issues or configuration problems.
Check that Intelligent Replication is properly enabled and that cluster conditions allow for push replication transitions.

Resolve frequent transitions from push to pull replication

Symptoms

The PullTransitionsCount metric is frequently increasing.

Cause

This indicates that push replication is unstable and that followers are falling back to pull replication.

Solution

If the metric is frequently increasing, investigate the underlying causes:
- Network instability
- Follower performance issues
- ISR membership problems
High transition rates may indicate push replication should be disabled until issues are resolved.
Monitor network connectivity and follower broker health.

Understanding replication sessions

Intelligent Replication uses replication session IDs to coordinate transitions between push and pull modes. Understanding these sessions can help with troubleshooting.

Session Lifecycle

Each partition replica maintains a replication session ID.
Sessions start in pull mode and can transition to push mode.
When issues occur, sessions transition back to pull mode with a new session ID.
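The lifecycle above can be pictured as a small state model. This is a toy sketch for building intuition only, not the broker's implementation. It assumes session IDs can be modeled as increasing integers, and it keeps the ID unchanged on the transition to push mode because this topic only states that a new ID is assigned on the fallback to pull mode. All class and method names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    PULL = "pull"
    PUSH = "push"


@dataclass
class ReplicaSession:
    """Toy model of one partition replica's replication session."""

    session_id: int = 0     # assumption: IDs modeled as increasing integers
    mode: Mode = Mode.PULL  # sessions start in pull mode

    def transition_to_push(self) -> None:
        # Assumption: the session ID is kept when moving to push mode.
        self.mode = Mode.PUSH

    def fall_back_to_pull(self) -> None:
        # On an issue, the session returns to pull mode with a new session ID.
        self.mode = Mode.PULL
        self.session_id += 1
```

In this model, a replica whose session ID keeps changing is one that repeatedly falls back to pull mode, which matches the frequent session transitions described under Common Session Issues.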

Session State Indicators

Use the kafka-replica-status CLI tool to see the current replication mode of each replica. For more information about using this tool, see Monitoring replicas.
Monitor PushSessionEndCount for session transition frequency.
Look for session-related messages in broker logs.

Common Session Issues

Frequent session transitions: May indicate network instability or follower performance issues.
Stuck sessions: Check StoppingPushSessionsCount for sessions that aren’t properly ending.
Session ID mismatches: Usually resolved automatically but may indicate timing issues.

Follow general troubleshooting steps

Check Broker Logs: Examine broker logs for specific error messages related to Intelligent Replication, particularly messages about session transitions and replication mode changes.
Verify Configuration: Ensure that confluent.intelligent.replication.enable=true is properly set and that brokers have been restarted.
Monitor System Resources: Check CPU, memory, and network utilization on broker nodes. Push replication should reduce CPU usage compared to pull replication.
Test Network Connectivity: Verify network connectivity between leader and follower brokers. Push replication requires reliable network communication from leaders to followers.
Check ISR Membership: Ensure replicas are properly joining the In-Sync Replica (ISR) set, as this is required for transition to push mode.
Gradual Rollback: If issues persist, consider temporarily disabling Intelligent Replication by setting confluent.intelligent.replication.enable=false and restarting brokers.
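For the Verify Configuration step, a quick check of the configuration text can rule out a typo or a missing property. The sketch below assumes the broker configuration is available to you as properties-style key=value text; how you obtain that text depends on how your deployment manages broker configuration. The function names are hypothetical, and the parser handles only simple lines (no line continuations or escapes).

```python
from typing import Dict

PROPERTY = "confluent.intelligent.replication.enable"


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def intelligent_replication_enabled(text: str) -> bool:
    """Return True only when the property is present and set to true."""
    return parse_properties(text).get(PROPERTY, "").lower() == "true"
```

This check only confirms the configured value. It does not confirm that brokers were restarted after the change, so pair it with the PushPartitionsCount metric to verify that push replication is actually in use.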