Troubleshoot ZooKeeper to KRaft Migration Issues
This topic helps you diagnose and resolve issues during ZooKeeper to KRaft migration.
Troubleshoot migration issues
Perform these tasks to troubleshoot issues during a migration:
Collect the support bundle when the migration job is stuck.
Capture the output of the following command, if possible:
kubectl get kraftmigrationjob <migration-job-name> \
  -n <namespace> \
  -oyaml > kraftmigrationjob.yaml
Enable debug logging for CFK if the CFK log lacks sufficient detail to troubleshoot the migration issue.
Look for the CFK log entry that confirms when the migration job is started. For example:
2024-10-15T19:52:31.435Z INFO kraftmigrationjob log/log.go:31 Migration current status SETUP/SubPhaseSetupAddMigrationAnnotation: <nil>, requeuing ... {"name": "kraftmigrationjob", "namespace": "my-namespace"}
Understand the phase/subphase flow and the log messages in the CFK log. For example:
2024-10-15T18:42:56.734Z INFO kraftmigrationjob log/log.go:31 Migration current status MIGRATE/SubPhaseMigrateEnsureKafkaRollComplete: <nil>, requeuing ... {"name": "kraftmigrationjob", "namespace": "my-namespace"}
2024-10-15T18:42:56.740Z INFO kraftmigrationjob log/log.go:31 Migration current status MIGRATE/SubPhaseMigrateEnsureKafkaRollComplete: kafka [kafka] in namespace [my-namespace] roll not complete yet, requeuing {"name": "kraftmigrationjob", "namespace": "my-namespace"}
To help with debugging, enable TRACE level logging for metadata migration:
log4j.logger.org.apache.kafka.metadata.migration=TRACE, stdout
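When verbose logging is on, the operator log grows quickly. The following sketch condenses it into a phase/subphase timeline; it keys off the "Migration current status" wording shown in the examples above, which may differ between CFK versions, so treat the pattern as an assumption:

```shell
# Condense a CFK operator log into a phase/subphase timeline. The sample lines
# below are the example log entries from this topic; in practice, pipe the
# real log instead:
#   kubectl logs deployment/confluent-operator -n <namespace> | \
#     grep -o 'Migration current status [A-Z-]*/[A-Za-z]*' | uniq -c
log='2024-10-15T18:42:56.734Z INFO kraftmigrationjob log/log.go:31 Migration current status MIGRATE/SubPhaseMigrateEnsureKafkaRollComplete: <nil>, requeuing
2024-10-15T18:42:56.740Z INFO kraftmigrationjob log/log.go:31 Migration current status MIGRATE/SubPhaseMigrateEnsureKafkaRollComplete: kafka [kafka] in namespace [my-namespace] roll not complete yet, requeuing'
printf '%s\n' "$log" | grep -o 'Migration current status [A-Z-]*/[A-Za-z]*' | uniq -c
```

Each output line shows how many consecutive log entries reported the same phase/subphase, which makes long stalls in one subphase easy to spot.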
Check whether you are running into one of the Known Issues in ZooKeeper to KRaft Migration.
Resolve common migration issues
This section describes common issues you may encounter during migration, along with their symptoms, root causes, and resolutions.
Issue 1: KRaftController pods stuck in Init/Pending state after starting migration
Symptoms:
KRaftController pods remain in Init or Pending state even after you apply KRaftMigrationJob.
Migration job phase is stuck at SubPhaseSetupMutateKRaftController or SubPhaseSetupKRaftControllerHealthy.
No progress for more than 15 minutes after starting migration.
Root Cause:
Missing or incorrect platform.confluent.io/kraft-migration-hold-krc-creation annotation.
Missing platform.confluent.io/use-log4j1 annotation. This applies to CFK 3.0 or later.
KRaftController CR not created with the proper cluster ID or migration configuration.
Resolution:
Check if hold annotation exists:
kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> \
-o jsonpath='{.metadata.annotations.platform\.confluent\.io/kraft-migration-hold-krc-creation}'
Check if Log4j1 annotation exists (for CFK 3.0 or later):
kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> \
-o jsonpath='{.metadata.annotations.platform\.confluent\.io/use-log4j1}'
Check the CFK operator logs for specific errors:
kubectl logs deployment/confluent-operator -n <namespace> | grep -i kraftcontroller
If annotations are missing, the migration job adds them automatically during the SETUP phase. If stuck, verify the KRaftMigrationJob dependencies match your actual resource names.
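To check both annotations in one pass, you can grep a saved manifest. A minimal sketch follows; the stand-in manifest only illustrates the check, and in practice you would feed it the real output of kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> -oyaml:

```shell
# Check which migration-related annotations are present in a KRaftController
# manifest. The stand-in manifest below is a hypothetical minimal example;
# replace it with your actual manifest.
manifest='metadata:
  annotations:
    platform.confluent.io/kraft-migration-hold-krc-creation: "true"'
for ann in kraft-migration-hold-krc-creation use-log4j1; do
  if printf '%s\n' "$manifest" | grep -q "platform.confluent.io/${ann}"; then
    echo "present: ${ann}"
  else
    echo "MISSING: ${ann}"
  fi
done
```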
Issue 2: Migration stuck at MIGRATE phase - metadata not copying
Symptoms:
Migration phase shows MIGRATE with subphase SubPhaseMigrateMonitorMigrationProgress.
Stuck for more than 30 minutes.
Migration never reaches the DUAL-WRITE phase.
KRaftController logs show authentication or authorization errors.
Root Cause:
RBAC configuration mismatch between Kafka and KRaftController.
KRaftController cannot authenticate to the Kafka cluster.
Missing super user permissions.
Password encoder configuration not migrated for cluster linking.
Resolution:
Check KRaftController logs for authentication errors:
kubectl logs <kraftcontroller-pod-name> -n <namespace> | grep -i "auth\|permission\|denied"
Verify role-based access control (RBAC) configuration matches:
kubectl get kafka <kafka-name> -n <namespace> -o jsonpath='{.spec.authorization.superUsers}'
kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> \
-o jsonpath='{.spec.authorization.superUsers}'
Check if KRaftController can reach Metadata Service (MDS):
kubectl exec <kraftcontroller-pod-name> -n <namespace> -- curl -k -s -o /dev/null \
-w "%{http_code}" https://kafka:8090/security/1.0/authenticate
List both User:kafka and User:kraftcontroller as super users in both Kafka and KRaftController CRs. If you use cluster linking, verify that the password encoder configuration exists in KRaftController.
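For reference, a sketch of matching superUsers entries in both CRs. The field layout mirrors the jsonpath queries above, but the rbac type value and exact structure are assumptions, so verify them against your existing CRs before applying:

```yaml
# Sketch: keep superUsers aligned across both CRs (assumed field layout).
# Kafka CR
spec:
  authorization:
    type: rbac
    superUsers:
      - User:kafka
      - User:kraftcontroller
---
# KRaftController CR
spec:
  authorization:
    type: rbac
    superUsers:
      - User:kafka
      - User:kraftcontroller
```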
Issue 3: Wrong IBP version applied
Symptoms:
Kafka brokers fail to start after migration job begins.
Error message: “Invalid value X.X for configuration inter.broker.protocol.version”.
Migration stuck in SETUP phase.
Root Cause:
IBP version annotation does not match Confluent Platform version.
Resolution:
Check Confluent Platform version:
kubectl get kafka <kafka-name> -n <namespace> -o jsonpath='{.spec.image.application}'
Expected output example: confluentinc/cp-server:7.9.0
Check the IBP mapping. Refer to the Confluent Platform Upgrade Guide:
| Confluent Platform Version | Kafka Version | Required IBP |
|---|---|---|
| 7.6.x | 3.6.x | 3.6 |
| 7.7.x | 3.7.x | 3.7 |
| 7.8.x | 3.8.x | 3.8 |
| 7.9.x | 3.9.x | 3.9 |
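When you need to check several clusters, the mapping above can be scripted. A sketch, covering only the 7.6.x-7.9.x versions listed in the table:

```shell
# Derive the required IBP from a cp-server image tag, using the table above.
# Only CP 7.6.x-7.9.x are mapped here; this helper is illustrative.
image="confluentinc/cp-server:7.9.0"   # e.g. from .spec.image.application
tag="${image##*:}"                     # 7.9.0
cp_minor="${tag%.*}"                   # 7.9
case "$cp_minor" in
  7.6) ibp="3.6" ;;
  7.7) ibp="3.7" ;;
  7.8) ibp="3.8" ;;
  7.9) ibp="3.9" ;;
  *)   echo "unmapped Confluent Platform version: $cp_minor" >&2; exit 1 ;;
esac
echo "required IBP: $ibp"
```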
Fix the annotation by updating with the correct IBP version:
kubectl annotate kafka <kafka-name> \
platform.confluent.io/kraft-migration-ibp-version="<correct-version>" \
-n <namespace> --overwrite
Delete the migration job:
kubectl delete kraftmigrationjob <migration-job-name> -n <namespace>
Reapply the migration job:
kubectl apply -f kraftmigrationjob.yaml
Issue 4: Migration job status shows FAILURE
Symptoms:
KRaftMigrationJob phase is FAILURE.
Migration stopped progressing.
Resolution:
Get detailed error message:
kubectl get kraftmigrationjob <migration-job-name> -n <namespace> -o yaml
Look at status.message and status.conditions in the output.
Check CFK operator logs:
kubectl logs deployment/confluent-operator -n <namespace> --tail=100 | grep -i error
Verify resource locks are in place:
kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock
kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock
Collect support bundle:
Follow instructions at Support bundle.
Contact Confluent Support:
Provide support bundle
Include kraftmigrationjob YAML
Include CFK operator logs
Describe when error occurred (which phase/subphase)
Issue 5: Kafka broker rolls taking unusually long (>10 minutes per pod)
Symptoms:
Each Kafka pod restart takes more than 10 minutes.
Migration SETUP phase exceeds 2 hours for a three-broker cluster.
Pods show Running status but are not ready.
Readiness probes failing.
Root Cause:
Readiness probe timeout too aggressive for migration workload.
Large metadata size requiring extended sync time.
Under-provisioned resources (CPU or memory).
Network latency between Kafka and KRaftController.
Resolution:
Check pod readiness status:
kubectl get pods -n <namespace> -l app=kafka -o wide
Check pod events for readiness probe failures:
kubectl describe pod <kafka-pod-name> -n <namespace> | grep -A 10 Events
Check pod resource usage:
kubectl top pod <kafka-pod-name> -n <namespace>
Review Kafka logs for slow operations:
kubectl logs <kafka-pod-name> -n <namespace> | grep -i "slow\|timeout\|loading"
If readiness probes are failing, verify the probe configuration in your Kafka CR allows sufficient time (for example, initialDelaySeconds: 60, timeoutSeconds: 10). For large clusters with extensive metadata, this behavior is expected. Allow additional time for metadata synchronization.
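A sketch of a relaxed readiness probe in the Kafka CR. Whether podTemplate.probe exposes these fields depends on your CFK version, so treat this fragment as an assumption and verify it against your CRD before applying:

```yaml
# Kafka CR excerpt: relax the readiness probe for migration rolls.
# Field support under podTemplate.probe varies by CFK version (assumed layout).
spec:
  podTemplate:
    probe:
      readiness:
        initialDelaySeconds: 60
        timeoutSeconds: 10
        failureThreshold: 6
```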
Issue 6: DUAL-WRITE validation fails - ZkMigrationState shows 0 instead of 1
Symptoms:
Migration job phase shows DUAL-WRITE.
Jolokia metric ZkMigrationState returns 0 (PRE_MIGRATION) instead of 1 (DUAL_WRITE).
Cluster appears to still be in pre-migration mode.
Topic operations create metadata in ZooKeeper but not in KRaft.
Root Cause:
Migration process did not fully complete the transition to DUAL-WRITE.
Kafka broker migration configuration not properly applied.
Controller broker not elected or migration leader not established.
Network partition between Kafka and KRaftController.
Resolution:
Check migration job status message:
kubectl get kraftmigrationjob <migration-job-name> -n <namespace> \
-o jsonpath='{.status.message}'
Verify Kafka has migration configuration applied:
kubectl exec <kafka-pod-name> -n <namespace> -- \
grep "controller.quorum.voters" /opt/confluentinc/etc/kafka/kafka.properties
Check if migration leader is elected:
kubectl logs <kafka-pod-name> -n <namespace> | grep -i "migration.*leader"
Verify connectivity between Kafka and KRaftController:
kubectl exec <kafka-pod-name> -n <namespace> -- nc -zv kraftcontroller-0.kraftcontroller 9093
If the migration job status shows DUAL-WRITE but Jolokia shows 0, check CFK operator logs for errors during the MIGRATE phase.
Check if migration annotation is present on Kafka CR:
kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep migration
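To read the ZkMigrationState value itself, you can parse a Jolokia response. The MBean name, the port placeholder, and the response shape below are assumptions based on Apache Kafka's migration metrics; the sample response stands in for saved output so the parsing is shown end to end:

```shell
# Parse a Jolokia read response for ZkMigrationState. In practice, save the
# response first, for example (port and MBean name are assumptions; verify
# against your deployment):
#   kubectl exec <kafka-pod-name> -n <namespace> -- curl -s \
#     "http://localhost:<jolokia-port>/jolokia/read/kafka.controller:type=KafkaController,name=ZkMigrationState" > zkms.json
response='{"request":{"mbean":"kafka.controller:type=KafkaController,name=ZkMigrationState","type":"read"},"value":{"Value":1},"status":200}'
state=$(printf '%s' "$response" | grep -o '"Value":[0-9]*' | head -1 | cut -d: -f2)
if [ "$state" = "1" ]; then
  echo "DUAL_WRITE"
else
  echo "not in DUAL_WRITE (state=$state)"
fi
```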
Issue 7: Post-migration: Cannot modify Kafka CR - "resource locked" error
Symptoms:
After the migration completes, attempts to modify the Kafka CR fail with an error.
Webhook blocks changes with message about migration lock.
Cannot scale Kafka cluster or update configuration.
GitOps/FluxCD reconciliation fails.
Root Cause:
Migration locks not released after migration completion.
Missing platform.confluent.io/kraft-migration-release-cr-lock annotation.
Migration job still holds locks on the Kafka, ZooKeeper, and KRaftController CRs.
Resolution:
Check if migration locks are still present:
kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock
Release migration locks:
kubectl annotate kraftmigrationjob <migration-job-name> \
platform.confluent.io/kraft-migration-release-cr-lock=true \
-n <namespace>
Verify locks are removed from Kafka CR:
kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock
Expected: No output (locks removed)
Verify locks are removed from KRaftController CR:
kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> -o yaml | \
grep kraft-migration-cr-lock
Expected: No output (locks removed)
After releasing locks, you can modify the Kafka and KRaftController CRs normally. Releasing locks is a mandatory post-migration task documented in ZooKeeper to KRaft Migration.
Is my migration stuck or just slow?
Use this guide to determine if your migration is stuck or progressing normally.
Quick check commands
Run these commands in order:
Check current phase:
kubectl get kraftmigrationjob <migration-job-name> -n <namespace> \
  -o jsonpath='{.status.phase}'
Check current subphase:
kubectl get kraftmigrationjob <migration-job-name> -n <namespace> \
  -o jsonpath='{.status.subPhase}'
Check how long in current phase:
kubectl get kraftmigrationjob <migration-job-name> -n <namespace> -o yaml | \
  grep lastTransitionTime
Check for errors:
kubectl logs deployment/confluent-operator -n <namespace> --tail=50 | \
  grep -i error
Check pod status:
kubectl get pods -n <namespace>
# Look for: CrashLoopBackOff, Error, or Pending (for >10 min)
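To turn lastTransitionTime into "minutes in current phase", a quick sketch using GNU date; the timestamp below is a placeholder you replace with the value from the command above:

```shell
# Minutes since the migration job's last phase transition (GNU date syntax;
# on macOS use: date -j -f '%Y-%m-%dT%H:%M:%SZ' "$last" +%s instead).
last="2024-10-15T18:42:56Z"   # placeholder: paste your lastTransitionTime
now=$(date -u +%s)
start=$(date -u -d "$last" +%s)
echo "$(( (now - start) / 60 )) minutes in current phase"
```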
Decision criteria
Migration is progressing:
Pods are rolling or restarting one at a time
No ERROR messages in CFK logs
Phase progression: SETUP, MIGRATE, DUAL-WRITE
Total time in current phase is less than 45 minutes
Migration is slow but OK:
Kafka or KRaftController pod is restarting
CFK logs show “waiting for roll complete”
No errors in logs
This is normal during broker rolls. Be patient.
Migration is likely stuck:
Pods in CrashLoopBackOff
ERROR messages in CFK operator logs
Kafka or KRaftController pods not Running
Phase is FAILURE
Detailed phase expectations
| Phase/Subphase | What's Happening | Action |
|---|---|---|
| | Kafka rolls, KRaftController starts | Wait, monitor pods |
| | Waiting for Kafka roll | Watch: |
| | KRaftController pods starting | Wait for KRaftController Running |
| | Metadata copy, Kafka roll | Wait, monitor logs |
| | Waiting for Kafka roll | Watch: |
| | Waiting for you | No action needed |
What to do if stuck
If you think your migration is stuck:
Collect information:
# Save migration job status
kubectl get kraftmigrationjob <migration-job-name> -n <namespace> -o yaml > kmj-status.yaml
# Save CFK logs
kubectl logs deployment/confluent-operator -n <namespace> > cfk-operator.log
# Save pod status
kubectl get pods -n <namespace> -o yaml > all-pods.yaml
# Save events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > events.log
Check pod logs if any pod is not Running:
kubectl logs <stuck-pod> -n <namespace>
kubectl describe pod <stuck-pod> -n <namespace>
Contact Confluent Support:
Provide all collected files
Specify: CFK version, Confluent Platform version, which phase or subphase is stuck
Include timeline (when started, how long stuck)
Prevention tips
To avoid getting stuck:
Verify all prerequisites before starting
Ensure nodes have sufficient resources (CPU, memory, disk)
Test in non-production first
Use correct IBP version for your Confluent Platform version
Apply Log4j1 annotation if using CFK 3.0 or later
Have CFK operator logs accessible for monitoring