Troubleshoot ZooKeeper to KRaft Migration Issues

This topic helps you diagnose and resolve issues during ZooKeeper to KRaft migration.

Troubleshoot migration issues

Perform these tasks to troubleshoot issues during a migration:

  • Collect the support bundle when the migration job is stuck.

  • If possible, capture the migration job resource as YAML:

    kubectl get kraftmigrationjob <migration job name> \
      -n <namespace> \
      -o yaml > kraftmigrationjob.yaml
    
  • Enable debug logging for CFK if the default CFK log output does not provide enough detail to troubleshoot the migration issue.

  • Look for the CFK log entry that confirms that the migration job has started. For example:

    2024-10-15T19:52:31.435Z  INFO  kraftmigrationjob log/log.go:31
    Migration current status SETUP/SubPhaseSetupAddMigrationAnnotation: <nil>, requeuing ...  {"name": "kraftmigrationjob", "namespace": "my-namespace"}
    
  • Understand the phase/subphase flow and the log messages in the CFK log. For example:

    2024-10-15T18:42:56.734Z  INFO     kraftmigrationjob  log/log.go:31
    Migration current status MIGRATE/SubPhaseMigrateEnsureKafkaRollComplete: <nil>, requeuing ...{"name": "kraftmigrationjob", "namespace": "my-namespace"}
    
    2024-10-15T18:42:56.740Z  INFO     kraftmigrationjob  log/log.go:31
    Migration current status MIGRATE/SubPhaseMigrateEnsureKafkaRollComplete: kafka [kafka] in namespace [my-namespace] roll not complete yet, requeuing {"name": "kraftmigrationjob", "namespace": "my-namespace"}
    
  • To help with debugging, enable TRACE level logging for metadata migration:

    log4j.logger.org.apache.kafka.metadata.migration=TRACE, stdout
    
  • Check whether you are running into one of the Known Issues in ZooKeeper to KRaft Migration.
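How to enable CFK debug logging depends on how you installed the operator. If you used the Helm chart, one possible approach is to set the chart's `debug` value (an assumption; confirm against your chart's documented values) and upgrade:

```yaml
# values.yaml fragment (assumed chart option): enable verbose operator logging.
# Apply with: helm upgrade confluent-operator confluentinc/confluent-for-kubernetes -f values.yaml
debug: true
```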

Resolve common migration issues

This section describes common issues you may encounter during migration, along with their symptoms, root causes, and resolutions.

Issue 1: KRaftController pods stuck in Init/Pending state after starting migration

Symptoms:

  • KRaftController pods remain in Init or Pending state even after you apply the KRaftMigrationJob CR.

  • Migration job phase is stuck at SubPhaseSetupMutateKRaftController or SubPhaseSetupKRaftControllerHealthy.

  • No progress for more than 15 minutes after starting migration.

Root Cause:

  • Missing or incorrect platform.confluent.io/kraft-migration-hold-krc-creation annotation.

  • Missing platform.confluent.io/use-log4j1 annotation (applies to CFK 3.0 or later).

  • KRaftController CR not created with proper cluster ID or migration configuration.

Resolution:

Check if the hold annotation exists:

kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> \
  -o jsonpath='{.metadata.annotations.platform\.confluent\.io/kraft-migration-hold-krc-creation}'

Check if the Log4j1 annotation exists (for CFK 3.0 or later):

kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> \
  -o jsonpath='{.metadata.annotations.platform\.confluent\.io/use-log4j1}'

Check the CFK operator logs for specific errors:

kubectl logs deployment/confluent-operator -n <namespace> | grep -i kraftcontroller

If annotations are missing, the migration job adds them automatically during the SETUP phase. If the job remains stuck, verify that the KRaftMigrationJob dependencies match your actual resource names.
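As a cross-check, the dependency names in the KRaftMigrationJob must exactly match the names of your Kafka, ZooKeeper, and KRaftController CRs. A minimal sketch follows; the field names reflect the CFK v1beta1 CRD as commonly documented, so verify them against your CFK version:

```yaml
apiVersion: platform.confluent.io/v1beta1
kind: KRaftMigrationJob
metadata:
  name: kraftmigrationjob
  namespace: my-namespace
spec:
  dependencies:
    kafka:
      name: kafka                # must match `kubectl get kafka` in this namespace
      namespace: my-namespace
    zookeeper:
      name: zookeeper            # must match `kubectl get zookeeper`
      namespace: my-namespace
    kRaftController:
      name: kraftcontroller      # must match `kubectl get kraftcontroller`
      namespace: my-namespace
```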

Issue 2: Migration stuck at MIGRATE phase - metadata not copying

Symptoms:

  • Migration phase shows MIGRATE with subphase SubPhaseMigrateMonitorMigrationProgress.

  • Stuck for more than 30 minutes.

  • Migration never reaches the DUAL-WRITE phase.

  • KRaftController logs show authentication or authorization errors.

Root Cause:

  • RBAC configuration mismatch between Kafka and KRaftController.

  • KRaftController cannot authenticate to the Kafka cluster.

  • Missing super user permissions.

  • Password encoder configuration not migrated for cluster linking.

Resolution:

Check KRaftController logs for authentication errors:

kubectl logs <kraftcontroller-pod-name> -n <namespace> | grep -i "auth\|permission\|denied"

Verify role-based access control (RBAC) configuration matches:

kubectl get kafka <kafka-name> -n <namespace> -o jsonpath='{.spec.authorization.superUsers}'
kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> \
  -o jsonpath='{.spec.authorization.superUsers}'

Check if KRaftController can reach Metadata Service (MDS):

kubectl exec <kraftcontroller-pod-name> -n <namespace> -- curl -k -s -o /dev/null \
  -w "%{http_code}" https://kafka:8090/security/1.0/authenticate

Ensure that both User:kafka and User:kraftcontroller are listed as super users in both the Kafka and KRaftController CRs. If you use cluster linking, verify that the password encoder configuration exists in the KRaftController CR.
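For reference, a super user list of this shape would appear in both CRs; this is an illustrative excerpt, not a complete spec:

```yaml
# Excerpt from the Kafka CR (the KRaftController CR takes the same authorization block):
spec:
  authorization:
    type: rbac
    superUsers:
      - User:kafka
      - User:kraftcontroller
```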

Issue 3: Wrong IBP version applied

Symptoms:

  • Kafka brokers fail to start after migration job begins.

  • Error message: “Invalid value X.X for configuration inter.broker.protocol.version”.

  • Migration stuck in SETUP phase.

Root Cause:

The IBP version annotation does not match the Confluent Platform version.

Resolution:

Check Confluent Platform version:

kubectl get kafka <kafka-name> -n <namespace> -o jsonpath='{.spec.image.application}'

Expected output example: confluentinc/cp-server:7.9.0

Check the IBP mapping. Refer to the Confluent Platform Upgrade Guide:

Confluent Platform Version   Kafka Version   Required IBP
7.6.x                        3.6.x           3.6
7.7.x                        3.7.x           3.7
7.8.x                        3.8.x           3.8
7.9.x                        3.9.x           3.9

Fix the annotation by updating with the correct IBP version:

kubectl annotate kafka <kafka-name> \
  platform.confluent.io/kraft-migration-ibp-version="<correct-version>" \
  -n <namespace> --overwrite

Delete the migration job:

kubectl delete kraftmigrationjob <migration-job-name> -n <namespace>

Reapply the migration job:

kubectl apply -f kraftmigrationjob.yaml
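For scripting, the version-to-IBP mapping above can be wrapped in a small helper; `ibp_for_cp` is a hypothetical name, and the function covers only the versions listed in the table:

```shell
# Print the required IBP for a Confluent Platform version (7.6.x through 7.9.x).
ibp_for_cp() {
  case "$1" in
    7.6.*) echo 3.6 ;;
    7.7.*) echo 3.7 ;;
    7.8.*) echo 3.8 ;;
    7.9.*) echo 3.9 ;;
    *) echo "unknown CP version: $1" >&2; return 1 ;;
  esac
}

# Example: ibp_for_cp 7.9.0 prints 3.9
```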

Issue 4: Migration job status shows FAILURE

Symptoms:

  • KRaftMigrationJob phase is FAILURE.

  • Migration stopped progressing.

Resolution:

Get the detailed error message:

kubectl get kraftmigrationjob <migration-job-name> -n <namespace> -o yaml

Look at status.message and status.conditions in the output.

Check CFK operator logs:

kubectl logs deployment/confluent-operator -n <namespace> --tail=100 | grep -i error

Verify resource locks are in place:

kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock
kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock

Collect support bundle:

Follow the instructions in Support bundle.

Contact Confluent Support:

  • Provide support bundle

  • Include kraftmigrationjob YAML

  • Include CFK operator logs

  • Describe when error occurred (which phase/subphase)

Issue 5: Kafka broker rolls taking unusually long (>10 minutes per pod)

Symptoms:

  • Each Kafka pod restart takes more than 10 minutes.

  • Migration SETUP phase exceeds 2 hours for a three-broker cluster.

  • Pods show Running status but are not ready.

  • Readiness probes are failing.

Root Cause:

  • Readiness probe timeout too aggressive for migration workload.

  • Large metadata size requiring extended sync time.

  • Under-provisioned resources (CPU or memory).

  • Network latency between Kafka and KRaftController.

Resolution:

Check pod readiness status:

kubectl get pods -n <namespace> -l app=kafka -o wide

Check pod events for readiness probe failures:

kubectl describe pod <kafka-pod-name> -n <namespace> | grep -A 10 Events

Check pod resource usage:

kubectl top pod <kafka-pod-name> -n <namespace>

Review Kafka logs for slow operations:

kubectl logs <kafka-pod-name> -n <namespace> | grep -i "slow\|timeout\|loading"

If readiness probes are failing, verify the probe configuration in your Kafka CR allows sufficient time (for example, initialDelaySeconds: 60, timeoutSeconds: 10). For large clusters with extensive metadata, this behavior is expected. Allow additional time for metadata synchronization.
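If you need to relax the probe, CFK exposes probe overrides through the Kafka CR's pod template; the exact field path below is an assumption, so confirm it against your CFK CRD reference before applying:

```yaml
# Kafka CR excerpt: give brokers more time before the readiness probe fails.
spec:
  podTemplate:
    probe:
      readiness:
        initialDelaySeconds: 60
        timeoutSeconds: 10
        failureThreshold: 12   # tolerate a longer metadata sync window
```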

Issue 6: DUAL-WRITE validation fails - ZkMigrationState shows 0 instead of 1

Symptoms:

  • Migration job phase shows DUAL-WRITE.

  • Jolokia metric ZkMigrationState returns 0 (PRE_MIGRATION) instead of 1 (DUAL_WRITE).

  • Cluster appears to still be in pre-migration mode.

  • Topic operations create metadata in ZooKeeper but not in KRaft.

Root Cause:

  • Migration process did not fully complete the transition to DUAL-WRITE.

  • Kafka broker migration configuration not properly applied.

  • Controller broker not elected or migration leader not established.

  • Network partition between Kafka and KRaftController.

Resolution:

Check migration job status message:

kubectl get kraftmigrationjob <migration-job-name> -n <namespace> \
  -o jsonpath='{.status.message}'

Verify Kafka has migration configuration applied:

kubectl exec <kafka-pod-name> -n <namespace> -- \
  grep "controller.quorum.voters" /opt/confluentinc/etc/kafka/kafka.properties

Check if migration leader is elected:

kubectl logs <kafka-pod-name> -n <namespace> | grep -i "migration.*leader"

Verify connectivity between Kafka and KRaftController:

kubectl exec <kafka-pod-name> -n <namespace> -- nc -zv kraftcontroller-0.kraftcontroller 9093

If the migration job status shows DUAL-WRITE but Jolokia shows 0, check CFK operator logs for errors during the MIGRATE phase.

Check if migration annotation is present on Kafka CR:

kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep migration
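To poll the metric itself, a small wrapper over the broker's Jolokia endpoint can help; the port 7777 and the `zk_migration_state` helper name are assumptions to adapt to your deployment:

```shell
# Read the ZkMigrationState MBean via Jolokia and print its numeric value.
# 0 = PRE_MIGRATION, 1 = DUAL_WRITE.
zk_migration_state() {
  local pod="$1" ns="$2"
  kubectl exec "$pod" -n "$ns" -- curl -s \
    "http://localhost:7777/jolokia/read/kafka.controller:type=KafkaController,name=ZkMigrationState" \
    | grep -o '"value":[0-9]*' | cut -d: -f2
}

# Usage: zk_migration_state kafka-0 my-namespace
```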

Issue 7: Post-migration: Cannot modify Kafka CR - "resource locked" error

Symptoms:

  • After the migration completes, attempts to modify the Kafka CR fail with an error.

  • Webhook blocks changes with message about migration lock.

  • Cannot scale Kafka cluster or update configuration.

  • GitOps/FluxCD reconciliation fails.

Root Cause:

  • Migration locks not released after migration completion.

  • Missing platform.confluent.io/kraft-migration-release-cr-lock annotation.

  • Migration job still holds locks on Kafka, ZooKeeper, and KRaftController CRs.

Resolution:

Check if migration locks are still present:

kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock

Release migration locks:

kubectl annotate kraftmigrationjob <migration-job-name> \
  platform.confluent.io/kraft-migration-release-cr-lock=true \
  -n <namespace>

Verify locks are removed from Kafka CR:

kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock

Expected: No output (locks removed)

Verify locks are removed from KRaftController CR:

kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> -o yaml | \
  grep kraft-migration-cr-lock

Expected: No output (locks removed)

After releasing locks, you can modify the Kafka and KRaftController CRs normally. Releasing locks is a mandatory post-migration task documented in ZooKeeper to KRaft Migration.

Is my migration stuck or just slow?

Use this guide to determine if your migration is stuck or progressing normally.

Quick check commands

Run these commands in order:

  1. Check current phase:

    kubectl get kraftmigrationjob <migration-job-name> -n <namespace> \
      -o jsonpath='{.status.phase}'
    
  2. Check current subphase:

    kubectl get kraftmigrationjob <migration-job-name> -n <namespace> \
      -o jsonpath='{.status.subPhase}'
    
  3. Check how long in current phase:

    kubectl get kraftmigrationjob <migration-job-name> -n <namespace> -o yaml | \
      grep lastTransitionTime
    
  4. Check for errors:

    kubectl logs deployment/confluent-operator -n <namespace> --tail=50 | \
      grep -i error
    
  5. Check pod status:

    kubectl get pods -n <namespace>
    # Look for: CrashLoopBackOff, Error, or Pending (for >10 min)
    

Decision criteria

Migration is progressing:

  • Pods are rolling or restarting one at a time

  • No ERROR messages in CFK logs

  • Phase progression: SETUP, MIGRATE, DUAL-WRITE

  • Total time in current phase is less than 45 minutes

Migration is slow but OK:

  • Kafka or KRaftController pod is restarting

  • CFK logs show “waiting for roll complete”

  • No errors in logs

This is normal during broker rolls. Be patient.

Migration is likely stuck:

  • Pods in CrashLoopBackOff

  • ERROR messages in CFK operator logs

  • Kafka or KRaftController pods not Running

  • Phase is FAILURE
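The criteria above can be condensed into a quick triage helper for scripts; this is an illustrative sketch (`triage_migration` is a hypothetical name, and the 45-minute threshold mirrors the guidance above):

```shell
# Classify a migration as stuck, slow, or ok from the job phase, the minutes
# spent in the current phase, and whether the CFK logs contain errors (yes|no).
triage_migration() {
  local phase="$1" minutes="$2" has_errors="$3"
  if [ "$phase" = "FAILURE" ] || [ "$has_errors" = "yes" ]; then
    echo stuck
  elif [ "$minutes" -gt 45 ]; then
    echo slow
  else
    echo ok
  fi
}

# Example: triage_migration MIGRATE 50 no prints slow
```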

Detailed phase expectations

Phase/Subphase                           What's Happening                      Action
SETUP                                    Kafka rolls, KRaftController starts   Wait, monitor pods
SubPhaseSetupEnsureIBPUpgradeComplete    Waiting for Kafka roll                Watch: kubectl get pods
SubPhaseSetupKRaftControllerHealthy      KRaftController pods starting         Wait for KRaftController Running
MIGRATE                                  Metadata copy, Kafka roll             Wait, monitor logs
SubPhaseMigrateEnsureKafkaRollComplete   Waiting for Kafka roll                Watch: kubectl get pods
DUAL-WRITE                               Waiting for you                       No action needed

What to do if stuck

If you think your migration is stuck:

Collect information:

# Save migration job status
kubectl get kraftmigrationjob <migration-job-name> -n <namespace> -o yaml > kmj-status.yaml

# Save CFK logs
kubectl logs deployment/confluent-operator -n <namespace> > cfk-operator.log

# Save pod status
kubectl get pods -n <namespace> -o yaml > all-pods.yaml

# Save events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > events.log

Check pod logs if any pod is not Running:

kubectl logs <stuck-pod> -n <namespace>
kubectl describe pod <stuck-pod> -n <namespace>

Contact Confluent Support:

  • Provide all collected files

  • Specify: CFK version, Confluent Platform version, which phase or subphase is stuck

  • Include timeline (when started, how long stuck)

Prevention tips

To avoid getting stuck:

  • Verify all prerequisites before starting

  • Ensure nodes have sufficient resources (CPU, memory, disk)

  • Test in non-production first

  • Use correct IBP version for your Confluent Platform version

  • Apply Log4j1 annotation if using CFK 3.0 or later

  • Have CFK operator logs accessible for monitoring