Troubleshoot ZooKeeper to KRaft Migration Issues

This topic helps you diagnose and resolve issues during ZooKeeper to KRaft migration.

Troubleshoot migration issues

Perform these tasks to troubleshoot issues during a migration:

  • Collect the support bundle when the migration job is stuck.

  • If possible, capture the migration job resource as YAML:

    kubectl get kraftmigrationjob <migration job name> \
      -n <namespace> \
      -o yaml > kraftmigrationjob.yaml
    
  • Enable debug logging for CFK if the default CFK log output does not provide enough detail to troubleshoot the migration issue.

  • Look for the CFK log entry that confirms that the migration job has started. For example:

    2024-10-15T19:52:31.435Z  INFO  kraftmigrationjob log/log.go:31
    Migration current status SETUP/SubPhaseSetupAddMigrationAnnotation: <nil>, requeuing ...  {"name": "kraftmigrationjob", "namespace": "my-namespace"}
    
  • Understand the phase/subphase flow and the log messages in the CFK log. For example:

    2024-10-15T18:42:56.734Z  INFO     kraftmigrationjob  log/log.go:31
    Migration current status MIGRATE/SubPhaseMigrateEnsureKafkaRollComplete: <nil>, requeuing ...{"name": "kraftmigrationjob", "namespace": "my-namespace"}
    
    2024-10-15T18:42:56.740Z  INFO     kraftmigrationjob  log/log.go:31
    Migration current status MIGRATE/SubPhaseMigrateEnsureKafkaRollComplete: kafka [kafka] in namespace [my-namespace] roll not complete yet, requeuing {"name": "kraftmigrationjob", "namespace": "my-namespace"}
    
  • To help with debugging, enable TRACE level logging for metadata migration:

    log4j.logger.org.apache.kafka.metadata.migration=TRACE, stdout
    
  • Check whether you are running into one of the Known Issues in ZooKeeper to KRaft Migration.
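How to enable CFK debug logging depends on how you installed the operator. If you used the Helm chart, one possible approach is to set the chart's `debug` value (an assumption; confirm against your chart's documented values) and upgrade:

```yaml
# values.yaml fragment (assumed chart option): enable verbose operator logging.
# Apply with: helm upgrade confluent-operator confluentinc/confluent-for-kubernetes -f values.yaml
debug: true
```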

Resolve common migration issues

This section describes common issues you may encounter during migration, along with their symptoms, root causes, and resolutions.

Issue 1: KRaftController pods stuck in Init/Pending state after starting migration

Symptoms:

  • KRaftController pods remain in Init or Pending state even after you apply the KRaftMigrationJob CR.

  • Migration job phase is stuck at SubPhaseSetupMutateKRaftController or SubPhaseSetupKRaftControllerHealthy.

  • No progress for more than 15 minutes after starting migration.

Root Cause:

  • Missing or incorrect platform.confluent.io/kraft-migration-hold-krc-creation annotation.

  • Missing platform.confluent.io/use-log4j1 annotation (applies to CFK 3.0 or later).

  • KRaftController CR not created with proper cluster ID or migration configuration.

Resolution:

Check if the hold annotation exists:

kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> \
  -o jsonpath='{.metadata.annotations.platform\.confluent\.io/kraft-migration-hold-krc-creation}'

Check if the Log4j1 annotation exists (for CFK 3.0 or later):

kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> \
  -o jsonpath='{.metadata.annotations.platform\.confluent\.io/use-log4j1}'

Check the CFK operator logs for specific errors:

kubectl logs deployment/confluent-operator -n <namespace> | grep -i kraftcontroller

If annotations are missing, the migration job adds them automatically during the SETUP phase. If the job remains stuck, verify that the KRaftMigrationJob dependencies match your actual resource names.
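As a cross-check, the dependency names in the KRaftMigrationJob must exactly match the names of your Kafka, ZooKeeper, and KRaftController CRs. A minimal sketch follows; the field names reflect the CFK v1beta1 CRD as commonly documented, so verify them against your CFK version:

```yaml
apiVersion: platform.confluent.io/v1beta1
kind: KRaftMigrationJob
metadata:
  name: kraftmigrationjob
  namespace: my-namespace
spec:
  dependencies:
    kafka:
      name: kafka                # must match `kubectl get kafka` in this namespace
      namespace: my-namespace
    zookeeper:
      name: zookeeper            # must match `kubectl get zookeeper`
      namespace: my-namespace
    kRaftController:
      name: kraftcontroller      # must match `kubectl get kraftcontroller`
      namespace: my-namespace
```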

Issue 2: Migration stuck at MIGRATE phase - metadata not copying

Symptoms:

  • Migration phase shows MIGRATE with subphase SubPhaseMigrateMonitorMigrationProgress.

  • Stuck for more than 30 minutes.

  • Migration never reaches the DUAL-WRITE phase.

  • KRaftController logs show authentication or authorization errors.

Root Cause:

  • RBAC configuration mismatch between Kafka and KRaftController.

  • KRaftController cannot authenticate to the Kafka cluster.

  • Missing super user permissions.

  • Password encoder configuration not migrated for cluster linking.

Resolution:

Check KRaftController logs for authentication errors:

kubectl logs <kraftcontroller-pod-name> -n <namespace> | grep -i "auth\|permission\|denied"

Verify role-based access control (RBAC) configuration matches:

kubectl get kafka <kafka-name> -n <namespace> -o jsonpath='{.spec.authorization.superUsers}'
kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> \
  -o jsonpath='{.spec.authorization.superUsers}'

Check if KRaftController can reach Metadata Service (MDS):

kubectl exec <kraftcontroller-pod-name> -n <namespace> -- curl -k -s -o /dev/null \
  -w "%{http_code}" https://kafka:8090/security/1.0/authenticate

Ensure that both User:kafka and User:kraftcontroller are listed as super users in both the Kafka and KRaftController CRs. If you use cluster linking, verify that the password encoder configuration exists in the KRaftController CR.
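For reference, a super user list of this shape would appear in both CRs; this is an illustrative excerpt, not a complete spec:

```yaml
# Excerpt from the Kafka CR (the KRaftController CR takes the same authorization block):
spec:
  authorization:
    type: rbac
    superUsers:
      - User:kafka
      - User:kraftcontroller
```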

Issue 3: Wrong IBP version applied

Symptoms:

  • Kafka brokers fail to start after migration job begins.

  • Error message: “Invalid value X.X for configuration inter.broker.protocol.version”.

  • Migration stuck in SETUP phase.

Root Cause:

The IBP version annotation does not match the Confluent Platform version.

Resolution:

Check Confluent Platform version:

kubectl get kafka <kafka-name> -n <namespace> -o jsonpath='{.spec.image.application}'

Expected output example: confluentinc/cp-server:7.9.0

Check the IBP mapping. Refer to the Confluent Platform Upgrade Guide:

Confluent Platform Version   Kafka Version   Required IBP
7.6.x                        3.6.x           3.6
7.7.x                        3.7.x           3.7
7.8.x                        3.8.x           3.8
7.9.x                        3.9.x           3.9

Fix the annotation by updating with the correct IBP version:

kubectl annotate kafka <kafka-name> \
  platform.confluent.io/kraft-migration-ibp-version="<correct-version>" \
  -n <namespace> --overwrite

Delete the migration job:

kubectl delete kraftmigrationjob <migration-job-name> -n <namespace>

Reapply the migration job:

kubectl apply -f kraftmigrationjob.yaml
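For scripting, the version-to-IBP mapping above can be wrapped in a small helper; `ibp_for_cp` is a hypothetical name, and the function covers only the versions listed in the table:

```shell
# Print the required IBP for a Confluent Platform version (7.6.x through 7.9.x).
ibp_for_cp() {
  case "$1" in
    7.6.*) echo 3.6 ;;
    7.7.*) echo 3.7 ;;
    7.8.*) echo 3.8 ;;
    7.9.*) echo 3.9 ;;
    *) echo "unknown CP version: $1" >&2; return 1 ;;
  esac
}

# Example: ibp_for_cp 7.9.0 prints 3.9
```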

Issue 4: Migration job status shows FAILURE

Symptoms:

  • KRaftMigrationJob phase is FAILURE.

  • Migration stopped progressing.

Resolution:

Get the detailed error message:

kubectl get kraftmigrationjob <migration-job-name> -n <namespace> -o yaml

Look at status.message and status.conditions in the output.

Check CFK operator logs:

kubectl logs deployment/confluent-operator -n <namespace> --tail=100 | grep -i error

Verify resource locks are in place:

kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock
kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock

Collect support bundle:

Follow the instructions in Support bundle.

Contact Confluent Support:

  • Provide support bundle

  • Include kraftmigrationjob YAML

  • Include CFK operator logs

  • Describe when error occurred (which phase/subphase)

Issue 5: Kafka broker rolls taking unusually long (>10 minutes per pod)

Symptoms:

  • Each Kafka pod restart takes more than 10 minutes.

  • Migration SETUP phase exceeds 2 hours for a three-broker cluster.

  • Pods show Running status but are not ready.

  • Readiness probes are failing.

Root Cause:

  • Readiness probe timeout too aggressive for migration workload.

  • Large metadata size requiring extended sync time.

  • Under-provisioned resources (CPU or memory).

  • Network latency between Kafka and KRaftController.

Resolution:

Check pod readiness status:

kubectl get pods -n <namespace> -l app=kafka -o wide

Check pod events for readiness probe failures:

kubectl describe pod <kafka-pod-name> -n <namespace> | grep -A 10 Events

Check pod resource usage:

kubectl top pod <kafka-pod-name> -n <namespace>

Review Kafka logs for slow operations:

kubectl logs <kafka-pod-name> -n <namespace> | grep -i "slow\|timeout\|loading"

If readiness probes are failing, verify the probe configuration in your Kafka CR allows sufficient time (for example, initialDelaySeconds: 60, timeoutSeconds: 10). For large clusters with extensive metadata, this behavior is expected. Allow additional time for metadata synchronization.
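If you need to relax the probe, CFK exposes probe overrides through the Kafka CR's pod template; the exact field path below is an assumption, so confirm it against your CFK CRD reference before applying:

```yaml
# Kafka CR excerpt: give brokers more time before the readiness probe fails.
spec:
  podTemplate:
    probe:
      readiness:
        initialDelaySeconds: 60
        timeoutSeconds: 10
        failureThreshold: 12   # tolerate a longer metadata sync window
```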

Issue 6: DUAL-WRITE validation fails - ZkMigrationState shows 0 instead of 1

Symptoms:

  • Migration job phase shows DUAL-WRITE.

  • Jolokia metric ZkMigrationState returns 0 (PRE_MIGRATION) instead of 1 (DUAL_WRITE).

  • Cluster appears to still be in pre-migration mode.

  • Topic operations create metadata in ZooKeeper but not in KRaft.

Root Cause:

  • Migration process did not fully complete the transition to DUAL-WRITE.

  • Kafka broker migration configuration not properly applied.

  • Controller broker not elected or migration leader not established.

  • Network partition between Kafka and KRaftController.

Resolution:

Check migration job status message:

kubectl get kraftmigrationjob <migration-job-name> -n <namespace> \
  -o jsonpath='{.status.message}'

Verify Kafka has migration configuration applied:

kubectl exec <kafka-pod-name> -n <namespace> -- \
  grep "controller.quorum.voters" /opt/confluentinc/etc/kafka/kafka.properties

Check if migration leader is elected:

kubectl logs <kafka-pod-name> -n <namespace> | grep -i "migration.*leader"

Verify connectivity between Kafka and KRaftController:

kubectl exec <kafka-pod-name> -n <namespace> -- nc -zv kraftcontroller-0.kraftcontroller 9093

If the migration job status shows DUAL-WRITE but Jolokia shows 0, check CFK operator logs for errors during the MIGRATE phase.

Check if migration annotation is present on Kafka CR:

kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep migration
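To poll the metric itself, a small wrapper over the broker's Jolokia endpoint can help; the port 7777 and the `zk_migration_state` helper name are assumptions to adapt to your deployment:

```shell
# Read the ZkMigrationState MBean via Jolokia and print its numeric value.
# 0 = PRE_MIGRATION, 1 = DUAL_WRITE.
zk_migration_state() {
  local pod="$1" ns="$2"
  kubectl exec "$pod" -n "$ns" -- curl -s \
    "http://localhost:7777/jolokia/read/kafka.controller:type=KafkaController,name=ZkMigrationState" \
    | grep -o '"value":[0-9]*' | cut -d: -f2
}

# Usage: zk_migration_state kafka-0 my-namespace
```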

Issue 7: Post-migration: Cannot modify Kafka CR - "resource locked" error

Symptoms:

  • After the migration completes, attempts to modify the Kafka CR fail with an error.

  • Webhook blocks changes with message about migration lock.

  • Cannot scale Kafka cluster or update configuration.

  • GitOps/FluxCD reconciliation fails.

Root Cause:

  • Migration locks not released after migration completion.

  • Missing platform.confluent.io/kraft-migration-release-cr-lock annotation.

  • Migration job still holds locks on Kafka, ZooKeeper, and KRaftController CRs.

Resolution:

Check if migration locks are still present:

kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock

Release migration locks:

kubectl annotate kraftmigrationjob <migration-job-name> \
  platform.confluent.io/kraft-migration-release-cr-lock=true \
  -n <namespace>

Verify locks are removed from Kafka CR:

kubectl get kafka <kafka-name> -n <namespace> -o yaml | grep kraft-migration-cr-lock

Expected: No output (locks removed)

Verify locks are removed from KRaftController CR:

kubectl get kraftcontroller <kraftcontroller-name> -n <namespace> -o yaml | \
  grep kraft-migration-cr-lock

Expected: No output (locks removed)

After releasing locks, you can modify the Kafka and KRaftController CRs normally. Releasing locks is a mandatory post-migration task documented in ZooKeeper to KRaft Migration.

Is my migration stuck or just slow?

Use this guide to determine if your migration is stuck or progressing normally.

Quick check commands

Run these commands in order:

  1. Check current phase:

    kubectl get kraftmigrationjob <migration-job-name> -n <namespace> \
      -o jsonpath='{.status.phase}'
    
  2. Check current subphase:

    kubectl get kraftmigrationjob <migration-job-name> -n <namespace> \
      -o jsonpath='{.status.subPhase}'
    
  3. Check how long in current phase:

    kubectl get kraftmigrationjob <migration-job-name> -n <namespace> -o yaml | \
      grep lastTransitionTime
    
  4. Check for errors:

    kubectl logs deployment/confluent-operator -n <namespace> --tail=50 | \
      grep -i error
    
  5. Check pod status:

    kubectl get pods -n <namespace>
    # Look for: CrashLoopBackOff, Error, or Pending (for >10 min)
    

Decision criteria

Migration is progressing:

  • Pods are rolling or restarting one at a time

  • No ERROR messages in CFK logs

  • Phase progression: SETUP, MIGRATE, DUAL-WRITE

  • Total time in current phase is less than 45 minutes

Migration is slow but OK:

  • Kafka or KRaftController pod is restarting

  • CFK logs show “waiting for roll complete”

  • No errors in logs

This is normal during broker rolls. Be patient.

Migration is likely stuck:

  • Pods in CrashLoopBackOff

  • ERROR messages in CFK operator logs

  • Kafka or KRaftController pods not Running

  • Phase is FAILURE
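The criteria above can be condensed into a quick triage helper for scripts; this is an illustrative sketch (`triage_migration` is a hypothetical name, and the 45-minute threshold mirrors the guidance above):

```shell
# Classify a migration as stuck, slow, or ok from the job phase, the minutes
# spent in the current phase, and whether the CFK logs contain errors (yes|no).
triage_migration() {
  local phase="$1" minutes="$2" has_errors="$3"
  if [ "$phase" = "FAILURE" ] || [ "$has_errors" = "yes" ]; then
    echo stuck
  elif [ "$minutes" -gt 45 ]; then
    echo slow
  else
    echo ok
  fi
}

# Example: triage_migration MIGRATE 50 no prints slow
```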

Detailed phase expectations

Phase/Subphase                           What's Happening                      Action
SETUP                                    Kafka rolls, KRaftController starts   Wait, monitor pods
SubPhaseSetupEnsureIBPUpgradeComplete    Waiting for Kafka roll                Watch: kubectl get pods
SubPhaseSetupKRaftControllerHealthy      KRaftController pods starting         Wait for KRaftController Running
MIGRATE                                  Metadata copy, Kafka roll             Wait, monitor logs
SubPhaseMigrateEnsureKafkaRollComplete   Waiting for Kafka roll                Watch: kubectl get pods
DUAL-WRITE                               Waiting for you                       No action needed

What to do if stuck

If you think your migration is stuck:

Collect information:

# Save migration job status
kubectl get kraftmigrationjob <migration-job-name> -n <namespace> -o yaml > kmj-status.yaml

# Save CFK logs
kubectl logs deployment/confluent-operator -n <namespace> > cfk-operator.log

# Save pod status
kubectl get pods -n <namespace> -o yaml > all-pods.yaml

# Save events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > events.log

Check pod logs if any pod is not Running:

kubectl logs <stuck-pod> -n <namespace>
kubectl describe pod <stuck-pod> -n <namespace>

Contact Confluent Support:

  • Provide all collected files

  • Specify: CFK version, Confluent Platform version, which phase or subphase is stuck

  • Include timeline (when started, how long stuck)

Prevention tips

To avoid getting stuck:

  • Verify all prerequisites before starting

  • Ensure nodes have sufficient resources (CPU, memory, disk)

  • Test in non-production first

  • Use correct IBP version for your Confluent Platform version

  • Apply Log4j1 annotation if using CFK 3.0 or later

  • Have CFK operator logs accessible for monitoring