Migrate Confluent Platform from ZooKeeper to KRaft using Ansible Playbooks

Ansible Playbooks for Confluent Platform (Confluent Ansible) supports migration from a ZooKeeper-based Confluent Platform deployment to a KRaft-based deployment.

To safely migrate your hosts and to achieve zero downtime, Confluent Ansible performs rolling upgrades host by host, shutting down the component, upgrading packages, restarting the service, and validating service health before moving on to the next one.

Requirements and considerations

  • Confluent Ansible supports migration only within the same Confluent Platform version, and that version must be 7.6 or later.

    Note that migrating Confluent Platform 7.6.0 clusters is not recommended for production environments.

  • Upgrade Confluent Platform before you run the migration.

    You cannot upgrade the Confluent Platform version and migrate from ZooKeeper to KRaft at the same time.

  • You can migrate from ZooKeeper to KRaft in isolated mode, where controllers run with process.roles=controller and brokers run with process.roles=broker.

    You cannot migrate to combined mode, where the KRaft controller and the broker run in the same process (process.roles=controller,broker).

  • You can migrate co-located clusters where ZooKeeper and brokers are on the same node.

  • You can migrate a ZooKeeper cluster to KRaft controllers that run on the same nodes.

    Beware of port collisions if co-locating components on the same host. If ZooKeeper and KRaft Controller are co-located, use the variables kafka_controller_jolokia_port and kafka_controller_jmxexporter_port to define different ports for ZooKeeper and KRaft. For example, kafka_controller_jolokia_port: 7777 and kafka_controller_jmxexporter_port: 8081.

  • ACLs are migrated from ZooKeeper to KRaft.

  • Confluent Ansible supports one-to-many or many-to-one mapping of ZooKeeper to KRaft controllers where the number of ZooKeeper nodes differs from the number of controller nodes.

  • Confluent Ansible supports migration with the same cluster configurations.

    Different security protocols on the ZooKeeper cluster and the KRaft cluster are not recommended in migration.
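
As a sketch of the co-located case described above, the following inventory fragment places a KRaft controller on the same host as ZooKeeper and moves the controller's Jolokia and JMX Exporter ports to the example values from the note on port collisions. The host name is a placeholder, and placing the variables under the kafka_controller group is an assumption about your inventory layout.

```yaml
# Illustrative fragment: ZooKeeper and the KRaft controller share one host.
kafka_controller:
  vars:
    # Example values from the note above; pick ports that are free on the host.
    kafka_controller_jolokia_port: 7777
    kafka_controller_jmxexporter_port: 8081
  hosts:
    ip2.us-west-2.compute.amazonaws.com:   # placeholder, same host as zookeeper

zookeeper:
  hosts:
    ip2.us-west-2.compute.amazonaws.com:   # placeholder
```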

Migrate to KRaft

To migrate a ZooKeeper-based Confluent Platform to KRaft:

  1. Prepare the inventory file. Use the same inventory file you used for ZooKeeper cluster setup.

    1. Enable the migration flag:

      all:
        vars:
          kraft_migration: true
      
    2. Add the kafka_controller host to your inventory file. For example:

      kafka_controller:
        hosts:
          ip1.us-west-2.compute.amazonaws.com:
      
      zookeeper:
        hosts:
          ip2.us-west-2.compute.amazonaws.com:
      
      kafka_broker:
        hosts:
          ip3.us-west-2.compute.amazonaws.com:
      
    3. If migrating a cluster that is configured with SASL/SCRAM, set the controller-to-controller authentication method using the kafka_controller_sasl_protocol variable in the hosts.yml inventory file.

      For example:

      all:
        vars:
          sasl_protocol: scram
      
      kafka_controller:
        vars:
          kafka_controller_sasl_protocol: plain,scram
      

      For details and example usage of the variable, see Configure SASL/SCRAM authentication.

    4. If migrating a cluster with Cluster Linking configured, you must set the password.encoder.secret and password.encoder.old.secret properties on the KRaft controller during migration. Use the same password.encoder.secret value and, if it is set for Kafka, the same password.encoder.old.secret value that the Kafka brokers use, and specify them in the KRaft custom properties section (kafka_controller_custom_properties:) of the inventory file.

      all:
        vars:
          kafka_controller_custom_properties:
            password.encoder.secret: <encoder-secret>
            password.encoder.old.secret: <encoder-old-secret>
      

      After migration, you can remove these properties.

      If you do not set these properties during the migration, Cluster Linking stops working after the migration to KRaft, and you must re-create the cluster links.

  2. Run the migration playbook.

    You have two options: migrate in two steps or migrate in one step.

    • Migrate in two steps with validation in between

      This is the recommended approach because you can roll back only while the cluster is in the Dual Write mode. This workflow lets you pause and verify that the migration has completed before you finalize the cluster in KRaft mode.

      To migrate in two steps:

      1. Phase 1: Migrate to the Dual Write mode.

        ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml \
          --tags migrate_to_dual_write
        
      2. Validation phase:

        1. Check and verify that all data has been migrated without any loss.

        2. Update the ACL authorizer.

          If using ACLs, change the authorizer from AclAuthorizer used for ZooKeeper to StandardAuthorizer used for KRaft.

          The authorizer must be updated in both the KRaft controller and broker.

          kafka_broker_custom_properties:
            authorizer.class.name: org.apache.kafka.metadata.authorizer.StandardAuthorizer
          
          kafka_controller_custom_properties:
            authorizer.class.name: org.apache.kafka.metadata.authorizer.StandardAuthorizer
          
      3. Phase 2: Complete the migration to the KRaft mode.

        ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml \
          --tags migrate_to_kraft
        
    • Migrate in one step

      If you want to migrate in one step without pausing at the Dual Write mode, run the following command without tags.

      ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml
      
  3. Once the cluster is running in KRaft mode, you can stop ZooKeeper, unless the same ZooKeeper ensemble also manages other Kafka clusters.

  4. Remove the ZooKeeper section and the migration flag (set in the first step above) from your inventory file.
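
After this cleanup, the inventory from the earlier example might look like the following sketch. The host names are the placeholders used in the steps above, and any other variables in your own inventory stay as they are.

```yaml
# Illustrative inventory after migration:
# the zookeeper group and the kraft_migration flag are gone.
kafka_controller:
  hosts:
    ip1.us-west-2.compute.amazonaws.com:

kafka_broker:
  hosts:
    ip3.us-west-2.compute.amazonaws.com:
```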

Roll back to ZooKeeper

If the migration runs into problems, you can roll back to ZooKeeper. While the cluster is in a dual-write state, the controller makes dual writes to KRaft and ZooKeeper, so the data in ZooKeeper stays consistent with the KRaft metadata log and it is still possible to revert.

Important

You can roll back only before the cluster is finalized into KRaft mode. After you take the controllers out of migration mode and restart them in KRaft mode (the FINALIZED state), rollback is no longer possible.

Confluent Ansible automates the rollback through the same confluent.platform.ZKtoKraftMigration.yml playbook used for the migration, driven by CLI tags. The playbook automatically detects the current migration state (PREMIGRATION, HYBRID_DUAL_WRITE, PURE_DUAL_WRITE, or FINALIZED) and runs only the stages required for that state. If the cluster is already in the FINALIZED state, the playbook stops before making any changes.

Use one of the following tags depending on how far you want to roll back.

To revert the brokers from KRaft to hybrid (dual-write) mode while the KRaft controllers stay active, use the rollback_to_hybrid tag. Use this tag when the cluster is in the PURE_DUAL_WRITE state and you want a partial rollback to the hybrid mode:

ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml \
  --tags rollback_to_hybrid

To fully revert the cluster to pure ZooKeeper (the pre-migration state), use the rollback_to_premigration tag. Use this tag from any pre-FINALIZED state when you want to completely roll back the migration:

ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml \
  --tags rollback_to_premigration

Roll back manually

If you prefer to roll back to ZooKeeper manually, complete the following steps while the cluster is still in a dual-write state:

  1. For each KRaft broker:

    1. Take the broker down.

    2. Restart the broker in ZooKeeper mode.

  2. Perform a clean shutdown of the KRaft controller quorum.

    A clean shutdown of the KRaft quorum is important because there may be uncommitted metadata waiting to be written to ZooKeeper. A forceful shutdown could cause some metadata to be lost.

  3. Using the ZooKeeper shell, delete the /controller node so that a ZooKeeper-based broker can become the next controller.
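
Steps 2 and 3 above could be scripted along the following lines. This is only a sketch, not part of Confluent Ansible, and it rests on several assumptions: the controller runs as a systemd service named confluent-kcontroller, the zookeeper-shell command is on the PATH of the ZooKeeper hosts, and ZooKeeper listens on localhost:2181 without TLS or authentication. Step 1 (restarting each broker in ZooKeeper mode) depends on your broker configuration and is not covered here.

```yaml
# Illustrative playbook for the manual rollback, steps 2 and 3 only.
- name: Shut down the KRaft controller quorum cleanly
  hosts: kafka_controller
  become: true
  gather_facts: false
  tasks:
    - name: Stop the controller service (a regular stop, not a kill)
      ansible.builtin.systemd:
        name: confluent-kcontroller   # assumption: service name on your hosts
        state: stopped

- name: Remove the /controller znode
  hosts: zookeeper
  run_once: true                      # one ZooKeeper host is enough
  gather_facts: false
  tasks:
    - name: Delete /controller with the ZooKeeper shell
      ansible.builtin.command:
        # assumption: plaintext ZooKeeper on the default client port
        cmd: zookeeper-shell localhost:2181 delete /controller
```

Run the plays only after every broker is back in ZooKeeper mode, so that the order of the manual steps is preserved.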

Troubleshoot migration issues

This section describes a few of the potential issues you might encounter while migrating ZooKeeper to KRaft and presents the steps to troubleshoot the issues.

Migration failed with the error:
{"attempts": 10, "cache_control": "no-cache", "changed": false,
"content_type": "text/plain; charset=utf-8", "cookies": {}, "cookies_string":
"", "date": "Fri, 19 Jan 2024 11:12:14 GMT", "elapsed": 0, "expires": "Fri,
19 Jan 2024 10:12:14 GMT", "json": {"request": {"mbean":
"kafka.controller:name=ZkMigrationState,type=KafkaController", "type":
"read"}, "status": 200, "timestamp": 1705662734, "value": {"Value": 2}},
"msg": "OK (unknown bytes)", "pragma": "no-cache", "redirected": false,
"status": 200, "transfer_encoding": "chunked", "url":
"https://localhost:7770/jolokia/read/kafka.controller:type=KafkaController,name=ZkMigrationState"}

Solution: Increase the metadata_migration_retries value. Due to the size of the cluster, it might be taking more time to migrate than expected.
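
A minimal sketch of raising the retry count in the inventory follows. The value 60 is only illustrative, and setting the variable under all is an assumption; choose a value that fits the size of your cluster.

```yaml
all:
  vars:
    kraft_migration: true
    # Illustrative value: allow more polling attempts for the metadata migration.
    metadata_migration_retries: 60
```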

Migration failed with the error:
{"msg": "The conditional check '( jolokia_output.content | from_json
).value.Value == 1' failed. The error was: Expecting value: line 1 column 1
(char 0)"}

One of the following can cause the error:

  • The KRaft controller has failed. For details of the failure, see server.log of the KRaft controller.

  • Jolokia is disabled in the KRaft controller.

Solution: Enable Jolokia in the KRaft controller if needed. Or review and address the issue in server.log.
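
To tell the two causes apart, it can help to query the Jolokia endpoint directly and look at the raw response. The playbook below is a hypothetical diagnostic sketch, not part of Confluent Ansible. It reuses the URL from the first error output in this section, and it assumes an HTTPS Jolokia endpoint on port 7770 with a certificate that is not validated and no authentication; adjust these to your setup.

```yaml
# Illustrative diagnostic playbook.
- name: Probe the KRaft controller Jolokia endpoint
  hosts: kafka_controller
  gather_facts: false
  tasks:
    - name: Read the ZkMigrationState MBean
      ansible.builtin.uri:
        url: "https://localhost:7770/jolokia/read/kafka.controller:type=KafkaController,name=ZkMigrationState"
        validate_certs: false   # assumption: certificate is not trusted by the host
        return_content: true
      register: jolokia_output
      ignore_errors: true       # keep going so the result can be printed

    - name: Show what Jolokia returned
      ansible.builtin.debug:
        var: jolokia_output
```

A connection failure or an empty content field suggests that Jolokia is not reachable or the controller is down, in which case server.log is the next place to look.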

Migration failed with the error:
{"msg": "The conditional check '( jolokia_output.content | from_json
).value.Value == 1' failed. The error was: error while evaluating
conditional (( jolokia_output.content | from_json ).value.Value == 1):
'dict object' has no attribute 'value'"}

This error occurs when the Confluent Ansible playbooks use an older version of Confluent Platform, with confluent_package_version set to 7.5 or earlier.

Solution: Use Confluent Platform 7.6 or later.
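
For example, the package version could be pinned in the inventory as sketched below. The version number is illustrative, and placing the variable under all is an assumption. Because the upgrade and the migration cannot run together, the cluster must already be upgraded to this version before you run the migration playbook.

```yaml
all:
  vars:
    # Illustrative: any supported release at 7.6 or later.
    confluent_package_version: 7.6.1
```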

Migration failed with an authorization error on an RBAC cluster.

When migrating an RBAC cluster, the KRaft controller and the Kafka broker need to share their principals.

Solution: Add the Kafka and KRaft controller principals in the super user variables for both the Kafka broker and the KRaft controller.

Migration failed with the following error when you use mTLS with RBAC and a custom user:
ERROR Exiting Kafka due to fatal exception (kafka.Kafka$)
java.nio.file.AccessDeniedException: /etc/controller/server.properties
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)

Workaround: Define the kafka_controller_user and kafka_controller_group variables under all and kafka_broker in your inventory file if they are currently defined in the kafka_controller section only.

all:
  vars:
    kafka_controller_user: <user>
    kafka_controller_group: <group>

kafka_broker:
  vars:
    kafka_controller_user: <user>
    kafka_controller_group: <group>