Migrate Confluent Platform from ZooKeeper to KRaft using Ansible Playbooks

Ansible Playbooks for Confluent Platform (Confluent Ansible) supports migration from a ZooKeeper-based Confluent Platform deployment to a KRaft-based deployment.

To migrate safely with zero downtime, Confluent Ansible works through the cluster one host at a time: it shuts down the component, upgrades packages, restarts the service, and validates service health before moving on to the next host.

Requirements and considerations

  • Confluent Ansible supports migration only within a single Confluent Platform version, and that version must be 7.6 or later.

    Note that migrating Confluent Platform 7.6.0 clusters is not recommended for production environments.

  • Upgrade Confluent Platform before you run the migration.

    You cannot upgrade the Confluent Platform version and migrate from ZooKeeper to KRaft at the same time.

  • You can migrate from ZooKeeper to KRaft in isolated mode, where each controller has process.roles=controller and each broker has process.roles=broker.

    You cannot migrate to combined mode, where the KRaft controller and the broker run in the same process (process.roles=controller,broker).

  • You can migrate colocated clusters where ZooKeeper and brokers are on the same node.

  • You can migrate a ZooKeeper cluster to KRaft controllers running on the same nodes.

    Beware of port collisions if colocating components on the same host. If ZooKeeper and KRaft Controller are colocated, use the variables kafka_controller_jolokia_port and kafka_controller_jmxexporter_port to define different ports for ZooKeeper and KRaft. For example, kafka_controller_jolokia_port: 7777 and kafka_controller_jmxexporter_port: 8081.

  • ACLs are migrated from ZooKeeper to KRaft.

  • Confluent Ansible supports one-to-many and many-to-one mapping of ZooKeeper nodes to KRaft controllers, so the number of ZooKeeper nodes can differ from the number of controller nodes.

  • You cannot enable ZooKeeper migration when multiple log directories (JBOD) are in use by the brokers.

  • Confluent Ansible supports migration only when the cluster configuration stays the same:

    • Different security protocols on the ZooKeeper cluster and the KRaft cluster are not recommended in migration.

    • After migration, you cannot change the number of KRaft controller nodes or the IP address, hostname, or ports of the controllers.
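
As an illustration of the port settings mentioned above for hosts that colocate ZooKeeper and a KRaft controller, the inventory fragment below is a minimal sketch. The variable names and port values are the ones from the example in that bullet; placing them under all: vars: is an assumption, and the ports you choose must be free on your hosts.

```yaml
all:
  vars:
    # Jolokia port for the KRaft controller. Pick a value that does not
    # collide with the port ZooKeeper already uses on the same host.
    kafka_controller_jolokia_port: 7777
    # JMX Exporter port for the KRaft controller, likewise chosen to
    # avoid the port ZooKeeper uses on the same host.
    kafka_controller_jmxexporter_port: 8081
```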

Migrate to KRaft

To migrate a ZooKeeper-based Confluent Platform to KRaft:

  1. Enable the migration flag in the same inventory file that you used to set up the ZooKeeper-based cluster:

    all:
      vars:
        kraft_migration: true
    
  2. Add the kafka_controller host to your inventory file. For example:

    kafka_controller:
      hosts:
        ip1.us-west-2.compute.amazonaws.com:
    
    zookeeper:
      hosts:
        ip2.us-west-2.compute.amazonaws.com:
    
    kafka_broker:
      hosts:
        ip3.us-west-2.compute.amazonaws.com:
    
  3. Run the migration playbook. You have two options:

    • Migrate in two steps with validation in between

      This is the recommended approach because you can roll back only while the cluster is still in Dual Write mode. This workflow lets you pause and confirm that the migration has completed before you finalize the move to KRaft.

      Step 1. Migrate to the Dual Write mode:

      ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml \
        --tags migrate_to_dual_write
      

      Step 2. Validate that all data has been migrated without any loss.

      Step 3. Complete the migration to the KRaft mode:

      ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml \
        --tags migrate_to_kraft
      
    • Migrate in one step

      To migrate in one step without pausing in Dual Write mode, run the playbook without tags:

      ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml
      
  4. Once the cluster is running in KRaft mode, you can stop ZooKeeper, provided that it is not also managing other Kafka clusters.

  5. Remove the ZooKeeper section and the migration flag (set in the first step above) from your inventory file.
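
After the cleanup in step 5, the example inventory from step 2 would reduce to something like the following sketch. The host names are the placeholders used earlier, and any other variables your inventory defines are assumed to stay unchanged.

```yaml
# The kraft_migration flag and the zookeeper group have been removed.
kafka_controller:
  hosts:
    ip1.us-west-2.compute.amazonaws.com:

kafka_broker:
  hosts:
    ip3.us-west-2.compute.amazonaws.com:
```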

Roll back to ZooKeeper

If the migration runs into problems, you can roll back to ZooKeeper mode.

Important

You can roll back only before the cluster is finalized into KRaft mode. After you take the controllers out of migration mode and restart them in KRaft mode (the FINALIZED state), rollback is no longer possible.

Use the following tags depending on how far you want to roll back:

  • To revert the brokers from KRaft to HYBRID_DUAL_WRITE mode while the KRaft controllers stay active, use the rollback_to_hybrid tag. Use this tag when the cluster is in the PURE_DUAL_WRITE state and you want a partial rollback to the hybrid mode:

    ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml \
      --tags rollback_to_hybrid
    
  • To fully revert the cluster to pure ZooKeeper mode, use the rollback_to_premigration tag:

    ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml \
      --tags rollback_to_premigration
    

Roll back manually

If you prefer to roll back to ZooKeeper manually, complete the following steps while the cluster is still in a dual-write state:

  1. For each KRaft broker:

    1. Shut down the broker.

    2. Remove the __cluster_metadata directory on the broker.

    3. Restart the broker in ZooKeeper mode.

  2. Perform a clean shutdown of the KRaft controller quorum.

    A clean shutdown of the KRaft quorum is important because there may be uncommitted metadata waiting to be written to ZooKeeper. A forceful shutdown could cause some metadata to be lost.

  3. Using the ZooKeeper shell, delete the /controller node so that a ZooKeeper-based broker can become the next controller.

Troubleshoot migration issues

This section describes some of the issues you might encounter while migrating from ZooKeeper to KRaft and the steps to troubleshoot them.

Migration failed with the error:
{"attempts": 10, "cache_control": "no-cache", "changed": false,
"content_type": "text/plain; charset=utf-8", "cookies": {}, "cookies_string":
"", "date": "Fri, 19 Jan 2024 11:12:14 GMT", "elapsed": 0, "expires": "Fri,
19 Jan 2024 10:12:14 GMT", "json": {"request": {"mbean":
"kafka.controller:name=ZkMigrationState,type=KafkaController", "type":
"read"}, "status": 200, "timestamp": 1705662734, "value": {"Value": 2}},
"msg": "OK (unknown bytes)", "pragma": "no-cache", "redirected": false,
"status": 200, "transfer_encoding": "chunked", "url":
"https://localhost:7770/jolokia/read/kafka.controller:type=KafkaController,name=ZkMigrationState"}

Solution: Increase the metadata_migration_retries value. On a large cluster, the migration can take longer than the configured number of retries allows.
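
A minimal sketch of raising the retry count in the inventory follows. The variable name comes from the solution above; the value 60 is purely illustrative rather than a recommended setting, and placing it under all: vars: is an assumption.

```yaml
all:
  vars:
    kraft_migration: true
    # Let the migration state check retry more times before giving up.
    # 60 is an illustrative value; tune it to the size of your cluster.
    metadata_migration_retries: 60
```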

Migration failed with the error:
{"msg": "The conditional check '( jolokia_output.content | from_json
).value.Value == 1' failed. The error was: Expecting value: line 1 column 1
(char 0)"}

One of the following can cause the error:

  • The KRaft controller has failed. For details of the failure, see the server.log of the KRaft controller.

  • Jolokia is disabled on the KRaft controller.

Solution: Enable Jolokia on the KRaft controller if it is disabled. Otherwise, review server.log and address the failure it reports.
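
To tell the two causes apart, one option is to query the controller's Jolokia endpoint directly. The playbook below is a hypothetical sketch and not part of Confluent Ansible. The URL is copied from the first error message in this section, so it assumes Jolokia listens on port 7770 over HTTPS without authentication. Adjust the scheme, port, and credentials to match your deployment.

```yaml
- name: Probe the KRaft controller's Jolokia endpoint (illustrative)
  hosts: kafka_controller
  gather_facts: false
  tasks:
    - name: Read the ZkMigrationState MBean
      ansible.builtin.uri:
        # URL taken from the error output above; scheme and port are assumptions.
        url: "https://localhost:7770/jolokia/read/kafka.controller:type=KafkaController,name=ZkMigrationState"
        validate_certs: false  # acceptable for a quick diagnostic only
        return_content: true
      register: jolokia_output

    - name: Show the raw response body
      ansible.builtin.debug:
        var: jolokia_output.content
```

A refused connection or an empty body suggests that Jolokia is disabled or the controller is down. A JSON body containing value.Value indicates that the endpoint itself is responding.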

Migration failed with the error:
{"msg": "The conditional check '( jolokia_output.content | from_json
).value.Value == 1' failed. The error was: error while evaluating
conditional (( jolokia_output.content | from_json ).value.Value == 1):
'dict object' has no attribute 'value'"}

This error occurs when the Confluent Ansible playbooks are run against an older version of Confluent Platform, with confluent_package_version set to 7.5 or earlier.

Solution: Use Confluent Platform 7.6 or later.

Migration failed with an authorization error on an RBAC cluster.

When you migrate an RBAC cluster, the controller's principal must be a super user on the broker, and the broker's principal must be a super user on the KRaft controller.

Solution: For the ZooKeeper-based brokers, use principals that are super users on the KRaft controller.