Migrate Confluent Platform from ZooKeeper to KRaft using Ansible Playbooks

Ansible Playbooks for Confluent Platform (Confluent Ansible) supports migration from a ZooKeeper-based Confluent Platform deployment to a KRaft-based deployment.

To migrate your hosts safely and with zero downtime, Confluent Ansible performs rolling upgrades one host at a time: it shuts down the component, upgrades packages, restarts the service, and validates service health before moving on to the next host.

Requirements and considerations

  • Confluent Ansible supports migration only within the same Confluent Platform version, and that version must be 7.6 or later.

    Note that migrating Confluent Platform 7.6.0 clusters is not recommended for production environments.

  • Upgrade Confluent Platform before running the migration.

    You cannot upgrade the Confluent Platform version and migrate ZooKeeper to KRaft at the same time.

  • You can migrate from ZooKeeper to KRaft in isolated mode, where the controller has process.roles=controller and the broker has process.roles=broker.

    You cannot migrate to combined mode, where the KRaft controller and the broker run in the same process (process.roles=controller,broker).

  • You can migrate co-located clusters where ZooKeeper and brokers are on the same node.

  • You can migrate a ZooKeeper cluster to KRaft running on the same node.

    Beware of port collisions if co-locating components on the same host. If ZooKeeper and KRaft Controller are co-located, use the variables kafka_controller_jolokia_port and kafka_controller_jmxexporter_port to define different ports for ZooKeeper and KRaft. For example, kafka_controller_jolokia_port: 7777 and kafka_controller_jmxexporter_port: 8081.

  • ACLs are migrated from ZooKeeper to KRaft.

  • Confluent Ansible supports one-to-many and many-to-one mapping of ZooKeeper nodes to KRaft controllers, so the number of ZooKeeper nodes can differ from the number of controller nodes.

  • You cannot enable ZooKeeper migration when multiple log directories (JBOD) are in use by the brokers.

  • Confluent Ansible supports migration with the same cluster configurations.

    • Using different security protocols on the ZooKeeper cluster and the KRaft cluster is not recommended during migration.
    • After migration, you cannot change the number of KRaft controller nodes, or the IP addresses, hostnames, or ports of the controllers.
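
As a minimal sketch of the port settings mentioned above for co-located ZooKeeper and KRaft controllers, the two variables could be set in the inventory as follows. Placing them under the kafka_controller group's vars is an assumption; the port values are the ones from the example in the list above.

```yaml
kafka_controller:
  vars:
    # Example ports from the text above. Pick values that do not collide
    # with the Jolokia and JMX Exporter ports ZooKeeper uses on the same host.
    kafka_controller_jolokia_port: 7777
    kafka_controller_jmxexporter_port: 8081
```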

Migrate to KRaft

To migrate a ZooKeeper-based Confluent Platform to KRaft:

  1. Enable the migration flag in the same inventory file you used for ZooKeeper cluster setup:

    all:
      vars:
        kraft_migration: true
    
  2. Add the kafka_controller host to your inventory file. For example:

    kafka_controller:
      hosts:
        ip1.us-west-2.compute.amazonaws.com:
    
    zookeeper:
      hosts:
        ip2.us-west-2.compute.amazonaws.com:
    
    kafka_broker:
      hosts:
        ip3.us-west-2.compute.amazonaws.com:
    
  3. If migrating a cluster that is configured with SASL/SCRAM, set the controller-to-controller authentication method using the kafka_controller_sasl_protocol variable in the hosts.yml inventory file.

    For example:

    all:
      vars:
        sasl_protocol: scram
    
    kafka_controller:
      vars:
        kafka_controller_sasl_protocol: plain,scram
    

    For details and example usage of the variable, see Configure SASL/SCRAM authentication.

  4. Run the migration playbook. You have two options:

    • Migrate in two steps with validation in between

      This is the recommended approach because rollback is possible only while the cluster is in Dual Write mode. This workflow lets you pause and verify that the migration has completed before you finalize the move to KRaft mode.

      Step 1. Migrate to the Dual Write mode:

      ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml \
        --tags migrate_to_dual_write
      

      Validation step. Check and verify that all data has been migrated without any loss.

      Step 2. Complete the migration to the KRaft mode:

      ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml \
        --tags migrate_to_kraft
      
    • Migrate in one step

      If you want to migrate in one step without pausing at the Dual Write mode, run the following command without tags.

      ansible-playbook -i <inventory-file> confluent.platform.ZKtoKraftMigration.yml
      
  5. Once the cluster is running in KRaft mode, you can stop ZooKeeper, provided that it is not managing other Kafka clusters.

  6. Remove the ZooKeeper section and the migration flag (set in the first step above) from your inventory file.
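
After step 6, the example inventory from step 2 might be reduced to something like the following sketch. The host names are the placeholders used in that step, and any other variables in your inventory are assumed to stay unchanged.

```yaml
# The kraft_migration flag has been removed from all.vars,
# and the zookeeper group has been removed entirely.
kafka_controller:
  hosts:
    ip1.us-west-2.compute.amazonaws.com:

kafka_broker:
  hosts:
    ip3.us-west-2.compute.amazonaws.com:
```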

Roll back to ZooKeeper

If the migration fails, you can roll back to the ZooKeeper cluster at any point in the migration process prior to taking the KRaft controllers out of the migration mode. Up to that point, the controller makes dual writes to KRaft and ZooKeeper. Since the data in ZooKeeper is still consistent with that of the KRaft metadata log, it is still possible to revert to ZooKeeper.

Once you take the controller out of the migration mode and restart in KRaft mode, you can no longer roll back to ZooKeeper mode.

To roll back to ZooKeeper:

  1. For each KRaft broker:

    1. Take the broker down.
    2. Remove the __cluster_metadata directory on the broker.
    3. Restart the broker in ZooKeeper mode.
  2. Perform a clean shutdown of the KRaft controller quorum.

    A clean shutdown of the KRaft quorum is important because there may be uncommitted metadata waiting to be written to ZooKeeper. A forceful shutdown could cause some metadata to be lost.

  3. Using the ZooKeeper shell, delete the /controller and /controller_epoch nodes so that a ZooKeeper-based broker can become the next controller.

Troubleshoot migration issues

This section describes a few of the potential issues you might encounter while migrating ZooKeeper to KRaft and presents the steps to troubleshoot the issues.

Migration failed with the error:
{"attempts": 10, "cache_control": "no-cache", "changed": false,
"content_type": "text/plain; charset=utf-8", "cookies": {}, "cookies_string":
"", "date": "Fri, 19 Jan 2024 11:12:14 GMT", "elapsed": 0, "expires": "Fri,
19 Jan 2024 10:12:14 GMT", "json": {"request": {"mbean":
"kafka.controller:name=ZkMigrationState,type=KafkaController", "type":
"read"}, "status": 200, "timestamp": 1705662734, "value": {"Value": 2}},
"msg": "OK (unknown bytes)", "pragma": "no-cache", "redirected": false,
"status": 200, "transfer_encoding": "chunked", "url":
"https://localhost:7770/jolokia/read/kafka.controller:type=KafkaController,name=ZkMigrationState"}

Solution: Increase the metadata_migration_retries value. On a large cluster, the migration can take longer than the playbook waits by default.
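
A minimal sketch of raising the retry count in the inventory is shown below. The value 60 is purely illustrative, and setting the variable under all.vars next to the migration flag is an assumption; choose a value that suits the size of your cluster.

```yaml
all:
  vars:
    kraft_migration: true
    # Illustrative value only; raise it above your current setting.
    metadata_migration_retries: 60
```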

Migration failed with the error:
{"msg": "The conditional check '( jolokia_output.content | from_json
).value.Value == 1' failed. The error was: Expecting value: line 1 column 1
(char 0)"}

One of the following can cause the error:

  • The KRaft controller has failed. For details of the failure, see server.log of the KRaft controller.
  • Jolokia is disabled in the KRaft controller.

Solution: Review and address the issue reported in server.log, or enable Jolokia in the KRaft controller if it is disabled.
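
If Jolokia turns out to be disabled, a sketch like the following could re-enable it through the inventory. The jolokia_enabled variable name is an assumption here and is not taken from this page; verify it against the variables reference for your Confluent Ansible version before relying on it.

```yaml
all:
  vars:
    # Assumed variable name; confirm it in your Confluent Ansible version.
    jolokia_enabled: true
```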

Migration failed with the error:
{"msg": "The conditional check '( jolokia_output.content | from_json
).value.Value == 1' failed. The error was: error while evaluating
conditional (( jolokia_output.content | from_json ).value.Value == 1):
'dict object' has no attribute 'value'"}

This error occurs when the Confluent Ansible playbooks use an older version of Confluent Platform, with confluent_package_version set to 7.5 or earlier.

Solution: Use Confluent Platform 7.6 or later.
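
For illustration, the version could be pinned in the inventory as sketched below. The specific version 7.6.1 and the placement under all.vars are assumptions; remember that the cluster must already be upgraded to that version before you run the migration.

```yaml
all:
  vars:
    # Example only; any supported version of 7.6 or later applies.
    confluent_package_version: 7.6.1
```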

Migration failed with an authorization error on an RBAC cluster.

When migrating an RBAC cluster, the principal for the controller should be a super user on the broker, and the broker’s principal should be a super user on the KRaft controller.

Solution: Make the controller's principal a super user on the brokers, and make the broker's principal a super user on the KRaft controllers.