Troubleshoot Ansible Playbooks for Confluent Platform

Complete the following steps if Ansible Playbooks for Confluent Platform (Confluent Ansible) fails:

  1. Review the error log output from Ansible itself.

    The output shows the type of failure that occurred and might indicate a misconfiguration in your inventory file. For example, if you set a file path variable to an invalid path, the log reports “Could not find or access” followed by the file path. Correct the variable and rerun the installation.

  2. Review your inventory file.

    Validate that all variables are set correctly, with proper spacing in the inventory file. You can review hosts_example.yml and the sample_inventories directory for examples.

  3. Review component log and property files.

    If a component health check fails, the playbook will fetch log and property files back to the Ansible Control Node inside the error_files/ directory. These error logs can indicate which properties are misconfigured.

    • If Confluent Ansible was downloaded from Ansible Galaxy or Ansible Automation Hub, the error_files/ directory is under ~/.ansible/collections/ansible_collections/confluent/platform/.
    • If Confluent Ansible was downloaded from GitHub, the error_files/ directory is located under the root of cp-ansible.
  4. If the log files do not provide a clear reason for the failure, use one of the following methods to generate more information:

    • Rerun the playbook with the --diff option and redirect the output to a file. For more information about the flag, see Ansible Playbook Options.

      This outputs the differences that the playbook makes to files and templates. With this option, sensitive information, such as passwords, certificates, and keys, is not printed in the output.

    • Rerun the playbook with the -vvv option to enable debug logging and redirect the output to a file:

      ansible-playbook -vvv -i hosts.yml confluent.platform.all > failure.txt
      

      When debug is enabled, the information in the output cannot be suppressed, including sensitive information, such as passwords, certificates, and keys. It is not recommended to use the debug mode in production environments. For details, see Logging Ansible Output.

    • Generate logs using the fetch-logs playbook.

  5. Open a support ticket with Confluent Support and provide the following within a compressed archive file:

    • Your inventory file

    • The log files generated from the -vvv or --diff option.

    • The error_files/ directory and its contents

    • The output of the following Git commands as a text file. Run the commands from the root of cp-ansible to show any changes made to the cp-ansible source code:

      git status
      
      git diff
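
Steps 2, 4, and 5 above can be scripted. The following is a minimal sketch, not part of Confluent Ansible: the function names, the default archive name support-bundle.tar.gz, and the file name git-changes.txt are illustrative assumptions. It assumes you run it from the directory that contains error_files/ and that hosts.yml is your inventory file.

```shell
# Parse the inventory without running anything; a YAML or spacing
# mistake makes ansible-inventory exit with a non-zero status.
# Usage: check_inventory INVENTORY
check_inventory() {
  ansible-inventory -i "$1" --list > /dev/null
}

# Rerun the playbook with --diff and capture stdout and stderr in a file.
# Usage: rerun_with_diff INVENTORY OUTPUT_FILE
rerun_with_diff() {
  ansible-playbook --diff -i "$1" confluent.platform.all > "$2" 2>&1
}

# Pack the inventory, the captured playbook output, error_files/ (if present),
# and local Git changes into one compressed archive for a support ticket.
# Usage: collect_support_bundle INVENTORY LOG_FILE [ARCHIVE]
collect_support_bundle() {
  inventory="$1"
  log_file="$2"
  archive="${3:-support-bundle.tar.gz}"

  set -- "$inventory" "$log_file"
  # error_files/ only exists after a component health check has failed.
  if [ -d error_files ]; then
    set -- "$@" error_files
  fi
  # Record source changes only when run inside a Git checkout of cp-ansible.
  if git rev-parse --is-inside-work-tree > /dev/null 2>&1; then
    { git status; git diff; } > git-changes.txt
    set -- "$@" git-changes.txt
  fi

  tar -czf "$archive" "$@"
}
```

For example, `check_inventory hosts.yml && rerun_with_diff hosts.yml diff-output.txt` followed by `collect_support_bundle hosts.yml diff-output.txt` produces support-bundle.tar.gz in the current directory. Review the archive before sending it if the captured output came from a -vvv run, because that output can contain sensitive values.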
      

Generate logs using the fetch-logs playbook

When troubleshooting, you might need to collect the service logs and configuration files of all Confluent components. Instead of connecting to each host over SSH and fetching the files manually, you can run the fetch_logs playbook to gather all service logs and configuration files in a single directory on the control node.

The playbook stores the gathered log files in a separate zip file for each component. The zip files are located in:

  • ~/.ansible/collections/ansible_collections/confluent/platform/playbooks/troubleshooting if Confluent Ansible was downloaded from Ansible Galaxy or Ansible Automation Hub
  • <the root of cp-ansible>/playbooks/troubleshooting if Confluent Ansible was downloaded from GitHub

To gather logs and config files of all components:

ansible-playbook -i hosts.yml confluent.platform.fetch_logs

To gather service logs and config files of a specific component, use the --tags flag with the component name. For example, to get logs and config files for Kafka brokers:

ansible-playbook -i hosts.yml confluent.platform.fetch_logs --tags 'kafka_broker'
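
After the playbook finishes, you can confirm which archives it produced. The helper below is an illustrative sketch: the function name list_fetched_logs is an assumption, and its default directory is the Ansible Galaxy or Ansible Automation Hub location listed above. Pass the playbooks/troubleshooting path under your cp-ansible root instead if you installed from GitHub.

```shell
# List the per-component zip files written by the fetch_logs playbook.
# Usage: list_fetched_logs [DIRECTORY]
list_fetched_logs() {
  dir="${1:-$HOME/.ansible/collections/ansible_collections/confluent/platform/playbooks/troubleshooting}"
  if [ ! -d "$dir" ]; then
    echo "No troubleshooting directory at $dir" >&2
    return 1
  fi
  # Only zip files directly inside the directory are reported.
  find "$dir" -maxdepth 1 -type f -name '*.zip' | sort
}
```

An empty result after a successful run suggests that you are looking at the directory for the other installation method.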

Troubleshoot known issues

Issue: An error, “Clusters not found”, returns after Kafka brokers restart

If keys are updated on Kafka brokers or on any other Confluent Platform component, communication between the component services and the brokers breaks.

Solution: Regenerate certificates when you are updating the keys, during an update or a redeployment of the cluster.

  • To regenerate certificates along with keys in your inventory file:

    regenerate_ca: true
    
  • To only update the certificates when keys have already been generated:

    regenerate_ca: true
    
    regenerate_keystore_and_truststore: false
    

Issue: Need to enable both mTLS and SASL authentication modes for ZooKeeper

Confluent Ansible does not support enabling both mTLS and SASL authentication modes for ZooKeeper out of the box.

Workaround: Configure the mTLS and SASL authentication modes using an override with the following properties in your inventory file:

zookeeper_sasl_protocol: scram
ssl_enabled: true # This will generate the needed x509 provider authProvider.x509=org.apache.zookeeper.server.auth.X509AuthenticationProvider

zookeeper_client_authentication_type: digest # This will generate the needed SASL provider authProvider.sasl=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
zookeeper_custom_properties:
    ssl.clientAuth: need  # This will force mTLS for client connecting to Zookeeper

# The following overrides are needed for the brokers to connect to Zookeeper using mTLS
kafka_broker_custom_properties:
    zookeeper.ssl.keystore.location: /var/ssl/private/kafka_broker.keystore.jks
    zookeeper.ssl.keystore.password: confluentkeystorestorepass

Issue: Python version mismatch

If you are using different versions of Python across the nodes, for example, the Ansible control node has Python 2.7 installed while the target nodes have Python 3 set as the default Python, you might see the following error:

The Python 2 bindings for rpm are needed for this module. If you require
Python 3 support use the `dnf` Ansible module instead. The Python 2 yum
module is needed for this module. If you require Python 3 support use the
`dnf` Ansible module instead.

Solution: Install the same, recommended Python version on all of your control nodes and managed nodes.
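
To see which Python versions are actually in use before you change anything, you can query the control node and the managed nodes. This is a minimal sketch under the assumption that hosts.yml is your inventory file; the function names are illustrative. It relies on the standard ansible command and the setup module's ansible_python_version fact.

```shell
# Print the Python version that Ansible itself runs under on the control node.
control_node_python() {
  ansible --version | grep -i 'python version'
}

# Print the Python version that Ansible discovers on every managed node.
# Usage: managed_node_python [INVENTORY]
managed_node_python() {
  ansible -i "${1:-hosts.yml}" all -m setup -a 'filter=ansible_python_version'
}
```

Run control_node_python first, then managed_node_python hosts.yml, and compare the reported versions across hosts.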

Issue: Unable to connect to Connect Replicator service

When installing Confluent Platform using the archive installation mode (installation_method: archive), you might encounter a java.lang.NoSuchMethodError when connecting to the port set in kafka_connect_replicator_port (default: 8083).

Solution:

  1. Modify the Replicator startup script as described in the Confluent Support knowledge base article.

  2. After you have modified the script, restart Replicator:

    systemctl restart kafka-connect-replicator.service
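
After the restart, you can check whether the service came up and whether the error recurred. The sketch below is illustrative only: the function name check_replicator is an assumption, and it assumes the error would surface in the systemd journal for the unit. If your deployment writes Replicator logs only to files, search those files instead.

```shell
# Check that Replicator is running and that the error has not recurred.
# Usage: check_replicator [SINCE]   (SINCE is a journalctl time expression)
check_replicator() {
  unit="kafka-connect-replicator.service"
  since="${1:-10 minutes ago}"

  if ! systemctl is-active --quiet "$unit"; then
    echo "$unit is not active" >&2
    return 1
  fi
  # Look for the original error in the unit's recent journal entries.
  if journalctl -u "$unit" --since "$since" --no-pager | grep -q 'NoSuchMethodError'; then
    echo "NoSuchMethodError still present in the $unit journal" >&2
    return 1
  fi
  echo "$unit is active with no NoSuchMethodError since $since"
}
```

Run check_replicator on the Replicator host a minute or two after the restart, so the service has had time to start and log.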