Troubleshoot Ansible Playbooks for Confluent Platform

Complete the following steps if Ansible Playbooks for Confluent Platform (Confluent Ansible) fails:

Review the error log output from Ansible itself.

The output shows the type of failure that occurred and might indicate a misconfiguration in your inventory file. For example, if you set a file path variable to an invalid path, the log reports “Could not find or access” followed by the file path. Correct the variable and rerun the installation.
Review your inventory file.

Verify that all variables in the inventory file are set correctly and use correct indentation and spacing. For examples, see hosts_example.yml and the sample_inventories directory.
Review component log and property files.

If a component health check fails, the playbook will fetch log and property files back to the Ansible Control Node inside the error_files/ directory. These error logs can indicate which properties are misconfigured.
- If Confluent Ansible was downloaded from Ansible Galaxy or Ansible Automation Hub, the error_files/ directory is under ~/.ansible/collections/ansible_collections/confluent/platform/.
- If Confluent Ansible was downloaded from GitHub, the error_files/ directory is located under the root of cp-ansible.
If the log files do not provide a clear reason for the failure, use one of the following methods to generate more information:
- Rerun the playbook with the --diff option and redirect the output to a file. For more information about the flag, see Ansible Playbook Options.
  
  This outputs the differences in the playbook files and templates. With this option, sensitive information, such as passwords, certificates, and keys, is not printed in the output.
- Rerun the playbook with the -vvv option to enable debug logging and redirect the output to a file:
```
ansible-playbook -vvv -i hosts.yml confluent.platform.all > failure.txt
```
  When debug logging is enabled, nothing in the output can be suppressed, so sensitive information such as passwords, certificates, and keys is printed. Avoid using debug mode in production environments. For details, see Logging Ansible Output.
- Generate logs using the fetch-logs playbook.
Open a support ticket with Confluent Support and provide the following within a compressed archive file:
- Your inventory file
- The log files generated with the -vvv or --diff option
- The error_files/ directory and its contents
- The output of the following Git commands as a text file. Run the commands from the root of cp-ansible to show any changes made to the cp-ansible source code:
```
git status
```
```
git diff
```
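
The items listed above can be collected into one archive with a small shell helper. The following is a minimal sketch, not part of Confluent Ansible: the function name `make_support_bundle`, its arguments, and the output file names are assumptions chosen for illustration.

```shell
# Sketch: collect the support-ticket items into one compressed archive.
# make_support_bundle, its arguments, and the file names it writes are
# illustrative assumptions, not part of Confluent Ansible.
make_support_bundle() {
  repo_root="$1"     # directory that contains error_files/
  inventory="$2"     # path to your inventory file
  playbook_log="$3"  # output saved from the -vvv or --diff run
  out="${4:-support_bundle.tar.gz}"

  staging="$(mktemp -d)" || return 1
  cp "$inventory" "$playbook_log" "$staging"/ || return 1

  # error_files/ exists only if a component health check failed.
  if [ -d "$repo_root/error_files" ]; then
    cp -R "$repo_root/error_files" "$staging"/
  fi

  # Record local changes to the cp-ansible source. If repo_root is not a
  # Git checkout, the text files capture Git's error message instead.
  git -C "$repo_root" status > "$staging/git_status.txt" 2>&1 || true
  git -C "$repo_root" diff > "$staging/git_diff.txt" 2>&1 || true

  tar -czf "$out" -C "$staging" . && rm -rf "$staging"
}
```

For example, `make_support_bundle ~/cp-ansible hosts.yml failure.txt` would write `support_bundle.tar.gz` in the current directory. Review the archive for sensitive values before attaching it, especially if it includes -vvv output.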

Generate logs using the fetch-logs playbook

When troubleshooting, you might need to collect the service logs and config files of all Confluent components. Instead of using SSH to log in to each host and fetch the files, you can run the fetch_logs playbook to collect all service logs and config files in a single directory on the control node.

The playbook stores the gathered log files in a separate zip file for each component. The zip files are located in:

- ~/.ansible/collections/ansible_collections/confluent/platform/playbooks/troubleshooting, if Confluent Ansible was downloaded from Ansible Galaxy or Ansible Automation Hub
- <the root of cp-ansible>/playbooks/troubleshooting, if Confluent Ansible was downloaded from GitHub

To gather logs and config files of all components:

ansible-playbook -i hosts.yml confluent.platform.fetch_logs

To gather service logs and config files of a specific component, use the --tags flag with the component name. For example, to get logs and config files for Kafka:

ansible-playbook -i hosts.yml confluent.platform.fetch_logs --tags 'kafka_broker'
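
Because the output directory depends on how Confluent Ansible was installed, a small helper can pick the right one. This is a sketch only: `fetch_logs_dir` is a hypothetical name, and it assumes that the Galaxy or Automation Hub location takes precedence when both installations exist.

```shell
# Sketch: print the directory where fetch_logs is expected to leave its
# zip files. fetch_logs_dir and its argument are illustrative assumptions.
fetch_logs_dir() {
  cp_ansible_root="${1:-.}"  # only used for a GitHub checkout
  galaxy_dir="$HOME/.ansible/collections/ansible_collections/confluent/platform/playbooks/troubleshooting"
  if [ -d "$galaxy_dir" ]; then
    printf '%s\n' "$galaxy_dir"
  else
    printf '%s\n' "$cp_ansible_root/playbooks/troubleshooting"
  fi
}

# Usage: list the per-component archives after the playbook finishes.
# ls -l "$(fetch_logs_dir /path/to/cp-ansible)"/*.zip
```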

Troubleshoot known issues

Issue: A “Clusters not found” error is returned after Kafka brokers restart

If keys are updated on Kafka brokers or any other Confluent Platform component, communication between the component services and the brokers breaks.

Solution: Regenerate certificates whenever you update the keys during an update or a redeployment of the cluster.

To regenerate certificates along with keys, set the following in your inventory file:
```
regenerate_ca: true
```
To update only the certificates when keys have already been generated:
```
regenerate_ca: true

regenerate_keystore_and_truststore: false
```

Issue: Need to enable both mTLS and SASL authentication modes for ZooKeeper

Confluent Ansible does not support enabling both mTLS and SASL authentication modes for ZooKeeper out of the box.

Workaround: Configure the mTLS and SASL authentication modes using an override with the following properties in your inventory file:

zookeeper_sasl_protocol: scram
ssl_enabled: true # This will generate the needed x509 provider authProvider.x509=org.apache.zookeeper.server.auth.X509AuthenticationProvider

zookeeper_client_authentication_type: digest # This will generate the needed SASL provider authProvider.sasl=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
zookeeper_custom_properties:
    ssl.clientAuth: need  # This will force mTLS for clients connecting to ZooKeeper

# The following overrides are needed for the brokers to connect to ZooKeeper using mTLS
kafka_broker_custom_properties:
    zookeeper.ssl.keystore.location: /var/ssl/private/kafka_broker.keystore.jks
    zookeeper.ssl.keystore.password: confluentkeystorestorepass

Issue: Python version mismatch

If the Python versions differ across nodes, for example, the Ansible control node has Python 2.7 installed while the target nodes have Python 3 set as the default Python, you might see the following error:

The Python 2 bindings for rpm are needed for this module. If you require
Python 3 support use the `dnf` Ansible module instead. The Python 2 yum
module is needed for this module. If you require Python 3 support use the
`dnf` Ansible module instead.

Solution: Install the same recommended Python version on all of your control nodes and managed nodes.

Issue: Missing Ansible POSIX collection

If the required Ansible POSIX collection is missing in your environment, you will get an error similar to:

ERROR! couldn't resolve module/action 'sysctl'

Solution: Install the Ansible POSIX collection.

ansible-galaxy collection install ansible.posix

Issue: Incorrect Ansible hash behavior

If the default Ansible hash behavior is not set to merge, you will get an error similar to:

TASK [confluent.platform.common : Confirm Hash Merging Enabled]

fatal: [ip-10-0-2-212.us-west-2.compute.internal]: FAILED! => {
    "assertion": "lookup('config', 'DEFAULT_HASH_BEHAVIOUR') == 'merge'",
    "changed": false,
    "evaluated_to": false,
    "msg": "Hash Merging must be enabled in ansible.cfg"
}

Solution: Set the Ansible hash behavior to merge.

export ANSIBLE_HASH_BEHAVIOUR=merge
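
The export above applies only to the current shell session and the processes it starts. To keep the setting across sessions, the same option can go in ansible.cfg, which is where the assertion message points. The following sketch writes a minimal file; `write_hash_merge_cfg` is a hypothetical helper, and it assumes Ansible picks up ansible.cfg from the directory where you run the playbook.

```shell
# Sketch: create an ansible.cfg that enables hash merging. The helper
# name and target directory argument are assumptions. An existing file
# is left untouched so that its [defaults] section is not duplicated.
write_hash_merge_cfg() {
  cfg="${1:-.}/ansible.cfg"
  if [ -e "$cfg" ]; then
    echo "$cfg already exists; add 'hash_behaviour = merge' under [defaults] by hand" >&2
    return 1
  fi
  cat > "$cfg" <<'EOF'
[defaults]
hash_behaviour = merge
EOF
}
```

If you already maintain an ansible.cfg, add the `hash_behaviour = merge` line to its existing `[defaults]` section instead of running the helper.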

Issue: Missing Ansible community general collection

If the required Ansible community general collection is missing in your environment, you will get an error similar to:

TASK [confluent.platform.common : Custom Java Install]

ERROR! couldn't resolve module/action 'alternatives'.

Solution: Install the Ansible community general collection.

ansible-galaxy collection install community.general

Issue: Missing Ansible community crypto collection

If the required Ansible community crypto collection is missing in your environment, you will get an error similar to:

TASK [confluent.platform.ssl : Create Keystore and Truststore with Self Signed Certs]

ERROR! couldn't resolve module/action 'community.crypto.certificate_complete_chain'

Solution: Install the Ansible community crypto collection.

ansible-galaxy collection install community.crypto
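
The three missing-collection issues above can be checked in one pass before running the playbook. This sketch looks for the collection directories under the per-user collections path; `missing_collections` is a hypothetical helper, and collections installed in another search path (for example, a system-wide location) will be reported as missing.

```shell
# Sketch: report which of the three collections are absent from a
# collections directory. The helper name is an assumption, and the
# default path is the per-user location; other search paths are not checked.
missing_collections() {
  root="${1:-$HOME/.ansible/collections/ansible_collections}"
  for fqcn in ansible.posix community.general community.crypto; do
    namespace="${fqcn%%.*}"  # text before the first dot
    name="${fqcn#*.}"        # text after the first dot
    [ -d "$root/$namespace/$name" ] || printf '%s\n' "$fqcn"
  done
}
```

Each name it prints can be passed to `ansible-galaxy collection install` as shown in the solutions above.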

Issue: Corrupted master key

When your master key is corrupted, you get an error message similar to the following:

TASK [confluent.platform.common : Encrypt Properties] **************************
task path: /root/.ansible/collections/ansible_collections/confluent/platform/roles/common/tasks/secrets_protection.yml

Error! failed to unwrap the data key: invalid master key or corrupted data key

Solution: Let Confluent Ansible recreate the master key.

Remove the regenerate_masterkey, secrets_protection_masterkey, and secrets_protection_security_file variables from your inventory file.

Run the following command:

ansible-playbook -i <inventory.yml> confluent.platform.all --skip-tags package
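
Before rerunning the playbook, it can help to confirm that none of the three variables remain in the inventory. The sketch below assumes a YAML inventory such as hosts.yml, where each variable appears as a `name:` key; `find_masterkey_vars` is a hypothetical helper name.

```shell
# Sketch: list inventory lines that still define one of the three
# variables. find_masterkey_vars is an illustrative helper name. It exits
# with status 1 when no such line is found.
find_masterkey_vars() {
  grep -nE '^[[:space:]]*(regenerate_masterkey|secrets_protection_masterkey|secrets_protection_security_file)[[:space:]]*:' "$1"
}

# Usage:
# find_masterkey_vars hosts.yml || echo "No master key variables left"
```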

Issue: An error occurs while setting up a KRaft-based cluster with mTLS RBAC and a custom user

When setting up a KRaft-based mTLS RBAC cluster with a custom user (greenfield setup or migration), you might receive the following error:

ERROR Exiting Kafka due to fatal exception (kafka.Kafka$)
java.nio.file.AccessDeniedException: /etc/controller/server.properties
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)

Workaround: Define the kafka_controller_user and kafka_controller_group variables under all and kafka_broker in your inventory file if they are currently defined in the kraft_controller section only.

all:
  kafka_controller_user: <user>
  kafka_controller_group: <group>

kafka_broker:
  kafka_controller_user: <user>
  kafka_controller_group: <group>
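
Each known issue in this section has a recognizable error string, so saved playbook output or a component log can be matched against them to find the relevant entry. The following is a sketch under assumptions: `match_known_issue` is a hypothetical helper, the patterns are copied from the messages quoted in this section, and the exact wording may differ between versions.

```shell
# Sketch: map error strings quoted in this section to the matching issue.
# match_known_issue is an illustrative helper. It prints one line per
# matching issue and exits with status 1 when nothing matches.
match_known_issue() {
  log="$1"
  found=1
  # Each line below is "fixed-string pattern|issue title".
  while IFS='|' read -r pattern issue; do
    if grep -Fq -- "$pattern" "$log"; then
      printf '%s\n' "$issue"
      found=0
    fi
  done <<'EOF'
Clusters not found|Clusters not found after Kafka brokers restart
The Python 2 bindings for rpm are needed|Python version mismatch
couldn't resolve module/action|Missing Ansible collection
Hash Merging must be enabled|Incorrect Ansible hash behavior
failed to unwrap the data key|Corrupted master key
AccessDeniedException: /etc/controller/server.properties|KRaft-based cluster with mTLS RBAC and custom user
EOF
  return "$found"
}
```

For example, `match_known_issue failure.txt` could be run against the file produced by the -vvv rerun. A missing match only means the output does not contain one of these strings, not that the deployment is healthy.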