Troubleshoot Ansible Playbooks for Confluent Platform¶
Complete the following steps if Ansible Playbooks for Confluent Platform (Confluent Ansible) fails:
Review the error log output from Ansible itself.
It shows the type of failure that occurred and might indicate a misconfiguration in your inventory file. For example, if you set a file path variable to an invalid path, the log reports “Could not find or access” followed by the file path. Correct the variable and rerun the install.
Review your inventory file.
Validate that all variables are set correctly, with proper spacing in the inventory file. You can review hosts_example.yml and the sample_inventories directory for examples.
Review component log and property files.
If a component health check fails, the playbook fetches log and property files back to the Ansible control node inside the error_files/ directory. These error logs can indicate which properties are misconfigured.
- If Confluent Ansible was downloaded from Ansible Galaxy or Ansible Automation Hub, the error_files/ directory is under ~/.ansible/collections/ansible_collections/confluent/platform/.
- If Confluent Ansible was downloaded from GitHub, the error_files/ directory is located under the root of cp-ansible.
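As a quick check of the two locations above, the following shell sketch prints whichever error_files/ directory exists on the control node. The function name find_error_files and its optional argument (the root of a GitHub checkout of cp-ansible) are illustrative assumptions, not part of Confluent Ansible.

```shell
# Hypothetical helper: print every error_files/ directory that exists.
# $1 (optional) = root of a cp-ansible GitHub checkout; defaults to the current directory.
find_error_files() {
  local repo_root="${1:-.}" dir found=1
  for dir in \
    "$HOME/.ansible/collections/ansible_collections/confluent/platform/error_files" \
    "$repo_root/error_files"
  do
    if [ -d "$dir" ]; then
      echo "$dir"
      found=0
    fi
  done
  # Return 0 when at least one directory was found, 1 otherwise.
  return "$found"
}
```

Calling find_error_files with no argument from the root of cp-ansible checks both locations.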
If the log files do not provide a clear reason for the failure, use one of the following methods to generate more information:
Rerun the playbook with the --diff option and redirect the output to a file. For more information about the flag, see Ansible Playbook Options.
This outputs the differences in the playbook files and templates. With this option, sensitive information, such as passwords, certificates, and keys, is not printed in the output.
Rerun the playbook with the -vvv option to enable debug logging and redirect the output to a file:
ansible-playbook -vvv -i hosts.yml confluent.platform.all > failure.txt
When debug is enabled, the information in the output cannot be suppressed, including sensitive information, such as passwords, certificates, and keys. Using debug mode in production environments is not recommended. For details, see Logging Ansible Output.
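The --diff run can be captured in the same way as the -vvv run. The following is a minimal sketch, assuming the same hosts.yml inventory and confluent.platform.all playbook as the command above. The function name run_playbook_logged and the ANSIBLE_PLAYBOOK override variable are hypothetical conveniences, and unlike the command above, the sketch also captures stderr.

```shell
# Hypothetical wrapper: rerun the playbook with one diagnostic flag and
# save everything it prints (stdout and stderr) to a log file.
# $1 = flag (--diff or -vvv), $2 = log file path.
# ANSIBLE_PLAYBOOK is an assumed override for the binary name, useful for dry runs.
run_playbook_logged() {
  local flag="$1" logfile="$2"
  "${ANSIBLE_PLAYBOOK:-ansible-playbook}" "$flag" -i hosts.yml confluent.platform.all \
    > "$logfile" 2>&1
}
```

For example, run_playbook_logged --diff diff_output.txt writes the diff output to diff_output.txt in the current directory. A log produced with -vvv can contain secrets, so treat that file as sensitive.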
Open a support ticket with Confluent Support and provide the following within a compressed archive file:
- Your inventory file
- The log files generated from the -vvv or --diff option
- The error_files/ directory and its contents
- The output of the following Git commands as a text file. Run the commands from the root of cp-ansible to show any changes made to the cp-ansible source code:
git status
git diff
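The items listed above can be gathered into one archive with a short script. This is a sketch under stated assumptions: the function name make_support_bundle is hypothetical, failure.txt and diff_output.txt are assumed names for the saved playbook output, and the error_files/ directory is assumed to sit under the root of cp-ansible (the GitHub download case).

```shell
# Hypothetical helper: collect the support-ticket items into one compressed archive.
# $1 = root of cp-ansible, $2 = inventory file, $3 = archive to create (.tar.gz).
make_support_bundle() {
  local repo_root="$1" inventory="$2" archive="$3" staging f rc
  staging="$(mktemp -d)" || return 1
  cp "$inventory" "$staging/" || return 1
  # Saved playbook output; these file names are assumptions.
  for f in failure.txt diff_output.txt; do
    if [ -f "$repo_root/$f" ]; then cp "$repo_root/$f" "$staging/"; fi
  done
  if [ -d "$repo_root/error_files" ]; then
    cp -R "$repo_root/error_files" "$staging/"
  fi
  # Record local source changes as text. Git errors (for example, when the
  # directory is not a Git checkout) are written to the same file.
  ( cd "$repo_root" && git status && git diff ) > "$staging/git_changes.txt" 2>&1 || true
  tar -czf "$archive" -C "$staging" .
  rc=$?
  rm -rf "$staging"
  return "$rc"
}
```

The inventory file and -vvv output can contain credentials, so review the archive before attaching it to a ticket.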
Generate logs using the fetch-logs playbook¶
When troubleshooting, you might need to collect the service logs and config files of all Confluent components. Instead of connecting to each host machine with ssh and fetching the logs and files, you can run the fetch_logs playbook to get all service logs and config files in a single directory on the control node.
The playbook stores the gathered log files in a separate zip file for each component. The zip files are located in:
- ~/.ansible/collections/ansible_collections/confluent/platform/playbooks/troubleshooting if Confluent Ansible was downloaded from Ansible Galaxy or Ansible Automation Hub
- <the root of cp-ansible>/playbooks/troubleshooting if Confluent Ansible was downloaded from GitHub
To gather logs and config files of all components:
ansible-playbook -i hosts.yml confluent.platform.fetch_logs
To gather service logs and config files of a specific component, use the --tags flag with the component name. For example, to get logs and config files for Kafka:
ansible-playbook -i hosts.yml confluent.platform.fetch_logs --tags 'kafka_broker'
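After a fetch_logs run, it can help to confirm which per-component zip files were actually produced. The following sketch lists them from the troubleshooting directory described earlier. The function name list_fetched_logs is a hypothetical helper, not part of Confluent Ansible.

```shell
# Hypothetical helper: list the per-component zip files produced by fetch_logs.
# $1 (optional) = root of a cp-ansible GitHub checkout; when omitted, the
# Ansible Galaxy / Automation Hub collection path is used.
list_fetched_logs() {
  local base="${1:-$HOME/.ansible/collections/ansible_collections/confluent/platform}"
  local dir="$base/playbooks/troubleshooting"
  if [ ! -d "$dir" ]; then
    echo "No troubleshooting directory at $dir" >&2
    return 1
  fi
  # Only zip files directly inside the troubleshooting directory are listed.
  find "$dir" -maxdepth 1 -type f -name '*.zip' | sort
}
```

An empty result after a successful playbook run suggests that the logs were written to the other install location.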
Troubleshoot known issues¶
Issue: An error, “Clusters not found”, returns after Kafka brokers restart¶
If keys are updated on Kafka brokers or any other Confluent Platform component, communication between the component services and the brokers breaks.
Solution: Regenerate certificates when you are updating the keys, during an update or a redeployment of the cluster.
To regenerate certificates along with keys, set the following in your inventory file:
regenerate_ca: true
To only update the certificates when keys have already been generated:
regenerate_ca: true
regenerate_keystore_and_truststore: false
Issue: Need to enable both mTLS and SASL authentication modes for ZooKeeper¶
Confluent Ansible does not support using both mTLS and SASL authentication modes for ZooKeeper out of the box.
Workaround: Configure the mTLS and SASL authentication modes using an override with the following properties in your inventory file:
zookeeper_sasl_protocol: scram
ssl_enabled: true # This will generate the needed x509 provider authProvider.x509=org.apache.zookeeper.server.auth.X509AuthenticationProvider
zookeeper_client_authentication_type: digest # This will generate the needed SASL provider authProvider.sasl=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
zookeeper_custom_properties:
  ssl.clientAuth: need # This will force mTLS for client connecting to Zookeeper
# The following overrides are needed for the brokers to connect to Zookeeper using mTLS
kafka_broker_custom_properties:
  zookeeper.ssl.keystore.location: /var/ssl/private/kafka_broker.keystore.jks
  zookeeper.ssl.keystore.password: confluentkeystorestorepass
Issue: Python version mismatch¶
If you are using different versions of Python across the nodes, for example, the Ansible control node has Python 2.7 installed while the target nodes have Python 3 set as the default Python, you might see the following error:
The Python 2 bindings for rpm are needed for this module. If you require
Python 3 support use the `dnf` Ansible module instead. The Python 2 yum
module is needed for this module. If you require Python 3 support use the
`dnf` Ansible module instead.
Solution: Install the same, recommended Python version on all of your control nodes and managed nodes.
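To spot a mismatch before rerunning the playbook, you can compare the version strings reported on each node. The following is a coarse sketch that only compares major versions; the function names python_major and same_python_major are hypothetical, and the input is assumed to be the text printed by python --version.

```shell
# Extract the major version from `python --version` style output,
# for example "Python 3.9.2" -> "3".
python_major() {
  printf '%s\n' "$1" | sed -E 's/^Python ([0-9]+)\..*$/\1/'
}

# Hypothetical check: succeed only when both version strings share a major version.
# $1 = control node version string, $2 = managed node version string.
same_python_major() {
  [ "$(python_major "$1")" = "$(python_major "$2")" ]
}
```

On the control node, the version string can be captured with python --version 2>&1, because Python 2 prints its version to stderr. One way to collect it from managed nodes, assuming python3 is on their path, is an ad hoc command such as ansible -i hosts.yml all -m command -a 'python3 --version'.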
Issue: Unable to connect to Connect Replicator service¶
When installing Confluent Platform using the archive installation mode (installation_method: archive), you might encounter a java.lang.NoSuchMethodError error while connecting to kafka_connect_replicator_port (default value of 8083).
Solution:
Modify the Replicator startup script as described in the Confluent Support knowledge base article.
After you have modified the script, restart Replicator:
systemctl restart kafka-connect-replicator.service