Generate Diagnostics with the Diagnostics Bundle Tool for Confluent Platform¶
The Confluent Platform Diagnostics Bundle Tool collects diagnostic information about your Confluent Platform installation and compresses it into a .tar.gz file, which you can upload to Confluent Support for further analysis.
This tool currently collects diagnostics on Kafka brokers and Kafka Connect.
You can perform the following tasks with this tool:
- Collect diagnostics to submit to Confluent
- Collect logs from a specific time frame
- Collect diagnostics with an input file
- Evaluate the components and modify the diagnostics that are generated
After you collect diagnostics, you can upload the file to Confluent Support.
Prerequisites¶
To run this tool you need the following:
- Confluent Platform version 6.1 or later
- Java 8 or later (also required by Confluent Platform)
- Java Virtual Machine Process Status (jps) tool installed, which is required to Discover and modify components
- Permission to write to the current directory
- Read access to files being collected (for complete collection)
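Before you run the tool, you can optionally verify these prerequisites from a shell. The following commands are a minimal sketch of such a check:
java -version
jps -l
touch .diag-write-test && rm .diag-write-test && echo "write permission OK"
The first two commands confirm that Java and the jps tool are available on the PATH; the last confirms write access to the current directory.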
Installation¶
To install the tool:
Download the Diagnostics Bundle Tool JAR file from the Confluent download page.
For example, you can use the wget or curl command to download it. You should download the latest version of the tool from the download page. The following examples download version 1.0.3; check for the latest version and change the version string if needed.
Example using wget:
wget https://packages.confluent.io/tools/diagnostics-bundle/diagnostics-bundle-1.0.3.jar
Example using curl:
curl -O https://packages.confluent.io/tools/diagnostics-bundle/diagnostics-bundle-1.0.3.jar
Copy the JAR file to a directory on each Confluent Platform node where you want to run it and collect diagnostics.
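For example, you might use scp to copy the JAR to a node; the user name, hostname, and target directory below are placeholders:
scp diagnostics-bundle-1.0.3.jar confluent@broker-1.example.com:/home/confluent/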
For release details, see the Release notes section.
Collect diagnostics¶
To collect diagnostics for Confluent Platform to share with Confluent, use the collect command on each node where Confluent Platform is running. Using collect with no options results in the tool performing basic sanitization of the output data, meaning password strings and MAC addresses are redacted, and log files from the seven days before the current time are collected. You can also specify specific logs, or use the Log Redaction Tool or the discover and plan commands to help sanitize the output.
Important
You should review all information in the output file before you upload the file to Confluent Support.
Following is an example of the collect command:
java -jar ./diagnostics-bundle-<version>.jar collect
The tool runs for a couple of minutes and, when complete, generates a zipped output bundle named like the following: diagnostics-output-<hostname>-<YYYY>-<MM>-<DD>-<HH>-<MM>-<ss>.tar.gz.
This bundle contains:
- Confluent components logs
- Confluent component configuration files
- Confluent component process information
- Confluent component metrics if JMX is enabled
- Host information
Your final line of output should resemble:
Diagnostics output has been zipped and written to: /home/confluent/diagnostics-output-PF2T6DCF-2023-08-22-23-38-55.tar.gz
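To review the bundle contents before you upload, you can extract it with tar, using the file name reported in the output. For example:
tar -xzf diagnostics-output-PF2T6DCF-2023-08-22-23-38-55.tar.gz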
This bundle contains a metadata directory and a directory for each Confluent component. The metadata directory contains input files, if they were specified, and other metadata for the tool. Each component directory contains a subdirectory and files for each diagnostic that was collected. Following is an example of the file structure of the tool output:
└── diagnostics-output-ip-10-0-206-212-2023-08-24-12-38-22
├── _meta
│   ├── component-plan.yaml
│   ├── diagnostics.log
│   └── discovered-components.yaml
├── host
│ └── shell
│ └── mpstat-P-ALL-1-10.yaml
└── kafka
├── logs
│ └── var-log-kafka
│ ├── controller.log
│ └── data-balancer.log
├── metrics
│ └── metrics.txt
├── properties
│ ├── log4j.properties
│ └── server.properties
└── shell
└── du-var-lib-kafka-data.yaml
Important
Do not upload event data or message content to support tickets. See Upload the files.
Collect logs from a specific time frame¶
The collect
command has the following options to specify the log files that are collected:
Option | Default | Details |
---|---|---|
--all-logs | False | Collects all logs related to all Confluent components. This option overrides the other options. |
--logs-start=<startTimestamp> | 7 days ago | Collects log files modified after the specified timestamp. The timestamp must be specified in ISO-8601 format. Example: 2023-09-01T00:00:00Z |
--logs-end=<endTimestamp> | The current time | Collects log files created before the specified timestamp. The timestamp must be specified in ISO-8601 format. Example: 2023-09-01T00:00:00Z |
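For example, the following command collects log files for the first week of September 2023 (the timestamps are illustrative):
java -jar ./diagnostics-bundle-<version>.jar collect --logs-start=2023-09-01T00:00:00Z --logs-end=2023-09-08T00:00:00Z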
Collect diagnostics with an input file¶
You can control what diagnostics are generated by specifying an input file when you run the collect command.
You can use one of the following switches with the collect command to specify the diagnostics that are generated:
- --from-discover: Provides a custom discover file to specify the components that are evaluated for diagnostics. For more information about the discover file, see Discover and modify components.
- --from-plan: Provides a custom plan file to specify what is included in the diagnostics. For more information about the plan file, see Understand and modify the plan file.
- --from-config: Enables you to specify a config file for a component that is not currently running. To use this switch, provide a configuration file for a Confluent Platform component. For more information about the config file, see Use a config file.
Following is an example of the --from-plan switch:
java -jar "diagnostics-bundle-<version>.jar" collect --from-plan planfile.yaml
Evaluate and modify the diagnostics output¶
You can evaluate and modify the diagnostics output by the Diagnostics Bundle Tool by creating or modifying the list of components and the diagnostics collected for those components.
Understand and modify the plan file¶
You can optionally use the plan command to see the files that are examined and the diagnostics that are generated by the tool.
The output from this command also shows you what is excluded, such as passwords and other sensitive data.
Once you have generated this data, you can save it, modify it, and specify the file as input when you generate diagnostics.
The following example shows you how to run this command on the node that is running Confluent Platform:
java -jar "diagnostics-bundle-<version>.jar" plan
Following is example output (YAML) from the tool:
components:
- type: kafka
processId: 21189
diagnostics:
- type: shell
timeoutInSeconds: 35
commands:
- jinfo 21189
- top -n 10 -b -p 21189
excludedKeywords:
- password
- secret
- credential
- auth
- token
- key
# So that MAC address is not collected.
- ^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$
- type: logs
logDirectories:
# The log files in these directories will be collected.
- /var/log/kafka
options:
startTimestamp: 2023-08-17T05:35:04Z
endTimestamp: 2023-08-24T05:35:04Z
- type: metrics
jmxPort: 9999
jmxHost: localhost
pollingIntervalInSeconds: 6
pollingIterations: 10
include:
# JVM
- java.lang:type=OperatingSystem
- java.lang:type=Memory
# JMX
- kafka.cluster:type=Partition,*
- kafka.controller:type=ControllerStats,*
- kafka.server:type=BrokerTopicMetrics,*
- kafka.server:type=DelayedOperationPurgatory,*
excludeMetricNames:
- kafka.controller:name=ListPartitionReassignmentRateAndTimeMs,type=ControllerStats,*
excludeMetricAttributes:
- metricName: kafka.server:name=LogAppendLatencyMs,type=BrokerTopicMetrics
attributes:
- 50thPercentile
- 75thPercentile
- 95thPercentile
- type: properties
files:
- type: componentConfigurationFile
path: /etc/kafka/server.properties
- type: log4jConfigurationFile
path: /etc/kafka/log4j.properties
# These properties will not be collected.
exclude:
- confluent.ssl.key.password
- confluent.ssl.keystore.location
- confluent.ssl.keystore.password
- delegation.token.secret.key
- password.encoder.secret
- sasl.jaas.config
The file provides a list of components that diagnostics will be generated for. Each component has:
- A type. Currently kafka, connect, and host are supported.
- The process ID for the component (optional).
- A list of diagnostics that will be collected.
You can modify this file to meet the disclosure requirements of your organization. The following table describes each diagnostics type and its elements in more detail; the element names correspond to the fields shown in the example plan file above.
To skip collecting diagnostics of a particular diagnostics type, omit that section from the file you pass to the collect command. For more information, see Collect diagnostics with an input file.
Diagnostics type | Element detail |
---|---|
shell | commands: the shell commands to run. timeoutInSeconds: the timeout applied to the commands. excludedKeywords: keywords and patterns whose matching values are redacted from the output. |
logs | logDirectories: the directories whose log files are collected. options: the startTimestamp and endTimestamp that bound which log files are collected. |
metrics | jmxPort and jmxHost: the JMX endpoint to poll. pollingIntervalInSeconds and pollingIterations: how often and how many times metrics are polled. include: the JMX object names to collect. excludeMetricNames and excludeMetricAttributes: metrics and metric attributes to omit. |
properties | files: the configuration files to collect, each with a type and path. exclude: the property keys that are not collected. |
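A typical workflow is to save the plan output to a file, edit out anything you do not want collected, and pass the file back to the collect command. The following sketch assumes the plan YAML can be redirected to a file as shown; the file name is a placeholder:
java -jar "diagnostics-bundle-<version>.jar" plan > planfile.yaml
java -jar "diagnostics-bundle-<version>.jar" collect --from-plan planfile.yaml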
Discover and modify components¶
You can optionally use the discover command with the diagnostics tool to see what components will be evaluated. This command discovers components that are currently running and would be evaluated for diagnostic purposes. You can save this output, modify it, and specify the file as input when you generate the diagnostics for Confluent.
The following example shows you how to run this command on the node that is running Confluent Platform:
java -jar "diagnostics-bundle-<version>.jar" discover
Following is example output (YAML) from this command:
# Diagnostic Bundle found the following supported components on this node
components:
- name: kafka
processId: 21977
log4jConfigurationFile: /etc/kafka/log4j.properties
jmxPort: 9999
jmxHost: localhost
componentConfigurationFile: /etc/kafka/server.properties
log4jDirectories:
- /var/log/kafka
dataDirectory: /var/lib/kafka/data
The following table describes each element in more detail:
Element | Element detail |
---|---|
name | Type of the component for which diagnostics are collected. Currently supported: kafka, connect. Required. |
processId | Optional process ID if the component is running. |
log4jConfigurationFile | Optional path to the log4j configuration file. |
jmxPort | Optional JMX port for the component. |
jmxHost | Optional JMX host of the component. If not specified, defaults to localhost. |
componentConfigurationFile | Optional path to the component configuration file. |
log4jDirectories | Optional list of directories that contain the log4j log files to be collected. |
dataDirectory | Optional directory containing Kafka log data. |
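As with the plan file, you can save the discover output to a file, adjust it, and pass it back to the collect command. The following sketch assumes the discover YAML can be redirected as shown; the file name is a placeholder:
java -jar "diagnostics-bundle-<version>.jar" discover > discoverfile.yaml
java -jar "diagnostics-bundle-<version>.jar" collect --from-discover discoverfile.yaml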
Use a config file¶
You can specify a configuration file for the collect command that identifies components to be evaluated. The YAML file should contain the name of the component and list the paths to the configuration and log files for the component. The file must have the following format:
components:
- name: kafka
componentConfigurationFile: /etc/kafka/server.properties
log4jConfigurationFile: /etc/kafka/log4j.properties
log4jLogDirectories:
- /etc/kafka
Element | Element detail |
---|---|
name | Type of the component for which diagnostics are collected. Currently supported: kafka, connect. Required. |
componentConfigurationFile | Optional path to the component configuration file. |
log4jConfigurationFile | Optional path to the log4j configuration file. |
log4jLogDirectories | Optional list of directories that contain the log4j log files to be collected. |
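For example, after saving the YAML above as a file, pass it to the collect command with the --from-config switch; the file name is a placeholder:
java -jar "diagnostics-bundle-<version>.jar" collect --from-config kafka-config.yaml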
Upload the files¶
Once you have generated the diagnostics bundle, it is important that you upload it in a secure manner. In addition, you should never upload sensitive information to Confluent. If you need to sanitize your files, see Evaluate and modify the diagnostics output.
To upload files, use Secure File Transfer, which enables file encryption and tracking of users that access the files. For more information, see Required Access to Confluent Network Sites.
Errors and troubleshooting¶
Info logs for the tool are output to the console by default. To view debug logs in the console, use the --verbose/-v switch.
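For example:
java -jar ./diagnostics-bundle-<version>.jar collect --verbose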
If an exception occurs, only the first line of the exception is output to the console. If you are using the collect command and an exception occurs, you can view the full stack trace in the application log in the output directory. This file is named diagnostics-output-<hostname>-<YYYY>-<MM>-<DD>-<HH>-<MM>-<ss>/_meta/diagnostics.log.
If the tool encounters failures while collecting diagnostics, it collects what it can. For example, if the tool cannot collect log files due to a permissions issue, or cannot run a command due to missing dependencies, the tool still runs and collects the remaining diagnostics.
If the tool cannot discover components, plan the diagnostics collection, or collect any diagnostics, this is considered a fatal error and the tool does not run. For example, this occurs when a YAML file provided to the collect command is invalid or nonexistent. You can use the information provided in the console or check the tool logs to determine what error occurred.
Troubleshoot common issues¶
Use the following information to troubleshoot common errors that might occur.
Error message | Root cause/fix |
---|---|
Error: failed to create a directory during collection | This is a fatal error that occurs when the tool does not have permission to write to the current directory. Make sure the user running the tool has appropriate write permissions. |
Error: The file provided via --from-discover/-d contains unsupported component(s) or Error: The file provided via --from-config/-c contains unsupported component(s) | This is a fatal error that occurs when the component type in a discover or config file is not supported. Currently kafka and connect are supported component types. |
Error: failed to load the file provided via --from-plan/-p | This is a fatal error that occurs when the plan file cannot be located or is not valid YAML. This can also occur when the file is formatted correctly but a field does not meet the validation requirements; for example, a required field is omitted. Make sure you provided the correct path to the file, that it is valid YAML, and that it meets the requirements specified in the plan section. |
No diagnostic output folder is generated. | Note that an output folder is not generated for plan or discover. If you are running the collect command, this indicates a fatal error occurred before the diagnostic output was generated. Use the -v option to enable verbose logging and further diagnose the issue. |
No components are discovered although supported components are running. | This is a non-fatal error that occurs when the user running the diagnostic tool does not have permissions to access the relevant JVMs. This is required because the tool runs jps. To mitigate this issue, ensure that a user with the correct permissions runs the tool, such as the user who started the Confluent components. |
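For example, if the Kafka broker runs as a dedicated kafka service account, running the tool as that user lets jps see the broker JVM; the user name here is a placeholder:
sudo -u kafka java -jar ./diagnostics-bundle-<version>.jar collect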
Release notes¶
[Dec 8, 2023] Version 1.0.1¶
- Upgraded FasterXML/jackson to version 2.15.3 due to possible CVE-2023-35116 vulnerability.