Troubleshoot Unified Stream Manager¶
This guide helps you solve common issues with the Unified Stream Manager (USM) Agent.
Troubleshoot the USM Agent¶
This topic provides information about troubleshooting the USM Agent, the available diagnostic tools, and common issues and their solutions.
Diagnostic tools¶
The USM Agent includes the following built-in tools to help diagnose and resolve connectivity and configuration issues.
- usm-agent-validz: Validates your Confluent Cloud configuration and credentials.
- usm-agent-logcollector: Collects network traffic logs for analysis.
- usm-agent-logparser: Parses collected logs into a readable format.
Initial diagnostic steps¶
Before investigating specific error scenarios, run the usm-agent-validz tool. This is the fastest way to check
your USM Agent’s credentials and its connection to Confluent Cloud.
The validation tool runs on the USM Agent’s control plane listener, which defaults to port 9999 and uses the /validz endpoint.
For Confluent for Kubernetes, use the following command:
kubectl confluent cluster usmagent usm-agent-validz
For Ansible Playbooks for Confluent Platform, use the following command:
# Default port 9999
usm-agent-validz
# With specific port, for example, port 9510
usm-agent-validz 9510
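You can also probe the validation endpoint directly from the agent's host. This is a minimal sketch that assumes the control plane listener serves plain HTTP on localhost and the default port:
# Probe the /validz endpoint directly (assumes plain HTTP on the default port 9999)
curl -v http://localhost:9999/validz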
If the validation command succeeds, your basic connectivity and credentials are correct. If it fails, refer to the common error messages below.
Interpreting usm-agent-validz errors¶
HTTP 401 unauthorized
This error indicates that the Confluent Cloud API key or secret used by the agent is incorrect or has been revoked.
To fix this issue, do the following:
- Verify the API key and secret in your USM Agent configuration file.
- In the Confluent Cloud console, confirm that the API key is active.
- If you make changes, restart the USM Agent to apply the new configuration.
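If you use the Confluent CLI instead of the console, you can also confirm that the key still exists and is owned by the expected service account. This assumes the CLI is installed and you are logged in to the correct organization:
# List API keys and confirm the agent's key is still present
confluent api-key list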
HTTP 503 service unavailable
This error suggests the agent cannot reach the configured Confluent Cloud endpoint.
To fix this issue, check the following common causes:
- Verify the Confluent Cloud endpoint URL in the USM Agent’s configuration file. Ensure it is correct for your Confluent Cloud environment and region.
- Confirm that there is network connectivity from the agent’s host to the Confluent Cloud endpoint. Check for any firewalls, proxies, or network security groups that might be blocking outbound HTTPS traffic on port 443.
- If you are using PrivateLink, verify that the endpoint is correctly configured, active, and deployed in the same cloud region as your Confluent Platform cluster.
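To verify the network path from the agent's host, a quick HTTPS request against the configured endpoint is often enough. The hostname below is a placeholder for your Confluent Cloud endpoint:
# Check outbound HTTPS reachability and TLS from the agent host (placeholder hostname)
curl -v https://<confluent-cloud-endpoint>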
Common issues and solutions¶
Issue: User authentication failed between Confluent Platform and USM Agent¶
Confluent Platform logs show an authentication failure between Kafka or Connect components and the USM Agent.
Solution: To fix this issue, do the following:
- Check the username and password that are configured in your Kafka or Connect components.
- Verify these credentials match the ones defined in the USM Agent’s configuration.
- After you correct any mismatches, restart the affected Kafka or Connect components.
Issue: Metadata updates from Confluent Platform don’t appear in Confluent Cloud¶
Actions that you perform in your Confluent Platform cluster, such as creating, updating, or deleting a topic, are not reflected in the Confluent Cloud UI. This problem indicates a misconfiguration of the USM Agent endpoint in the Confluent Platform components, such as the KRaft controller.
Solution: To fix this issue, do the following:
1. Check the USM Agent logs for metadata events.
First, verify that the metadata event requests from Confluent Platform are reaching the USM Agent. You can check the agent logs and filter for event traffic.
For Confluent for Kubernetes, run the following command:
kubectl logs -f <usm-agent-pod-name> | grep -i events
If this command returns no output when you create a topic, the requests are not reaching the agent. Proceed to the next step.
2. Inspect the KRaft controller logs for connection errors.
Check the KRaft controller logs for any connectivity or DNS resolution errors related to the USM Agent endpoint.
For Confluent for Kubernetes, use the following command:
kubectl logs <kraftcontroller-pod-name>
An error message similar to the following confirms a misconfiguration:
io.confluent.telemetry.client.TelemetryClientException: Failed to send request POST http://usm-agent.namespace.svc.cluster.local:9999/v1/metrics after 3 attempt(s). Last error: java.net.UnknownHostException: usm-agent.namespace.svc.cluster.local: Name or service not known
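To narrow the controller output to USM-related failures, you can filter the logs for the agent hostname or the exception shown above; adjust the pattern to match your environment:
kubectl logs <kraftcontroller-pod-name> | grep -iE "usm-agent|UnknownHostException"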
3. Fix the USM Agent endpoint by correcting the URL in your deployment configuration.
- For Confluent for Kubernetes: Update the usmAgentClientURL in the relevant Kubernetes deployment YAML file or Helm chart values.
- For Ansible Playbooks for Confluent Platform: Update the appropriate variable in your playbook or inventory file.
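For example, in a Confluent for Kubernetes deployment, a corrected endpoint value might look like the following. This is an illustrative sketch only; the exact placement of usmAgentClientURL depends on your CFK version, CRDs, or Helm chart values, and the namespace and port shown are placeholders.
# Illustrative excerpt only; confirm field placement against your CFK CRD or Helm values
usmAgentClientURL: http://usm-agent.<namespace>.svc.cluster.local:9999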
After updating the configuration, redeploy Confluent Platform for the changes to take effect.
Issue: Agent logs show 400 or 500 errors (environment mismatch)¶
Metrics and events are missing from Confluent Cloud, and the agent logs show 400 and 500 errors.
This issue occurs when the Confluent Platform cluster is registered in one Confluent Cloud environment, but the USM Agent is
misconfigured to point to a different environment.
Solution: Ensure that the USM Agent and its corresponding Confluent Platform cluster are configured for the same Confluent Cloud environment.
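If you are unsure which environment the cluster is registered in, you can list the environments in your organization with the Confluent CLI and compare the environment ID against the one configured for the agent. This assumes the CLI is installed and authenticated:
# List Confluent Cloud environments and their IDs
confluent environment list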
Issue: Agent logs show 403 Forbidden errors (Missing Permissions)¶
Metrics and events are missing, and agent logs show 403 forbidden errors. This indicates an authorization failure. The agent’s API key is valid, but its service account lacks permissions.
Solution: In the Confluent Cloud console, grant the USMAgent role to the service account associated with the API key used by the USM Agent.
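If you prefer the Confluent CLI to the console, the role can also be granted with a role binding. The following is a sketch only; the service account ID and environment ID are placeholders, and the scope flags you need depend on where the role must be granted in your setup:
# Sketch: grant the USMAgent role to the agent's service account (placeholder IDs)
confluent iam rbac role-binding create \
  --principal User:sa-xxxxxx \
  --role USMAgent \
  --environment env-xxxxx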
Issue: Agent logs show 403 forbidden errors (region mismatch)¶
Metadata updates fail, and Agent logs show 403 forbidden errors. This error occurs when the Confluent Platform cluster and the Confluent Cloud PrivateLink endpoint that it uses are in different cloud regions.
Solution: Ensure that your Confluent Platform cluster and the PrivateLink endpoint that it connects to are deployed in the same cloud region.
Issue: Topic count is inaccurate or fluctuates in Confluent Cloud¶
The total number of topics displayed in the Confluent Cloud Console is unstable or incorrect. For example, the topic count might be high one moment and then drop significantly the next.
This issue can occur if your cluster contains a large number of topics and the default packet size for the topic list snapshot, which is sent from your Confluent Platform cluster to Confluent Cloud, is too small to include all topics.
Solution: To resolve this issue, increase the maximum snapshot size in the configuration for your Confluent Platform controllers.
On each of your Confluent Platform controller hosts, open the server.properties file. Add the following property and set its value:
confluent.catalog.collector.max.bytes.per.snapshot: 2000000
This setting increases the maximum packet size to 2 MB, which allows a larger topic list to be sent in a single snapshot.
Restart your controllers for the configuration change to take effect.
After the controllers restart, the topic count in the Confluent Cloud Console should stabilize and accurately reflect the number of topics in your cluster.
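To confirm the fix, you can compare the topic count shown in the Confluent Cloud Console against the count reported by the cluster itself. This sketch assumes a reachable bootstrap server and any client configuration your cluster requires:
# Count topics directly on the Confluent Platform cluster (placeholder bootstrap server)
kafka-topics --bootstrap-server <broker-host>:9092 --list | wc -l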
Issue: Telemetry metrics are missing or delayed in large-scale clusters¶
In large Confluent Platform clusters, for example, clusters with over 10,000 topics and schemas, you might see a drop in telemetry metrics being sent to Confluent Cloud. This issue can occur because the default batch size and buffering limits for the telemetry exporter are too small to handle the high volume of data that a large-scale cluster generates.
Solution: To resolve this issue, increase the telemetry batching and buffering capacity and enable compression.
Add the following properties to each Kafka broker configuration, for example, in your server.properties file or
equivalent custom properties configuration.
confluent.telemetry.exporter._usm.client.compression: gzip
confluent.telemetry.exporter._usm.buffer.batch.items.max: 4000
confluent.telemetry.exporter._usm.buffer.inflight.submissions.max: 10
confluent.telemetry.exporter._usm.buffer.pending.batches.max: 80
After you update the configuration, perform a rolling restart of your Kafka brokers to apply the changes. This should stabilize the metric reporting for your large-scale environment.
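For Ansible Playbooks for Confluent Platform, you can typically supply these settings through the broker custom properties in your inventory. The following sketch assumes you use the kafka_broker_custom_properties variable; adjust it to your inventory layout:
# Sketch of an inventory excerpt; adapt to your own hosts.yml structure
kafka_broker:
  vars:
    kafka_broker_custom_properties:
      confluent.telemetry.exporter._usm.client.compression: gzip
      confluent.telemetry.exporter._usm.buffer.batch.items.max: 4000
      confluent.telemetry.exporter._usm.buffer.inflight.submissions.max: 10
      confluent.telemetry.exporter._usm.buffer.pending.batches.max: 80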
Advanced troubleshooting with log analysis¶
If the initial diagnostic steps don't solve the issue, you can capture and parse the network traffic that flows through the USM Agent.
Step 1: Collect traffic logs
Use the usm-agent-logcollector tool to capture live network traffic for a specific duration. By default, it collects traffic for 60 seconds.
For Confluent for Kubernetes, use the following command:
# Collect logs for 60 seconds (default)
kubectl confluent cluster usmagent logcollector
For Ansible Playbooks for Confluent Platform, use the following command:
# Basic collection
usm-agent-logcollector
If you encounter a permission error such as error accessing directory /var/log/confluent/usm-agent/tap/: stat /var/log/confluent/usm-agent/tap/: permission denied,
choose one of the following options:
- Run the command as the cp-usm-agent user, which has the necessary permissions for the default path.
sudo -u cp-usm-agent usm-agent-logcollector
- If you don't have permissions for the default path, create a custom directory for log collection and then run usm-agent-logcollector with that directory as the output directory.
mkdir /tmp/usm-log-collection
usm-agent-logcollector -output-dir /tmp/usm-log-collection
Step 2: Analyze collected logs
After collecting the logs, use the usm-agent-logparser tool to convert the raw log files into a human-readable format.
For Confluent for Kubernetes, use the following command:
# Parse logs to console
kubectl confluent cluster usmagent logparser
For Ansible Playbooks for Confluent Platform, use the following command:
# Parse logs to console
usm-agent-logparser
If you encounter a permission error such as error accessing directory /var/log/confluent/usm-agent/tap/: stat /var/log/confluent/usm-agent/tap/: permission denied,
choose one of the following options:
- Run the command as the cp-usm-agent user, which has the necessary permissions for the default path.
sudo -u cp-usm-agent usm-agent-logparser
- If you don't have permissions for the default path, create a custom directory and then run usm-agent-logparser with that directory as the output directory.
# (Optional) Create a directory for the parsed output
mkdir /tmp/usm-log-parsed
# Parse logs from a custom input directory to a new output directory
usm-agent-logparser -in /tmp/usm-log-collection -out /tmp/usm-log-parsed
Detailed tool reference¶
usm-agent-validz - This tool tests connectivity and validates your credentials.
For Confluent for Kubernetes, use the following command:
# Default validation (recommended)
kubectl confluent cluster usmagent usm-agent-validz

# Advanced validation with a specific port
kubectl confluent cluster usmagent usm-agent-validz <port>
For Ansible Playbooks for Confluent Platform, use the following command:
# Basic usage
usm-agent-validz

# With specific port
usm-agent-validz <port>
usm-agent-logcollector - This tool captures network traffic for analysis. The usm-agent-logcollector taps the traffic
and stores it in a raw binary format, which is a .pb file in the output directory. This file contains complete information
about the request and payload, including sensitive data such as Confluent Cloud credentials that are configured on Confluent Platform components
like Kafka and connectors, because it is a raw tap of the traffic.
It includes the following parameters:
- -admin-url - USM Agent admin URL (default: http://localhost:9901/tap).
- -duration - Collection time in seconds (default: 60).
- -output-dir - The directory where captured log files are saved (default: /var/log/confluent/usm-agent/tap/).
- -errors-only - Capture only HTTP errors (300-599 status codes).
- -req-type - Filters captured traffic by request type (events, metrics, or all).

For Confluent for Kubernetes, use the following command:
# Basic collection
kubectl confluent cluster usmagent logcollector
For Ansible Playbooks for Confluent Platform, use the following command:
# Basic collection
usm-agent-logcollector

# Extended collection focusing on errors
usm-agent-logcollector -duration 300 -errors-only

# Collect only metrics with custom output directory
usm-agent-logcollector -req-type metrics -output-dir /tmp/metrics-logs
usm-agent-logparser - This tool parses collected traffic logs into a readable format. The usm-agent-logparser tool
processes the raw .pb files collected by usm-agent-logcollector.
- In normal mode (default), only the request payload is logged. Because credentials are sent as HTTP headers, there is no need to redact passwords in this mode.
- When run with the -raw mode flag, the parser logs all request headers but will attempt to redact the password or secret for security.
It includes the following parameters:
- -in - Input file or directory (default: /var/log/confluent/usm-agent/tap).
- -out - Output directory (default: stdout).

For Confluent for Kubernetes, use the following command:
# Basic parsing
kubectl confluent cluster usmagent logparser
For Ansible Playbooks for Confluent Platform, use the following command:
# Parse to console
usm-agent-logparser

# Parse to file
usm-agent-logparser -out /tmp/parsed-output
Log locations and access¶
For Confluent for Kubernetes, all logs from the USM Agent Pod are combined into a single stream. To view the logs, use the following command:
kubectl logs -f <usm-agent-pod-name>
For Ansible Playbooks for Confluent Platform, logs are separated by type into different files.
Access logs: These logs are located at /var/log/confluent/usm-agent/usm-agent_access.log and are formatted as follows:

[%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%" %RESPONSE_CODE% %RESPONSE_FLAGS_LONG% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% "%REQ(X-FORWARDED-FOR)%" "%REQ(USER-AGENT)%" "%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%" "%UPSTREAM_PROTOCOL%"

Application logs: These logs are located at /var/log/confluent/usm-agent/usm-agent_application.log.

Traffic captures: These logs are located in the /var/log/confluent/usm-agent/tap/ directory.
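To spot failing requests quickly, you can filter the access log for 4xx and 5xx response codes. This sketch assumes the access log format shown above, where the response code follows the quoted request line:
# Show access log entries whose response code is 4xx or 5xx
grep -E '" [45][0-9]{2} ' /var/log/confluent/usm-agent/usm-agent_access.log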