Troubleshoot Unified Stream Manager

This guide helps you solve common issues with the Unified Stream Manager (USM) Agent.

Troubleshoot the USM Agent

This topic describes the diagnostic tools available for the USM Agent and lists common issues with their solutions.

Diagnostic tools

The USM Agent includes the following built-in tools to help diagnose and resolve connectivity and configuration issues.

  • usm-agent-validz: Validates your Confluent Cloud configuration and credentials.
  • usm-agent-logcollector: Collects network traffic logs for analysis.
  • usm-agent-logparser: Parses collected logs into a readable format.

Initial diagnostic steps

Before investigating specific error scenarios, run the usm-agent-validz tool. This is the fastest way to check your USM Agent’s credentials and its connection to Confluent Cloud.

The validation tool runs on the USM Agent’s control plane listener, which defaults to port 9999 and uses the /validz endpoint.

For Confluent for Kubernetes, use the following command:

kubectl confluent cluster usmagent usm-agent-validz

For Ansible Playbooks for Confluent Platform, use the following command:

# Default port 9999
usm-agent-validz

# With specific port, for example, port 9510
usm-agent-validz 9510

If the validation command succeeds, your basic connectivity and credentials are correct. If it fails, refer to the common error messages below.

Interpreting usm-agent-validz errors

  • HTTP 401 unauthorized

    This error indicates that the Confluent Cloud API key or secret used by the agent is incorrect or has been revoked.

    To fix this issue, do the following:

    1. Verify the API key and secret in your USM Agent configuration file.
    2. In the Confluent Cloud console, confirm that the API key is active.
    3. If you make changes, restart the USM Agent to apply the new configuration.
  • HTTP 503 service unavailable

    This error suggests the agent cannot reach the configured Confluent Cloud endpoint.

    To fix this issue, check the following common causes:

    1. Verify the Confluent Cloud endpoint URL in the USM Agent’s configuration file. Ensure it is correct for your Confluent Cloud environment and region.
    2. Confirm that there is network connectivity from the agent’s host to the Confluent Cloud endpoint. Check for any firewalls, proxies, or network security groups that might be blocking outbound HTTPS traffic on port 443.
    3. If you are using PrivateLink, verify that the endpoint is correctly configured, active, and deployed in the same cloud region as your Confluent Platform cluster.
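
As a quick reference, the two errors above can be folded into a small shell helper. This is only a sketch: `explain_validz_status` is a hypothetical function name, not part of the USM Agent tooling, and it assumes you have already read the HTTP status code from the usm-agent-validz output yourself.

```shell
# Hypothetical helper (not shipped with the USM Agent): maps an HTTP status
# reported by usm-agent-validz to the first thing to check.
explain_validz_status() {
  case "$1" in
    401) echo "Check the Confluent Cloud API key and secret in the agent configuration." ;;
    503) echo "Check the Confluent Cloud endpoint URL, outbound HTTPS on port 443, and PrivateLink." ;;
    *)   echo "No specific guidance for status $1; see the common issues in this guide." ;;
  esac
}

# Usage:
#   explain_validz_status 401
```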

Common issues and solutions

Issue: User authentication failed between Confluent Platform and USM Agent

Confluent Platform logs show an authentication failure between Kafka or Connect components and the USM Agent.

Solution: To fix this issue, do the following:

  1. Check the username and password that are configured in your Kafka or Connect components.
  2. Verify these credentials match the ones defined in the USM Agent’s configuration.
  3. After you correct any mismatches, restart the affected Kafka or Connect components.

Issue: Metadata updates from Confluent Platform don’t appear in Confluent Cloud

Actions that you perform in your Confluent Platform cluster, such as creating, updating, or deleting a topic, are not reflected in the Confluent Cloud UI. This problem indicates a misconfiguration of the USM Agent endpoint in the Confluent Platform components, such as the KRaft controller.

Solution: To fix this issue, do the following:

  1. Check USM Agent logs for metadata events.

    First, verify that the metadata event requests from Confluent Platform are reaching the USM Agent. You can check the agent logs and filter for event traffic.

    For Confluent for Kubernetes, run the following command:

    kubectl logs -f <usm-agent-pod-name> | grep -i events
    

    If this command returns no output when you create a topic, the requests are not reaching the agent. Proceed to the next step.

  2. Inspect KRaft controller logs for connection errors.

    Check the KRaft controller logs for any connectivity or DNS resolution errors related to the USM Agent endpoint.

    For Confluent for Kubernetes, use the following command:

    kubectl logs <kraftcontroller-pod-name>
    

    An error message similar to the following confirms a misconfiguration:

    io.confluent.telemetry.client.TelemetryClientException: Failed to send request POST http://usm-agent.namespace.svc.cluster.local:9999/v1/metrics after 3 attempt(s). Last error: java.net.UnknownHostException: usm-agent.namespace.svc.cluster.local: Name or service not known
    
  3. Fix the USM Agent endpoint by correcting the URL in your deployment configuration.

    • For Confluent for Kubernetes: Update the usmAgentClient URL in the relevant Kubernetes deployment YAML file or Helm chart values.
    • For Ansible Playbooks for Confluent Platform: Update the appropriate variable in your playbook or inventory file.

After updating the configuration, redeploy Confluent Platform for the changes to take effect.
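
The controller log check in step 2 can be narrowed with a filter so that only likely endpoint failures are shown. The sketch below assumes the two exception names from the sample error message are representative; `find_usm_endpoint_errors` is a hypothetical helper name, and real logs may contain other variants.

```shell
# Hypothetical helper: reads controller log text on stdin and prints only the
# lines that look like failed requests to the USM Agent endpoint.
# Both patterns are taken from the sample error message in this guide.
find_usm_endpoint_errors() {
  grep -E 'TelemetryClientException|UnknownHostException'
}

# Usage (Confluent for Kubernetes):
#   kubectl logs <kraftcontroller-pod-name> | find_usm_endpoint_errors
```

No output from the filter does not prove the endpoint is correct; it only means these two patterns were not logged.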

Issue: Agent logs show 400 or 500 errors (environment mismatch)

Metrics and events are missing from Confluent Cloud, and the agent logs show 400 and 500 errors. This issue occurs when the Confluent Platform cluster is registered in one Confluent Cloud environment, but the USM Agent is misconfigured to point to a different environment.

Solution: Ensure that the USM Agent and its corresponding Confluent Platform cluster are configured for the same Confluent Cloud environment.

Issue: Agent logs show 403 forbidden errors (missing permissions)

Metrics and events are missing, and agent logs show 403 forbidden errors. This indicates an authorization failure. The agent’s API key is valid, but its service account lacks permissions.

Solution: In the Confluent Cloud console, grant the USMAgent role to the service account associated with the API key used by the USM Agent.

Issue: Agent logs show 403 forbidden errors (region mismatch)

Metadata updates fail, and agent logs show 403 forbidden errors. This error occurs when the Confluent Platform cluster and the Confluent Cloud PrivateLink endpoint that it uses are in different cloud regions.

Solution: Ensure that your Confluent Platform cluster and the PrivateLink endpoint that it connects to are deployed in the same cloud region.

Issue: Topic count is inaccurate or fluctuates in Confluent Cloud

The total number of topics displayed in the Confluent Cloud Console is unstable or incorrect. For example, the topic count might be high one moment and then drop significantly the next.

This issue can occur if your cluster contains a large number of topics and the default packet size for the topic list snapshot, which is sent from your Confluent Platform cluster to Confluent Cloud, is too small to include all topics.

Solution: To resolve this issue, increase the maximum snapshot size in the configuration for your Confluent Platform controllers.

  1. On each of your Confluent Platform controller hosts, open the server.properties file.

  2. Add the following property and set its value:

    confluent.catalog.collector.max.bytes.per.snapshot: 2000000
    

    This setting increases the maximum packet size to 2 MB, which allows a larger topic list to be sent in a single snapshot.

  3. Restart your controllers for the configuration change to take effect.

After the controllers restart, the topic count in the Confluent Cloud Console should stabilize and accurately reflect the number of topics in your cluster.
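
Before restarting, it can help to confirm that each controller's properties file actually contains the new value. The following sketch reads a key from a Java-style properties file; `get_prop` is a hypothetical helper, and the file path in the usage comment is a placeholder for wherever your server.properties lives.

```shell
# Hypothetical helper: prints the value of a key from a Java-style properties
# file, accepting either "key=value" or "key: value". Comment lines are skipped.
# Arguments: $1 = properties file, $2 = key.
get_prop() {
  awk -v key="$2" '
    /^[ \t]*[#!]/ { next }
    {
      line = $0
      sub(/^[ \t]+/, "", line)
      if (index(line, key) == 1) {
        rest = substr(line, length(key) + 1)
        if (rest ~ /^[ \t]*[=:]/) {
          sub(/^[ \t]*[=:][ \t]*/, "", rest)
          value = rest
        }
      }
    }
    END { if (value != "") print value }
  ' "$1"
}

# Usage:
#   get_prop /path/to/server.properties confluent.catalog.collector.max.bytes.per.snapshot
```

If the key appears more than once, the helper prints the last value, which mirrors how a properties file is normally loaded.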

Issue: Telemetry metrics are missing or delayed in large-scale clusters

In large Confluent Platform clusters, for example, clusters with over 10,000 topics and schemas, you might see a drop in telemetry metrics being sent to Confluent Cloud. This issue can occur because the default batch size and buffering limits for the telemetry exporter are too small to handle the high volume of data that a large-scale cluster generates.

Solution: To resolve this issue, increase the telemetry batching and buffering capacity and enable compression. Add the following properties to each Kafka broker configuration, for example, in your server.properties file or equivalent custom properties configuration.

confluent.telemetry.exporter._usm.client.compression: gzip
confluent.telemetry.exporter._usm.buffer.batch.items.max: 4000
confluent.telemetry.exporter._usm.buffer.inflight.submissions.max: 10
confluent.telemetry.exporter._usm.buffer.pending.batches.max: 80

After you update the configuration, perform a rolling restart of your Kafka brokers to apply the changes. This should stabilize the metric reporting for your large-scale environment.
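
Because the change must be applied to every broker, a quick check for missing keys can catch a host that was skipped. This is a sketch under the assumption that the properties are set directly in a properties file; `missing_usm_exporter_props` is a hypothetical helper name.

```shell
# Hypothetical helper: prints each of the four exporter properties from this
# guide that is NOT set in the given properties file (one per line).
# Argument: $1 = broker properties file.
missing_usm_exporter_props() {
  for key in \
    confluent.telemetry.exporter._usm.client.compression \
    confluent.telemetry.exporter._usm.buffer.batch.items.max \
    confluent.telemetry.exporter._usm.buffer.inflight.submissions.max \
    confluent.telemetry.exporter._usm.buffer.pending.batches.max
  do
    # Escape the dots so grep matches the key literally.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if ! grep -Eq "^[[:space:]]*${pattern}[[:space:]]*[=:]" "$1"; then
      printf '%s\n' "$key"
    fi
  done
}

# Usage:
#   missing_usm_exporter_props /path/to/server.properties
```

Empty output means all four keys are present; the helper does not validate their values.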

Advanced troubleshooting with log analysis

If the initial diagnostic steps don’t solve the issue, you can capture and parse the network traffic that flows through the USM Agent.

Step 1: Collect traffic logs

Use the usm-agent-logcollector tool to capture live network traffic for a specific duration. By default, it collects traffic for 60 seconds.

For Confluent for Kubernetes, use the following command:

# Collect logs for 60 seconds (default)
kubectl confluent cluster usmagent logcollector

For Ansible Playbooks for Confluent Platform, use the following command:

# Basic collection
usm-agent-logcollector

If you encounter a permission error such as error accessing directory /var/log/confluent/usm-agent/tap/: stat /var/log/confluent/usm-agent/tap/: permission denied, choose one of the following options:

  • Run the command as the cp-usm-agent user, which has the necessary permissions for the default path.

    sudo -u cp-usm-agent usm-agent-logcollector

  • If you don’t have permissions for the default path, create a new custom directory for log collection and then run usm-agent-logcollector with that directory as the output directory.

    mkdir /tmp/usm-log-collection
    usm-agent-logcollector -output-dir /tmp/usm-log-collection
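
The choice between the default path and a custom directory can also be scripted. The sketch below is one possible approach, not part of the USM Agent tooling: `pick_tap_dir` is a hypothetical helper that falls back to a custom directory when the default one is missing or not writable by the current user.

```shell
# Hypothetical helper: prints the default tap directory when the current user
# can write to it, otherwise creates and prints a fallback directory.
# Arguments (both optional): $1 = default directory, $2 = fallback directory.
pick_tap_dir() {
  default_dir=${1:-/var/log/confluent/usm-agent/tap/}
  fallback_dir=${2:-/tmp/usm-log-collection}
  if [ -d "$default_dir" ] && [ -w "$default_dir" ]; then
    printf '%s\n' "$default_dir"
  else
    mkdir -p "$fallback_dir" && printf '%s\n' "$fallback_dir"
  fi
}

# Usage:
#   usm-agent-logcollector -output-dir "$(pick_tap_dir)"
```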

Step 2: Analyze collected logs

After collecting the logs, use the usm-agent-logparser tool to convert the raw log files into a human-readable format.

For Confluent for Kubernetes, use the following command:

# Parse logs to console
kubectl confluent cluster usmagent logparser

For Ansible Playbooks for Confluent Platform, use the following command:

# Parse logs to console
usm-agent-logparser

If you encounter a permission error such as error accessing directory /var/log/confluent/usm-agent/tap/: stat /var/log/confluent/usm-agent/tap/: permission denied, choose one of the following options:

  • Run the command as the cp-usm-agent user, which has the necessary permissions for the default path.

    sudo -u cp-usm-agent usm-agent-logparser

  • If you don’t have permissions for the default path, create a new custom directory and then run usm-agent-logparser with that directory as the output directory.

    # (Optional) Create a directory for the parsed output
    mkdir /tmp/usm-log-parsed
    # Parse logs from a custom input directory to a new output directory
    usm-agent-logparser -in /tmp/usm-log-collection -out /tmp/usm-log-parsed
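
When the collection and parsing steps are chained, it is worth confirming that the collector actually wrote capture files before parsing. The following sketch assumes captures are stored as .pb files, as described in the tool reference; `has_captures` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds when the directory holds at least one raw
# .pb capture file produced by usm-agent-logcollector.
# Argument: $1 = capture directory.
has_captures() {
  [ -n "$(find "$1" -type f -name '*.pb' 2>/dev/null | head -n 1)" ]
}

# Usage:
#   if has_captures /tmp/usm-log-collection; then
#     usm-agent-logparser -in /tmp/usm-log-collection -out /tmp/usm-log-parsed
#   else
#     echo "No .pb captures found; rerun usm-agent-logcollector." >&2
#   fi
```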

Detailed tool reference

usm-agent-validz - This tool tests connectivity and credential validation.

  • For Confluent for Kubernetes, use the following command:

    # Default validation (recommended)
    kubectl confluent cluster usmagent usm-agent-validz
    
    # Advanced validation with a specific port
    kubectl confluent cluster usmagent usm-agent-validz <port>
    
  • For Ansible Playbooks for Confluent Platform, use the following command:

    # Basic usage
    usm-agent-validz
    
    # With specific port
    usm-agent-validz <port>
    

usm-agent-logcollector - This tool captures network traffic for analysis. The usm-agent-logcollector taps the traffic and stores it in a raw binary format, which is a .pb file in the output directory. Because the file is a raw tap of the traffic, it contains complete information about each request and its payload, including sensitive data such as Confluent Cloud credentials that are configured on Confluent Platform components like Kafka and connectors.

It includes the following parameters:

  • -admin-url - USM Agent admin URL (default: http://localhost:9901/tap).

  • -duration - Collection time in seconds (default: 60).

  • -output-dir - The directory where captured log files are saved (default: /var/log/confluent/usm-agent/tap/).

  • -errors-only - Capture only HTTP errors (300-599 status codes).

  • -req-type - Filters captured traffic by request type (events, metrics, or all).

  • For Confluent for Kubernetes, use the following command:

    # Basic collection
    kubectl confluent cluster usmagent logcollector
    
  • For Ansible Playbooks for Confluent Platform, use the following command:

    # Basic collection
    usm-agent-logcollector
    
    # Extended collection focusing on errors
    usm-agent-logcollector -duration 300 -errors-only
    
    # Collect only metrics with custom output directory
    usm-agent-logcollector -req-type metrics -output-dir /tmp/metrics-logs
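
Because raw captures can contain credentials, a custom output directory should not be readable by other users on the host. The sketch below shows one way to do that with standard tools; `make_private_capture_dir` is a hypothetical helper name, and the mode 700 is a suggested choice rather than a documented requirement.

```shell
# Hypothetical helper: creates a capture directory that only the current
# user can access, then prints its path.
# Argument: $1 = directory to create.
make_private_capture_dir() {
  mkdir -p "$1" && chmod 700 "$1" && printf '%s\n' "$1"
}

# Usage:
#   usm-agent-logcollector -output-dir "$(make_private_capture_dir /tmp/usm-log-collection)"
```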
    

usm-agent-logparser - This tool parses collected traffic logs into a readable format. The usm-agent-logparser tool processes the raw .pb files collected by usm-agent-logcollector.

  • In normal mode (default), only the request payload is logged. Because credentials are sent as HTTP headers, there is no need to redact passwords in this mode.
  • When run with the -raw mode flag, the parser logs all request headers but will attempt to redact the password or secret for security.

It includes the following parameters:

  • -in - Input file or directory (default: /var/log/confluent/usm-agent/tap).

  • -out - Output directory (default: stdout).

  • For Confluent for Kubernetes, use the following command:

    # Basic parsing
    kubectl confluent cluster usmagent logparser
    
  • For Ansible Playbooks for Confluent Platform, use the following command:

    # Parse to console
    usm-agent-logparser
    
    # Parse to file
    usm-agent-logparser -out /tmp/parsed-output
    

Log locations and access

For Confluent for Kubernetes, all logs from the USM Agent Pod are combined into a single stream. To view the logs, use the following command:

kubectl logs -f <usm-agent-pod-name>

For Ansible Playbooks for Confluent Platform, logs are separated by type into different files.

  • Access logs: These logs are located at /var/log/confluent/usm-agent/usm-agent_access.log. These logs are formatted as:

    [%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%" %RESPONSE_CODE% %RESPONSE_FLAGS_LONG% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% "%REQ(X-FORWARDED-FOR)%" "%REQ(USER-AGENT)%" "%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%" "%UPSTREAM_PROTOCOL%"
    
  • Application logs: These logs are located at /var/log/confluent/usm-agent/usm-agent_application.log.

  • Traffic captures: These logs are located in the /var/log/confluent/usm-agent/tap/ directory.
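
The access log format lends itself to a quick summary of response codes, which helps spot the 400, 403, and 500 errors described earlier. The following sketch assumes the request line is the first double-quoted field, as in the format string, so the response code is the first token after the second quote; `count_status_codes` is a hypothetical helper name.

```shell
# Hypothetical helper: reads access-log lines on stdin and prints
# "<status> <count>" for each response code, sorted by status.
count_status_codes() {
  awk -F'"' '
    NF >= 3 {
      # $3 is the text after the quoted request line; its first token
      # is the response code.
      split($3, parts, " ")
      if (parts[1] != "") counts[parts[1]]++
    }
    END { for (code in counts) print code, counts[code] }
  ' | sort
}

# Usage:
#   count_status_codes < /var/log/confluent/usm-agent/usm-agent_access.log
```

Lines that do not contain a quoted request line are ignored, so the helper tolerates stray output in the file.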