USM Agent: sizing, high availability, and monitoring
This page describes key architectural and operational considerations for the Unified Stream Manager (USM) Agent. The configuration for sizing, high availability, and monitoring depends on your deployment method. Find the section below that matches your environment: Confluent for Kubernetes or Confluent Ansible.
For information about networking configuration including IPv6 and dual-stack support, see USM Agent networking configuration.
Confluent for Kubernetes deployments
When you deploy with Confluent for Kubernetes (CFK), CFK automates most of the configuration for sizing and high availability.
Sizing and scaling
CFK automatically manages resource allocation for the USM Agent, so you don’t need to configure it manually. CFK sets the following default resources for the usm-agent container:
Requests: 100m CPU, 128Mi memory
Limits: 300m CPU, 256Mi memory
To scale out, increase the replicas for the USM Agent in your custom resource. For information about overriding these defaults, see Specify CPU and memory requests.
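For illustration, a scale-out with an explicit resource override might look like the following sketch. The `UsmAgent` kind and field paths here are assumptions based on common CFK custom resource conventions; check the USM Agent CRD reference for the exact names before applying.

```yaml
# Hypothetical custom resource sketch; the kind name and field paths
# follow common CFK CRD conventions and may differ in your version.
apiVersion: platform.confluent.io/v1beta1
kind: UsmAgent
metadata:
  name: usm-agent
  namespace: <namespace>
spec:
  replicas: 2          # scale out for redundancy
  podTemplate:
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 300m      # raise to give a busy agent more headroom
        memory: 256Mi
```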
High availability
To achieve high availability for the USM Agent, use standard CFK patterns by increasing the replica count in your custom resource. For a redundant setup, a minimum of two replicas is recommended.
To expose the USM Agent externally, follow the standard CFK procedures for configuring load balancers.
Monitoring
You can monitor USM Agent logs for troubleshooting and scrape Prometheus metrics for performance analysis.
Logs
The USM Agent provides three types of logs, which are accessed differently in a Kubernetes environment:
Application and access logs: These are available by default in the standard Kubernetes Pod logs. The agent sends application logs to stderr and access logs to stdout. This lets you configure a logging agent, such as Fluentd, Logstash, or Filebeat, to capture and manage these streams separately.
Traffic logs: You can extract these logs by configuring the logcollector component. These logs are located in the /var/log/confluent/usm-agent/tap/ directory.
Metrics with Prometheus
The USM Agent exposes Prometheus-compatible metrics for monitoring the agent’s own performance, health, and traffic patterns. These are Envoy proxy metrics that cover HTTP request and response statistics, upstream cluster health, connection pools, and system resource utilization.
For information about the metrics and metadata that the USM Agent collects from your Kafka and Connect clusters and sends to Confluent Cloud, see USM Agent: Metrics and Metadata Reference.
Metrics endpoint
By default, the monitoring listener binds to port 9910 and exposes metrics at the /stats/prometheus endpoint. The security configuration for this endpoint, including the protocol (http or https) and any authentication requirements, mirrors the configuration of the main dataplane listener.
Prometheus configuration for Kubernetes
To scrape metrics from multiple USM Agent instances in Kubernetes, use the Prometheus Operator with a PodMonitor resource.
Recommended: Using PodMonitor
If you have the Prometheus Operator installed, create a PodMonitor to automatically discover and scrape all USM Agent pods:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: usm-agent-monitor
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: usm-agent
  podMetricsEndpoints:
  - port: admin
    path: /stats/prometheus
    interval: 60s
    scheme: http
    scrapeTimeout: 30s
Replace <namespace> with your USM Agent namespace.
Alternative: Using a Service
Confluent for Kubernetes exposes the external port for the USM Agent but does not expose the monitoring port by default. For a single USM Agent instance, complete the following steps:
Create a Kubernetes Service to expose port 9910:

apiVersion: v1
kind: Service
metadata:
  name: usm-agent-metrics
  namespace: <namespace>
spec:
  ports:
  - name: prometheus
    port: 9910
    protocol: TCP
    targetPort: 9910
  selector:
    app: usm-agent

Replace <namespace> with your USM Agent namespace.

Configure Prometheus to scrape the agent directly:

scrape_configs:
- job_name: 'usm-agent'
  scrape_interval: 15s
  metrics_path: /stats/prometheus
  # Uncomment and configure if USM Agent uses mTLS authentication
  # scheme: https
  # tls_config:
  #   cert_file: '/path/to/client.pem'
  #   key_file: '/path/to/client.key'
  #   ca_file: '/path/to/cacerts.pem'
  # Uncomment and configure if USM Agent uses basic authentication
  # basic_auth:
  #   username: <username>
  #   password: <password>
  static_configs:
  - targets: ['<usm-agent-service>.<namespace>.svc.cluster.local:9910']

Replace <usm-agent-service> with your USM Agent service name and <namespace> with your namespace.
For a complete list of available metrics and their descriptions, see the Envoy statistics documentation.
Visualization with Grafana
To visualize USM Agent metrics in Grafana, import the pre-built Envoy proxy monitoring dashboard into your Grafana instance.
Ansible Playbooks for Confluent Platform deployments
While Ansible Playbooks for Confluent Platform automates the deployment and configuration of the USM Agent, you must manually manage the underlying infrastructure for your Linux servers.
Sizing and scaling
Properly sizing the USM Agent is critical for performance and stability.
Vertical sizing: CPU and memory
For a server, such as a virtual machine or bare-metal host, that runs a USM Agent instance, a minimum configuration of 2 vCPU cores and 2 GB of RAM is recommended.
This baseline provides a stable environment with sufficient resources for both the agent process and the underlying operating system. After deployment, monitor the server’s CPU and memory utilization and adjust these resources to meet the specific demands of your workload.
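If the agent runs as a systemd service, as is typical for Ansible-managed Confluent Platform hosts, one way to enforce resource ceilings is a systemd drop-in override. The unit name below is an assumption; substitute the service name used in your deployment.

```ini
# /etc/systemd/system/confluent-usm-agent.service.d/resources.conf
# Hypothetical unit name; substitute the service name from your
# deployment. Apply with:
#   systemctl daemon-reload && systemctl restart <unit>
[Service]
CPUQuota=200%        # cap at the equivalent of 2 vCPU cores
MemoryMax=2G         # hard memory ceiling for the agent process
```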
Operating system requirements
For non-containerized deployments, you must use Red Hat Enterprise Linux (RHEL) 9 or later. RHEL 8 does not support native USM Agent installation. For RHEL 8 VMs, you can deploy the USM Agent as a container using Podman.
Horizontal sizing: adding instances
Add multiple agents primarily to achieve high availability. This ensures service continuity if one agent fails, as traffic can be redirected to a healthy instance.
If high availability is not a requirement, running one larger, vertically scaled agent is more resource-efficient than running multiple smaller agents.
High availability
For critical environments, use one of the following methods to make the USM Agent highly available on Linux servers.
Use a load balancer
This approach involves placing an HTTP-based load balancer in front of two or more USM Agent instances to distribute traffic and manage failover automatically. The load balancer runs continuous health checks on each agent. If it detects a failure, it automatically routes traffic away from the unhealthy instance to a healthy one.
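As a sketch, an HAProxy front end for two agent instances might look like the following. The host names, ports, and check settings are placeholders; the health check here probes the monitoring listener on port 9910, on the assumption that it is reachable over plain HTTP in your deployment.

```
# Hypothetical HAProxy configuration; host names, ports, and the
# health-check endpoint are placeholders for your environment.
frontend usm_agent_front
    bind *:443
    mode http
    default_backend usm_agent_back

backend usm_agent_back
    mode http
    balance roundrobin
    # Probe the monitoring listener; route around failing instances
    option httpchk GET /stats/prometheus
    server agent1 usm-agent-1.example.com:<dataplane-port> check port 9910
    server agent2 usm-agent-2.example.com:<dataplane-port> check port 9910
```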
Use a virtual IP
This method provides network-level failover without a dedicated load balancer, typically in an active-passive configuration. It requires clustering software to manage a shared IP address.
Two servers, a primary and a standby, both run the USM Agent. A single, floating Virtual IP (VIP) is assigned to the primary server.
All clients are configured to connect to this single VIP, not to the individual server IPs.
The clustering software constantly monitors the health of the primary agent. If it fails, the software automatically reassigns the VIP from the failed server to the standby server.
The standby server is instantly promoted to primary and begins to handle all incoming traffic.
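The steps above can be sketched with keepalived, one common choice of clustering software for VIP failover. The interface name, router ID, VIP, and health-check command below are placeholders; the standby server uses state BACKUP and a lower priority, and the check assumes the monitoring listener answers plain HTTP on port 9910.

```
# Hypothetical keepalived configuration for the primary server.
vrrp_script check_usm_agent {
    # Consider the agent healthy if the metrics endpoint responds
    script "/usr/bin/curl -fsS http://localhost:9910/stats/prometheus -o /dev/null"
    interval 5
    fall 2
}

vrrp_instance USM_AGENT_VIP {
    state MASTER          # standby server uses BACKUP
    interface eth0
    virtual_router_id 51
    priority 100          # standby server uses a lower value
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24     # the floating VIP that clients connect to
    }
    track_script {
        check_usm_agent
    }
}
```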
Monitoring
You can monitor agent logs for troubleshooting and scrape Prometheus metrics for performance analysis.
Logs
The agent writes three types of logs to files on disk:
Application logs: Contain general runtime information and errors. These logs are located at /var/log/confluent/usm-agent/usm-agent_application.log.
Access logs: Provide a detailed record of every request handled by the agent. These logs are located at /var/log/confluent/usm-agent/usm-agent_access.log.
Traffic logs: Contain detailed, structured records of the traffic that is being processed. These logs are located in the /var/log/confluent/usm-agent/tap/ directory.
Metrics with Prometheus
The USM Agent exposes Prometheus-compatible metrics for monitoring its performance, health, and traffic patterns. These are Envoy proxy metrics that cover HTTP request and response statistics, upstream cluster health, connection pools, and system resource utilization.
For information about the metrics and metadata that the USM Agent collects from your Kafka and Connect clusters and sends to Confluent Cloud, see USM Agent: Metrics and Metadata Reference.
Metrics endpoint
By default, the monitoring listener binds to port 9910 and exposes metrics at the /stats/prometheus endpoint. The security configuration for this endpoint, including the protocol (http or https) and any authentication requirements, mirrors the configuration of the main dataplane listener.
To access metrics from a local agent instance:
curl http://localhost:9910/stats/prometheus
If the dataplane listener uses TLS or basic authentication, apply the same credentials to the monitoring endpoint.
Prometheus configuration
Configure your Prometheus server to scrape the agent’s metrics endpoint. Add the following to your prometheus.yml file:
scrape_configs:
- job_name: 'usm-agent'
  scrape_interval: 15s
  metrics_path: /stats/prometheus
  # Uncomment and configure if USM Agent uses mTLS authentication
  # scheme: https
  # tls_config:
  #   cert_file: '/path/to/client.pem'
  #   key_file: '/path/to/client.key'
  #   ca_file: '/path/to/cacerts.pem'
  # Uncomment and configure if USM Agent uses basic authentication
  # basic_auth:
  #   username: <username>
  #   password: <password>
  static_configs:
  - targets: ['<agent-host>:9910']
Replace <agent-host> with the hostname or IP address of your USM Agent instance.
For a complete list of available metrics and their descriptions, see the Envoy statistics documentation.
Visualization with Grafana
To visualize USM Agent metrics in Grafana, import the pre-built Envoy proxy monitoring dashboard into your Grafana instance.