Health+ Intelligent Alerts for Confluent Platform

Health+ actively monitors the performance and configuration of your Confluent Platform services. Intelligent Alerts notify you of potential environmental issues and unfavorable operational conditions before they become critical problems or lead to costly downtime or service outages.

When metrics that trigger alerts go back to a normal state, Health+ sends a notification to subscribed channels that the alerts have been resolved.

Important

If you are operating your cluster in KRaft mode, controllers are currently reported as brokers, and alerts may not function as expected. For more information, see KRaft limitations and known issues.

Health+ provides a set of commonly used alerts with the free tier, and an extended set of alerts with the premium tier.

Set up notifications for your Confluent Platform deployment on the Manage Notifications page in Confluent Cloud Console.

If you have difficulties resolving an issue that’s reported by a Health+ alert, contact Confluent Support for further help.

Alert severity levels

The following table describes the supported Health+ alert severity levels. Not every monitored metric has all alert levels.

State | Summary | Description
Critical | Highly recommended actions | Issues are present that may limit or prevent data from moving across your cluster. We recommend you address these with priority.
Warning | Potential future issues | These metrics are close to exceeding their normal operational range and may cause future issues. We recommend you review these metrics and their recommended actions.
Info | Informational events | Informational events on the normal operation of your cluster. We recommend you review these.

Premium Tier alerts

Disk Usage

Alerts when the total number of bytes for a disk volume crosses one of the specified thresholds.

Severity level | Severity threshold | Description
Critical | > 90% volume usage | Available disk volume is nearly exhausted.
Warning | > 70% volume usage | Volume usage is high.
Info | > 50% volume usage | Greater than 50% volume usage is still considered within normal operating range.
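
As a rough illustration of how these thresholds partition volume usage, the following Python sketch maps a usage percentage to a severity level. The function names `disk_usage_severity` and `volume_percent_used` are hypothetical, and reading the local volume with `shutil.disk_usage` is an assumption for illustration only; Health+ collects its own metrics and its internal logic may differ.

```python
import shutil
from typing import Optional


def disk_usage_severity(percent_used: float) -> Optional[str]:
    """Map a volume usage percentage to a severity level.

    Hypothetical helper for illustration; not part of Health+.
    Thresholds are strict "greater than" comparisons, as in the table.
    """
    if percent_used > 90:
        return "Critical"
    if percent_used > 70:
        return "Warning"
    if percent_used > 50:
        return "Info"
    return None  # At or below 50%: no alert.


def volume_percent_used(path: str) -> float:
    """Return the percentage of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

For example, `disk_usage_severity(volume_percent_used("/"))` evaluates the root volume on a Unix-like host.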

Fetch Request Latency

Some fetch request latency on your brokers is normal, but sizable increases in latency can mean it’s taking longer for your consumer clients to retrieve your messages.

Severity level | Severity threshold | Description
Warning | Average fetch request latency in the last 15 minutes is >50% above the average broker latency for the last 24 hours. | We’ve detected a 50% increase in fetch request latency on one or more of your brokers.
Info | Average fetch request latency in the last 15 minutes is >25% above the average broker latency for the last 24 hours. | We’ve detected a 25% increase in fetch request latency on one or more of your brokers.
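
The comparison of a 15-minute average against a 24-hour baseline can be sketched as a relative increase. This is a minimal illustration, not the Health+ implementation: the function name `latency_severity` is hypothetical, and the caller is assumed to supply the latency samples (in milliseconds) for each window.

```python
from statistics import mean
from typing import Optional, Sequence


def latency_severity(
    recent_ms: Sequence[float], baseline_ms: Sequence[float]
) -> Optional[str]:
    """Compare the recent average latency against a longer baseline average.

    recent_ms:   samples from the last 15 minutes (assumed input).
    baseline_ms: samples from the last 24 hours (assumed input).
    """
    recent_avg = mean(recent_ms)
    baseline_avg = mean(baseline_ms)
    if baseline_avg <= 0:
        return None  # No meaningful baseline to compare against.

    # Relative increase of the recent average over the baseline.
    increase = (recent_avg - baseline_avg) / baseline_avg
    if increase > 0.50:
        return "Warning"
    if increase > 0.25:
        return "Info"
    return None
```

With a baseline average of 10 ms, a recent average of 16 ms is a 60% increase and maps to Warning, while 13 ms is a 30% increase and maps to Info.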

Fetch Follower Request Latency

Severity level | Severity threshold | Description
Warning | Average fetch follower request latency in the last 15 minutes is >50% above the average broker latency for the last 24 hours. | We’ve detected a 50% increase in fetch follower request latency on one or more of your brokers.
Info | Average fetch follower request latency in the last 15 minutes is >25% above the average broker latency for the last 24 hours. | We’ve detected a 25% increase in fetch follower request latency on one or more of your brokers.

Network Processor Pool Usage

The network processor threads are responsible for reading and writing data to and from clients on the network.

Severity level | Severity threshold | Description
Critical | > 90% pool usage | At greater than 90% usage, the network processor pool is overloaded and client requests may take longer to complete.
Warning | > 70% pool usage | Greater than 70% usage is still considered within normal operating range for network processor pool usage, as threads are not computationally intensive.
Info | > 50% pool usage | Greater than 50% usage is still considered within normal operating range for the network processor pool.

Produce Request Latency

Some produce request latency on your brokers is normal, but sizable increases in latency can lead to longer wait times before your consumers can retrieve your messages.

Severity level | Severity threshold | Description
Warning | Average produce request latency in the last 15 minutes is >50% above the average broker latency for the last 24 hours. | We’ve detected a 50% increase in produce request latency on one or more of your brokers.
Info | Average produce request latency in the last 15 minutes is >25% above the average broker latency for the last 24 hours. | We’ve detected a 25% increase in produce request latency on one or more of your brokers.

Request Handler Pool Usage

The request handler threads are responsible for serving requests to and from clients, including reading and writing to disk.

Severity level | Severity threshold | Description
Critical | > 90% pool usage | At greater than 90% usage, the request handler is overloaded and client requests may take longer to complete.
Warning | > 70% pool usage | At greater than 70% usage, the request handler is approaching the overloaded range for this broker.
Info | > 50% pool usage | At greater than 50% usage, the request handler is still considered within normal operating range.

Free Tier alerts

Active Controller Count

The controller is responsible for maintaining the list of partition leaders and coordinating leadership transitions (for example, during topic creation).

In normal operation, a new controller is automatically selected if the current one becomes unavailable. Issues occur when this does not happen, or when more than one controller is active.

  • If the active controller count is less than 1, producers and consumers can’t get the partition leaders anymore.
  • If the active controller count is greater than 1, a split-brain situation occurs and may cause serious data integrity issues.

Severity level | Severity threshold | Description
Critical | Active Controller Count != 1 for more than 30 minutes | We detected a persistent abnormal state for Active Controller Count in this cluster.
Warning | Active Controller Count != 1 for 15 minutes | We detected an abnormal state for Active Controller Count in the current cluster.
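
The duration-based thresholds can be sketched as a check on how long the count has been continuously abnormal. This is only an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be one Active Controller Count sample per minute (oldest first), and the abnormal period is assumed to be the trailing unbroken run of samples where the count is not 1. How Health+ actually samples and windows this metric is not described here.

```python
from typing import Optional, Sequence


def abnormal_minutes(counts_per_minute: Sequence[int]) -> int:
    """Length of the trailing run of samples where the count is not 1."""
    minutes = 0
    for count in reversed(counts_per_minute):
        if count == 1:
            break  # The most recent healthy sample ends the run.
        minutes += 1
    return minutes


def controller_severity(counts_per_minute: Sequence[int]) -> Optional[str]:
    """Classify the current state from per-minute controller counts."""
    minutes = abnormal_minutes(counts_per_minute)
    if minutes > 30:
        return "Critical"
    if minutes >= 15:
        return "Warning"
    return None
```

Both a count of 0 (no active controller) and a count of 2 or more (split brain) are treated as abnormal, matching the `!= 1` condition.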

Connector Is Degraded

One or more connector tasks have failed.

For a given connector, connector_failed_task_count is greater than 0 but is not equal to connector_total_task_count.

Severity level | Severity threshold | Description
Warning | Some Connector tasks have failed, but fewer than connector_total_task_count. | The connector is in a degraded state with failed tasks.

Connector Is Failed

All connector tasks have failed.

For a given connector, connector_failed_task_count is equal to connector_total_task_count.

Severity level | Severity threshold | Description
Critical | All Connector tasks have failed: connector_failed_task_count > 0 and is equal to connector_total_task_count. | The connector is in a failed state.
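
The Connector Is Degraded and Connector Is Failed conditions differ only in whether some or all tasks have failed. A minimal sketch of that distinction follows; the function name `connector_alert` is hypothetical, and its two arguments stand in for the connector_failed_task_count and connector_total_task_count metrics.

```python
from typing import Optional, Tuple


def connector_alert(
    failed_task_count: int, total_task_count: int
) -> Optional[Tuple[str, str]]:
    """Return (alert name, severity) for a connector, or None if healthy."""
    if failed_task_count <= 0:
        return None  # No failed tasks: neither alert applies.
    if failed_task_count == total_task_count:
        return ("Connector Is Failed", "Critical")
    # Some, but not all, tasks have failed.
    return ("Connector Is Degraded", "Warning")
```

For a connector with four tasks, one failed task yields the Degraded warning and four failed tasks yield the Failed critical alert.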

ksqlDB Error Queries

For a given ksqlDB engine, the number of queries generating errors is greater than 0 for longer than 1 minute.

Severity level | Severity threshold | Description
Critical | error_queries > 0 | There are queries in error state in ksqlDB.

Offline Partitions

An offline partition is a partition that doesn’t have an active leader and therefore isn’t writable or readable. The presence of offline partitions compromises the data availability of the cluster.

Severity level | Severity threshold | Description
Critical | Offline Partitions > 0 | The presence of offline partitions compromises the data availability of the cluster.

Unclean Leader Elections

An unclean leader election is a special case in which no available replicas are in sync. Because each topic must have a leader, an election is held among the out-of-sync replicas, and a leader is chosen, meaning any messages that weren’t synced prior to the loss of the former leader are lost forever. Essentially, unclean leader elections sacrifice consistency for availability.

Severity level | Severity threshold | Description
Critical | Unclean Leader Elections > 0 | Unclean leader elections sacrifice consistency for availability.

Under Replicated Partitions

Under replicated partitions can happen when a broker is down or can’t replicate fast enough from the leader (replica fetcher lag).

Severity level | Severity threshold | Description
Critical | Under Replicated Partitions > 0 for all data points in surveyed window | We’ve detected under replicated partitions on your cluster for an extended period.
Warning | Under Replicated Partitions > 0 for some data points in surveyed window | We’ve detected under replicated partitions on your cluster for a limited period.
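
The difference between "all data points" and "some data points" can be sketched as follows. The function name `urp_severity` is hypothetical, and the window is assumed to be a list of Under Replicated Partitions samples supplied by the caller; the length of the surveyed window that Health+ uses is not specified here.

```python
from typing import Optional, Sequence


def urp_severity(window: Sequence[int]) -> Optional[str]:
    """Classify a window of Under Replicated Partitions samples."""
    if not window:
        return None  # No data points: nothing to evaluate.
    if all(value > 0 for value in window):
        return "Critical"  # Under-replicated for the entire window.
    if any(value > 0 for value in window):
        return "Warning"  # Under-replicated for only part of the window.
    return None
```

A window such as `[0, 2, 2, 0]` maps to Warning, because replication recovered within the window, while `[1, 2, 2, 3]` maps to Critical.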

Under Min In-Sync Replicas

If a partition has fewer in-sync replicas than min.insync.replicas, clients can’t produce to it and receive a NotEnoughReplicas exception. A producer may retry depending on its configuration. While a partition is under the min ISR, data production to it is blocked.

Severity level | Severity threshold | Description
Warning | Under Min In Sync Replicas > 0 | We’ve detected a partition under the min ISR, and data production is blocked.
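
To make the condition concrete, the following sketch counts partitions whose in-sync replica set is smaller than min.insync.replicas. The function name `under_min_isr_count` and the dictionary of ISR sizes keyed by partition name are assumptions for illustration; Health+ reads this value from broker metrics rather than computing it this way.

```python
from typing import Mapping


def under_min_isr_count(
    isr_sizes: Mapping[str, int], min_insync_replicas: int
) -> int:
    """Count partitions with fewer in-sync replicas than the configured minimum."""
    return sum(1 for size in isr_sizes.values() if size < min_insync_replicas)


# Example: replication factor 3 with min.insync.replicas set to 2.
example_isr_sizes = {"orders-0": 3, "orders-1": 2, "orders-2": 1}
example_count = under_min_isr_count(example_isr_sizes, min_insync_replicas=2)
```

In this example only `orders-2` is under the min ISR, so the count is greater than 0 and the alert condition is met.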