Health+ Intelligent Alerts for Confluent Platform¶

Health+ actively monitors the performance and configuration of your Confluent Platform services. Intelligent Alerts notify you of unfavorable operational conditions and potential environmental issues before they become critical problems that lead to costly downtime or service outages.

When metrics that trigger alerts go back to a normal state, Health+ sends a notification to subscribed channels that the alerts have been resolved.

Important

If you are operating your cluster in KRaft mode, controllers are currently reported as brokers, and alerts may not function as expected. For more information, see KRaft limitations and known issues.

Health+ provides a set of commonly used alerts with the free tier, and an extended set of alerts with the premium tier.

Set up notifications for your Confluent Platform deployment on the Manage Notifications page in Confluent Cloud Console.

If you have difficulties resolving an issue that’s reported by a Health+ alert, contact Confluent Support for further help.

Alert severity levels¶

The following table describes the supported Health+ alert severity levels. Not every monitored metric has all alert levels.

State	Summary	Description
Critical	Highly recommended actions	Issues are present that may limit or prevent data from moving across your cluster. We recommend you address these with priority.
Warning	Potential future issues	These metrics are close to exceeding their normal operational range and may cause future issues. We recommend you review these metrics and their recommended actions.
Info	Informational events	Informational events on the normal operation of your cluster. We recommend you review these.

Premium Tier alerts¶

Disk Usage¶

Alerts when the used space on a disk volume, as a percentage of the volume's total bytes, crosses one of the specified thresholds.

Severity level	Severity threshold	Description
Critical	> 90% volume usage	Available disk volume is nearly exhausted.
Warning	> 70% volume usage	Volume usage is high.
Info	> 50% volume usage	Greater than 50% volume usage is still considered within normal operating range.
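The thresholds in the table form a simple ladder. The following Python sketch is illustrative only: the function name `disk_usage_severity` and its inputs are hypothetical and are not part of Health+. It shows how a used-bytes reading could be mapped to a severity level under the thresholds listed above.

```python
from typing import Optional


def disk_usage_severity(used_bytes: int, total_bytes: int) -> Optional[str]:
    """Map a volume's usage to a severity level (hypothetical helper)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    usage_pct = 100.0 * used_bytes / total_bytes
    # Check the highest threshold first so the most severe level wins.
    if usage_pct > 90:
        return "Critical"
    if usage_pct > 70:
        return "Warning"
    if usage_pct > 50:
        return "Info"
    return None  # below every threshold: no alert


print(disk_usage_severity(750, 1000))  # 75% usage, prints "Warning"
```

The same ladder shape applies to the other alerts on this page that use 50%, 70%, and 90% usage thresholds.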

Fetch Request Latency¶

Some fetch request latency on your brokers is normal, but a sizable increase in latency can mean your consumer clients are taking longer to retrieve messages.

Severity level	Severity threshold	Description
Warning	Average fetch request latency in the last 15 minutes is >50% above the average broker latency for the last 24 hours.	We’ve detected a 50% increase in fetch request latency on one or more of your brokers.
Info	Average fetch request latency in the last 15 minutes is >25% above the average broker latency for the last 24 hours.	We’ve detected a 25% increase in fetch request latency on one or more of your brokers.
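The thresholds above compare a short recent window against a longer baseline. As a rough sketch of that comparison, the Python below computes the percentage increase of a 15-minute average over a 24-hour average. The function name and the sample lists are hypothetical, and how Health+ actually samples and aggregates latency is not described here.

```python
from statistics import mean
from typing import Optional, Sequence


def latency_increase_severity(
    last_15m_ms: Sequence[float], last_24h_ms: Sequence[float]
) -> Optional[str]:
    """Compare the recent average latency with the 24-hour baseline."""
    baseline = mean(last_24h_ms)
    recent = mean(last_15m_ms)
    if baseline <= 0:
        return None  # no meaningful baseline to compare against
    increase_pct = 100.0 * (recent - baseline) / baseline
    if increase_pct > 50:
        return "Warning"
    if increase_pct > 25:
        return "Info"
    return None


# Hypothetical samples: recent average 16 ms against a 10 ms baseline.
print(latency_increase_severity([16.0, 16.0], [10.0, 10.0]))  # prints "Warning"
```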

Fetch Follower Request Latency¶

Severity level	Severity threshold	Description
Warning	Average fetch follower request latency in the last 15 minutes is >50% above the average broker latency for the last 24 hours.	We’ve detected a 50% increase in fetch follower request latency on one or more of your brokers.
Info	Average fetch follower request latency in the last 15 minutes is >25% above the average broker latency for the last 24 hours.	We’ve detected a 25% increase in fetch follower request latency on one or more of your brokers.

Network Processor Pool Usage¶

The network processor threads are responsible for reading and writing data to and from clients on the network.

Severity level	Severity threshold	Description
Critical	> 90% pool usage	At greater than 90% usage, the network processor pool is overloaded and client requests may take longer to complete.
Warning	> 70% pool usage	Greater than 70% usage is still considered within normal operating range for network processor pool usage, as threads are not computationally intensive.
Info	> 50% pool usage	Greater than 50% usage is still considered within normal operating range for the network processor pool.

Produce Request Latency¶

Some produce request latency on your brokers is normal, but sizable increases in latency can lead to longer wait times for your consumers to retrieve your messages.

Severity level	Severity threshold	Description
Warning	Average produce request latency in the last 15 minutes is >50% above the average broker latency for the last 24 hours.	We’ve detected a 50% increase in produce request latency on one or more of your brokers.
Info	Average produce request latency in the last 15 minutes is >25% above the average broker latency for the last 24 hours.	We’ve detected a 25% increase in produce request latency on one or more of your brokers.

Request Handler Pool Usage¶

The request handler threads are responsible for serving requests to and from clients, including reading and writing to disk.

Severity level	Severity threshold	Description
Critical	> 90% pool usage	At greater than 90% usage, the request handler is overloaded and client requests may take longer to complete.
Warning	> 70% pool usage	At greater than 70% usage, the request handler is approaching the overloaded range for this broker.
Info	> 50% pool usage	At greater than 50% usage the request handler is still considered within normal operating range.

Free Tier alerts¶

Active Controller Count¶

The Controller is responsible for maintaining the list of partition leaders and coordinating leadership transitions (for example, during topic creation).

In normal operation, a new Controller should be automatically selected if the current one becomes unavailable. Issues occur when this does not happen or when more than one Controller is active.

If the active controller count is less than 1, producers and consumers can’t get the partition leaders anymore.
If the active controller count is greater than 1, a split-brain situation occurs and may cause serious data integrity issues.

Severity level	Severity threshold	Description
Critical	Active Controller Count != 1 for more than 30 minutes	We detected a persistent abnormal state for Active Controller Count in this cluster.
Warning	Active Controller Count != 1 for 15 minutes	We detected an abnormal state for Active Controller Count in the current cluster.
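This alert depends on both the value and how long it has persisted. A minimal sketch of that rule follows. The function name is hypothetical, and treating "for 15 minutes" as "15 minutes or longer" is an assumption about the table, not documented behavior.

```python
from typing import Optional


def active_controller_severity(count: int, abnormal_minutes: float) -> Optional[str]:
    """Classify an Active Controller Count reading by how long it has been abnormal."""
    if count == 1:
        return None  # exactly one active controller is the healthy state
    if abnormal_minutes > 30:
        return "Critical"
    if abnormal_minutes >= 15:  # assumption: the warning starts at the 15-minute mark
        return "Warning"
    return None  # abnormal, but not yet long enough to alert


print(active_controller_severity(0, 20))  # prints "Warning"
print(active_controller_severity(2, 45))  # prints "Critical"
```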

Connector Is Degraded¶

One or more connector tasks have failed.

For a given connector, connector_failed_task_count is greater than 0 but is not equal to connector_total_task_count.

Severity level	Severity threshold	Description
Warning	Some Connector tasks have failed, but fewer than `connector_total_task_count`.	The connector is in a degraded state with failed tasks.

Connector Is Failed¶

All connector tasks have failed.

For a given connector, connector_failed_task_count is equal to connector_total_task_count.

Severity level	Severity threshold	Description
Critical	All Connector tasks have failed: `connector_failed_task_count` equals `connector_total_task_count`.	The connector is in a failed state.
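The degraded and failed conditions differ only in how the failed task count compares with the total. The sketch below restates both rules in Python. The function name and the "Healthy" label are hypothetical; only the two metric names come from this page.

```python
def connector_state(connector_failed_task_count: int, connector_total_task_count: int) -> str:
    """Classify a connector from its failed and total task counts."""
    failed, total = connector_failed_task_count, connector_total_task_count
    if failed < 0 or total < 0 or failed > total:
        raise ValueError("task counts are inconsistent")
    if failed == 0:
        return "Healthy"  # hypothetical label for the no-alert case
    if failed == total:
        return "Failed"  # every task has failed: Critical alert
    return "Degraded"  # some, but not all, tasks have failed: Warning alert


print(connector_state(1, 4))  # prints "Degraded"
print(connector_state(4, 4))  # prints "Failed"
```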

ksqlDB Error Queries¶

For a given ksqlDB engine, the number of queries generating errors is greater than 0 for longer than 1 minute.

Severity level	Severity threshold	Description
Critical	`error_queries` > 0	There are queries in error state in ksqlDB.

Offline Partitions¶

An offline partition is a partition that doesn’t have an active leader and therefore isn’t writable or readable. The presence of offline partitions compromises the data availability of the cluster.

Severity level	Severity threshold	Description
Critical	Offline Partitions > 0	The presence of offline partitions compromises the data availability of the cluster.

Unclean Leader Elections¶

An unclean leader election is a special case in which no available replicas are in sync. Because each partition must have a leader, an election is held among the out-of-sync replicas, and a leader is chosen, meaning any messages that weren’t synced prior to the loss of the former leader are lost forever. Essentially, unclean leader elections sacrifice consistency for availability.

Severity level	Severity threshold	Description
Critical	Unclean Leader Elections > 0	Unclean leader elections sacrifice consistency for availability.

Under Replicated Partitions¶

Under replicated partitions can happen when a broker is down or can’t replicate fast enough from the leader (replica fetcher lag).

Severity level	Severity threshold	Description
Critical	Under Replicated Partitions > 0 for all data points in surveyed window	We’ve detected under replicated partitions on your cluster for an extended period.
Warning	Under Replicated Partitions > 0 for some data points in surveyed window	We’ve detected under replicated partitions on your cluster for a limited period.
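The two levels distinguish a condition that holds across the whole surveyed window from one that appears only in part of it. A minimal sketch, assuming the window is available as a list of per-interval Under Replicated Partitions counts (the function name and input shape are hypothetical):

```python
from typing import Optional, Sequence


def under_replicated_severity(window: Sequence[int]) -> Optional[str]:
    """Classify a surveyed window of Under Replicated Partitions counts."""
    if not window:
        return None  # no data points, nothing to evaluate
    flagged = [count > 0 for count in window]
    if all(flagged):
        return "Critical"  # under replicated at every data point
    if any(flagged):
        return "Warning"  # under replicated at some data points only
    return None


print(under_replicated_severity([0, 3, 0, 1]))  # prints "Warning"
print(under_replicated_severity([2, 3, 1, 1]))  # prints "Critical"
```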

Under Min In-Sync Replicas¶

If a partition has fewer in-sync replicas than min.insync.replicas, producers that use acks=all can’t write to it and receive a NotEnoughReplicas exception. A producer may retry depending on its configuration, but data production to that partition stays blocked while it is under the min ISR.

Severity level	Severity threshold	Description
Warning	Under Min In Sync Replicas > 0	We’ve detected a partition under the min ISR, and data production is blocked.
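The condition behind this alert can be sketched as a comparison between a partition's in-sync replica count and `min.insync.replicas`. The helper below is a hypothetical model of that check, not broker code. It reflects the standard Kafka behavior that the minimum is enforced only for producers using `acks=all`.

```python
def produce_accepted(isr_size: int, min_insync_replicas: int, acks: str = "all") -> bool:
    """Model whether a produce request to a partition would be accepted."""
    if acks != "all":
        # min.insync.replicas is only enforced for acks=all producers.
        return True
    return isr_size >= min_insync_replicas


# With min.insync.replicas=2 and only one in-sync replica,
# an acks=all producer is rejected.
print(produce_accepted(isr_size=1, min_insync_replicas=2))  # prints "False"
```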