Health+ Intelligent Alerts¶
Health+ actively monitors the performance and configuration of your Confluent Platform services. Intelligent Alerts notify you of potential environmental issues and unfavorable operational conditions before they become critical problems that lead to costly downtime or a service outage.
Health+ provides a set of commonly used alerts with the free tier, and an extended set of alerts with the premium tier.
Set up notifications for your Confluent Platform deployment on the Intelligent alerts page in Confluent Cloud Console.
Tip
If you have difficulties resolving an issue that’s reported by a Health+ alert, contact Confluent Support for further help.
Alert severity levels¶
The following table describes the supported Health+ alert severity levels. Not every monitored metric has all alert levels.
State | Summary | Description |
---|---|---|
Critical | Highly recommended actions | Issues are present that may limit or prevent data from moving across your cluster. We recommend that you address these as a priority. |
Warning | Potential future issues | These metrics are close to exceeding their normal operational range and may cause future issues. We recommend you review these metrics and their recommended actions. |
Info | Informational events | Informational events on the normal operation of your cluster. We recommend you review these. |
Free Tier alerts¶
Active Controller Count¶
The Controller is responsible for maintaining the list of partition leaders and coordinating leadership transitions (for example, when a topic is created or a broker fails).
In normal operation, a new Controller should be automatically selected if the current one becomes unavailable. Issues occur when this does not happen or more than one controller is active.
- If the active controller count is less than 1, producers and consumers can no longer look up partition leaders.
- If the active controller count is greater than 1, a split-brain situation occurs and may cause serious data integrity issues.
Severity level | Severity threshold | Description |
---|---|---|
Critical | Active Controller Count != 1 for more than 30 minutes | We detected a persistent abnormal state for Active Controller Count in this cluster. |
Warning | Active Controller Count != 1 for 15 minutes | We detected an abnormal state for Active Controller Count in the current cluster. |
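If you also monitor this metric yourself, the following sketch shows one way to apply the same thresholds to timestamped samples of the cluster-wide active controller count (for example, the ActiveControllerCount broker metric summed across all brokers). The sampling source and function name are assumptions for illustration, not part of Health+.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def evaluate_active_controller_count(
    samples: List[Tuple[datetime, int]],
    now: Optional[datetime] = None,
) -> Optional[str]:
    """Apply the thresholds above to (timestamp, active controller count) samples.

    Returns "Critical", "Warning", or None.
    """
    now = now or datetime.utcnow()

    def abnormal_for(minutes: int) -> bool:
        window = [v for t, v in samples if t >= now - timedelta(minutes=minutes)]
        # Abnormal means every sample in the window differs from exactly 1.
        return bool(window) and all(v != 1 for v in window)

    if abnormal_for(30):
        return "Critical"   # != 1 for more than 30 minutes
    if abnormal_for(15):
        return "Warning"    # != 1 for 15 minutes
    return None
```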
Connector Is Degraded¶
One or more connector tasks have failed.
For a given connector, connector_failed_task_count is greater than 0 but is not equal to connector_total_task_count.
Severity level | Severity threshold | Description |
---|---|---|
Warning | Some connector tasks have failed, but fewer than connector_total_task_count. | The connector is in a degraded state with failed tasks. |
Connector Is Failed¶
All connector tasks have failed.
For a given connector, connector_failed_task_count is equal to connector_total_task_count.
Severity level | Severity threshold | Description |
---|---|---|
Critical | Connector tasks have failed, and connector_failed_task_count > 0. | The connector is in a failed state. |
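As a rough illustration of both the degraded and the failed conditions above, the sketch below checks a connector through the Kafka Connect REST API status endpoint and counts failed tasks against the total. The Connect URL and connector name are placeholders; Health+ evaluates connector_failed_task_count and connector_total_task_count for you.

```python
import requests


def connector_task_health(connect_url: str, name: str) -> str:
    """Classify a connector as 'failed', 'degraded', or 'healthy' from the
    Kafka Connect REST API /connectors/<name>/status endpoint."""
    status = requests.get(f"{connect_url}/connectors/{name}/status", timeout=10).json()
    tasks = status.get("tasks", [])
    total = len(tasks)                                         # connector_total_task_count
    failed = sum(1 for t in tasks if t["state"] == "FAILED")   # connector_failed_task_count

    if total and failed == total:
        return "failed"      # Critical: all tasks have failed
    if failed > 0:
        return "degraded"    # Warning: some, but not all, tasks have failed
    return "healthy"


# Example (URL and connector name are placeholders):
# print(connector_task_health("http://localhost:8083", "my-connector"))
```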
Consumer Lag¶
Having consumer lag on your topics is normal, but sizable increases in lag can mean it’s taking longer for your consumer clients to retrieve your messages.
Check your producer throughput metrics to ensure that consumer lag is within your expected operating SLAs.
Severity level | Severity threshold | Description |
---|---|---|
Warning | Consumer lag increased by 75% | Average consumer lag in the last 15 minutes is > 75% of average consumer lag over the last 24 hours. |
Info | Consumer lag increased by 50% | Average consumer lag in the last 15 minutes is > 50% of average consumer lag over the last 24 hours. |
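If you want to spot-check lag outside of Health+, the following sketch computes per-partition lag with the confluent-kafka Python client as the difference between the partition’s high watermark and the group’s committed offset. The bootstrap servers, group ID, and topic are placeholders.

```python
from confluent_kafka import Consumer, TopicPartition

# Placeholders -- replace with your own bootstrap servers, group, and topic.
conf = {"bootstrap.servers": "localhost:9092", "group.id": "my-consumer-group"}
topic = "my-topic"

consumer = Consumer(conf)
partitions = consumer.list_topics(topic, timeout=10).topics[topic].partitions

total_lag = 0
for p in partitions:
    tp = TopicPartition(topic, p)
    committed = consumer.committed([tp], timeout=10)[0].offset
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    if committed >= 0:                      # skip partitions with no committed offset
        lag = high - committed
        total_lag += lag
        print(f"partition {p}: lag={lag}")

print(f"total lag for group: {total_lag}")
consumer.close()
```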
ksqlDB Error Queries¶
For a given ksqlDB engine, the number of queries generating errors is greater than 0 for longer than 1 minute.
Severity level | Severity threshold | Description |
---|---|---|
Critical | error_queries > 0 | There are queries in an error state in ksqlDB. |
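As a rough self-check outside of Health+, the sketch below asks a ksqlDB server for its queries over the REST API and counts those reporting an error state. The endpoint URL is a placeholder, and the response field names vary between ksqlDB versions, so treat them as assumptions.

```python
import requests

# Placeholder endpoint -- point this at your ksqlDB server.
KSQLDB_URL = "http://localhost:8088"

resp = requests.post(
    f"{KSQLDB_URL}/ksql",
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
    json={"ksql": "SHOW QUERIES;", "streamsProperties": {}},
    timeout=10,
)
resp.raise_for_status()

# The "queries" and "state" field names reflect recent ksqlDB versions and are
# assumptions here; adjust them to match the payload your server returns.
error_queries = [
    q
    for entry in resp.json()
    for q in entry.get("queries", [])
    if "ERROR" in str(q.get("state", ""))
]
print(f"queries in error state: {len(error_queries)}")
```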
Offline Partitions¶
An offline partition is a partition that doesn’t have an active leader and therefore isn’t writable or readable. The presence of offline partitions compromises the data availability of the cluster.
Severity level | Severity threshold | Description |
---|---|---|
Critical | Offline Partitions > 0 | The presence of offline partitions compromises the data availability of the cluster. |
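To see which partitions are affected, you can inspect cluster metadata yourself; in the confluent-kafka Python client, a partition with no active leader reports a leader of -1. This is a minimal sketch with a placeholder bootstrap server, not part of Health+.

```python
from confluent_kafka.admin import AdminClient

# Placeholder bootstrap servers.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})
metadata = admin.list_topics(timeout=10)

offline = [
    (topic, p.id)
    for topic, tmeta in metadata.topics.items()
    for p in tmeta.partitions.values()
    if p.leader == -1          # no active leader: the partition is offline
]

for topic, partition in offline:
    print(f"offline partition: {topic}-{partition}")
print(f"offline partition count: {len(offline)}")
```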
Unclean Leader Elections¶
An unclean leader election is a special case in which no available replicas are in sync. Because each partition must have a leader, an election is held among the out-of-sync replicas and a leader is chosen. Any messages that weren’t synced before the loss of the former leader are lost forever. Essentially, unclean leader elections sacrifice consistency for availability.
Severity level | Severity threshold | Description |
---|---|---|
Critical | Unclean Leader Elections > 0 | Unclean leader elections sacrifice consistency for availability. |
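Whether an unclean election can happen at all is governed by the unclean.leader.election.enable setting. The following sketch reads a topic’s effective value with the confluent-kafka Python admin client; the bootstrap servers and topic name are placeholders.

```python
from confluent_kafka.admin import AdminClient, ConfigResource

# Placeholders -- replace with your bootstrap servers and topic name.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})
resource = ConfigResource(ConfigResource.Type.TOPIC, "my-topic")

for res, future in admin.describe_configs([resource]).items():
    configs = future.result()
    entry = configs["unclean.leader.election.enable"]
    # "false" (the default) keeps the partition offline instead of electing an
    # out-of-sync replica and silently losing unsynced messages.
    print(f"{res}: unclean.leader.election.enable = {entry.value}")
```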
Under Replicated Partitions¶
Under replicated partitions can happen when a broker is down or can’t replicate fast enough from the leader (replica fetcher lag).
Severity level | Severity threshold | Description |
---|---|---|
Critical | Under Replicated Partitions > 0 for all data points in surveyed window | We’ve detected under replicated partitions on your cluster for an extended period. |
Warning | Under Replicated Partitions > 0 for some data points in surveyed window | We’ve detected under replicated partitions on your cluster for a limited period. |
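As a rough illustration of the “all data points” versus “some data points” distinction, the sketch below applies those rules to a list of under replicated partition counts sampled over the surveyed window (for example, from the UnderReplicatedPartitions broker metric). The sampling interval and window are assumptions for the example.

```python
from typing import List


def evaluate_under_replicated(samples: List[int]) -> str:
    """Apply the thresholds above to under-replicated-partition counts
    sampled over the surveyed window (for example, one sample per minute)."""
    if not samples:
        return "ok"
    if all(v > 0 for v in samples):
        return "Critical"   # under replicated for the entire window
    if any(v > 0 for v in samples):
        return "Warning"    # under replicated for only part of the window
    return "ok"


# Example: under replicated at some, but not all, data points -> "Warning"
print(evaluate_under_replicated([0, 3, 2, 0, 0]))
```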
Under Min In-Sync Replicas¶
If your partitions have fewer in-sync replicas than min.insync.replicas, clients producing with acks=all can’t write to them and receive a NotEnoughReplicas exception. A producer may retry depending on its configuration. When you have a partition under the min ISR, data production is blocked.
Severity level | Severity threshold | Description |
---|---|---|
Critical | Under Min In Sync Replicas > 0 | We’ve detected a partition under the min ISR, and data production is blocked. |
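If you want to confirm which partitions are under the minimum ISR, the following sketch compares each partition’s in-sync replica count against the topic’s effective min.insync.replicas using the confluent-kafka Python admin client. The bootstrap servers and topic name are placeholders.

```python
from confluent_kafka.admin import AdminClient, ConfigResource

# Placeholders -- replace with your bootstrap servers and topic.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})
topic = "my-topic"

# Effective min.insync.replicas for the topic (the broker default applies
# when there is no topic-level override).
resource = ConfigResource(ConfigResource.Type.TOPIC, topic)
configs = admin.describe_configs([resource])[resource].result()
min_isr = int(configs["min.insync.replicas"].value)

partitions = admin.list_topics(topic, timeout=10).topics[topic].partitions
under_min_isr = [p.id for p in partitions.values() if len(p.isrs) < min_isr]

for pid in under_min_isr:
    print(f"{topic}-{pid} is under min ISR; produce requests with acks=all will fail")
print(f"partitions under min ISR: {len(under_min_isr)}")
```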