System Health (deprecated view)

System Health provides insight into the health of the Apache Kafka® cluster from both a broker and topic-centric perspective.

Note

This view is deprecated and is not available by default in Confluent Platform 5.3 and later versions. To re-enable it, set confluent.controlcenter.deprecated.views.enable=true in the appropriate control-center.properties file. The legacy views cannot be enforced with RBAC, so RBAC cannot be enabled in Control Center while they are reinstated. The information presented in this view is also available, with RBAC enforced, in various locations of the current redesigned GUI.

UI Commonalities

Chart Tooltips
Each chart displays a similarly styled tooltip when you hover over it. A tooltip can display multiple metrics at the same time, each paired with an icon. At each point, the icon is either a check mark (good) or an X symbol (bad).
Table Metric Validations
Selected table metrics are marked with a red underline to indicate potential issues. Hovering the mouse over text with a red underline displays an explanatory tooltip.

Broker Aggregate Metrics

Broker count
Total number of brokers in the cluster currently online and with Confluent Metrics Reporter enabled. If the broker count is less than the number of brokers in your cluster and you have under replicated topic partitions, then investigate the brokers that are missing from the broker table. Review the broker logs for WARN or ERROR messages.
ZooKeeper disconnected
At least one broker has disconnected from ZooKeeper in the last interval. If this is Yes, check network connectivity and latency between brokers and ZooKeeper, and verify ZooKeeper is running. Adjust the broker configuration parameter zookeeper.session.timeout.ms if needed.
Active controllers

The number of brokers in the cluster reporting as the active controller in the last interval.

During steady state there should be only one active controller per cluster. If this is greater than 1 for only one minute, then it probably means the active controller switched from one broker to another. If this persists for more than one minute, troubleshoot the cluster for "split brain".
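
The rule above can be sketched as a small check over per-minute readings. This is an illustration only: the function name, the labels it returns, and the handling of a zero reading are assumptions and are not part of Control Center.

```python
from typing import Sequence


def classify_active_controllers(counts_per_minute: Sequence[int]) -> str:
    """Classify the most recent active-controller readings.

    counts_per_minute holds one reading per one-minute interval, oldest first.
    Returns "ok", "controller_switch", "possible_split_brain" or "no_controller".
    """
    if not counts_per_minute:
        raise ValueError("at least one reading is required")

    latest = counts_per_minute[-1]
    if latest == 1:
        return "ok"
    if latest == 0:
        # Assumption: a zero reading is not covered by the text above,
        # so it is reported under its own label.
        return "no_controller"

    # Count the consecutive trailing intervals that reported more than one controller.
    streak = 0
    for count in reversed(counts_per_minute):
        if count > 1:
            streak += 1
        else:
            break

    # A single interval above 1 is consistent with a controller hand-off;
    # a longer streak is the case the text says to troubleshoot.
    return "controller_switch" if streak == 1 else "possible_split_brain"
```

For example, `classify_active_controllers([1, 2])` returns `"controller_switch"`, while `classify_active_controllers([1, 2, 2])` returns `"possible_split_brain"`.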

Unclean elections

The number of unclean partition leader elections in the cluster reported in the last interval.

When an unclean leader election is held among out-of-sync replicas, data can be lost if any messages were not synced before the former leader was lost. If the number of unclean elections is greater than 0, investigate broker logs to determine why leaders were re-elected, and look for WARN or ERROR messages. Consider setting the broker configuration parameter unclean.leader.election.enable to false so that a replica outside of the set of in-sync replicas is never elected leader.

Network pool usage
Average network pool capacity usage across all brokers, i.e. the fraction of time the network processor threads are not idle. If network pool usage is above 70%, investigate the production request latency and fetch request latency metrics to isolate where brokers are spending the most time. Consider increasing the broker configuration parameter num.network.threads.
Request pool usage
Average request handler capacity usage across all brokers, i.e. the fraction of time the request handler threads are not idle. If request pool usage is above 70%, investigate the production request latency and fetch request latency metrics to isolate where brokers are spending the most time. Consider increasing the broker configuration parameter num.io.threads.
Disk usage
Disk usage distribution across all brokers in a cluster. Disk usage is determined to be skewed if the relative mean absolute difference of all broker sizes exceeds 10% and the difference between any two brokers is at least 1GB. If disk usage is skewed, consider rebalancing the cluster using the Confluent Auto Data Balancer.
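
As a rough illustration of the skew rule above, the following Python sketch computes the relative mean absolute difference as the mean of |a - b| over all distinct broker pairs, divided by the mean broker size. The exact formula Control Center uses is not stated here, so that definition, the reading of "any two brokers" as the largest pairwise gap, the treatment of 1GB as 2**30 bytes, and all function names are assumptions.

```python
from itertools import combinations
from typing import Sequence

GIB = 1024 ** 3  # assumption: "1GB" is treated as 2**30 bytes


def relative_mean_absolute_difference(sizes: Sequence[float]) -> float:
    """Mean absolute difference over all distinct pairs, divided by the mean size."""
    if len(sizes) < 2:
        return 0.0
    mean = sum(sizes) / len(sizes)
    if mean == 0:
        return 0.0
    pairs = list(combinations(sizes, 2))
    mean_abs_diff = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return mean_abs_diff / mean


def is_disk_usage_skewed(
    sizes: Sequence[float],
    rel_threshold: float = 0.10,   # the 10% threshold from the text
    min_gap_bytes: float = GIB,    # the 1GB threshold from the text
) -> bool:
    """Apply both conditions: relative spread above 10% and a gap of at least 1GB."""
    if len(sizes) < 2:
        return False
    # Assumption: the largest gap between any two brokers is max - min.
    largest_gap = max(sizes) - min(sizes)
    return (
        relative_mean_absolute_difference(sizes) > rel_threshold
        and largest_gap >= min_gap_bytes
    )
```

Both conditions must hold. Two brokers holding 1 MiB and 2 MiB differ widely in relative terms but are not flagged, because the absolute gap is far below the 1GB threshold.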
Online topic partitions
Total number of topic partitions in the cluster that are online, i.e. on running brokers. If the number of topic partitions online is less than the number of topic partitions in your cluster, investigate topic metrics shown in the Topics tab.
Under replicated topic partitions

Total number of topic partitions in the cluster that are under-replicated; i.e., partitions whose number of in-sync replicas is less than the replication factor.

If the number of under replicated topic partitions is greater than 0, investigate topic metrics shown in the Topics tab and investigate the broker that is missing topic partitions. Review the broker logs for WARN or ERROR messages.

Offline topic partitions

Total number of topic partitions in the cluster that are offline. This can happen if the brokers with replicas are down, or if unclean leader election is disabled and no replica is in sync, so that none can be elected leader (which may be desirable to ensure no messages are lost).

If the number of offline topic partitions is greater than 0, investigate topic metrics shown in the Topics tab and investigate the broker that is missing topic partitions. Review the broker logs for WARN or ERROR messages.
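
The online, under-replicated, and offline counts described above can be illustrated with a short Python sketch. It assumes a hypothetical `PartitionState` record; a partition is treated as offline when it has no elected leader, and offline partitions are not also counted as under-replicated. Both choices are assumptions made for the example, not a description of how Control Center computes its totals.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one topic partition."""
    topic: str
    partition: int
    replication_factor: int
    in_sync_replicas: int
    leader: Optional[int]  # broker ID of the leader, or None if none is elected


def summarize_partitions(partitions: Iterable[PartitionState]) -> Dict[str, int]:
    """Count online, under-replicated, and offline partitions."""
    summary = {"online": 0, "under_replicated": 0, "offline": 0}
    for p in partitions:
        if p.leader is None:
            # Assumption: no leader means the partition is offline.
            summary["offline"] += 1
            continue
        summary["online"] += 1
        # Under-replicated: fewer in-sync replicas than the replication factor.
        if p.in_sync_replicas < p.replication_factor:
            summary["under_replicated"] += 1
    return summary
```

A non-zero `under_replicated` or `offline` count is the signal to look at the Topics tab and the broker logs, as described above.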

Produce and Fetch Charts

When clients write data to Kafka, they send produce requests to leader brokers which handle the produce requests and send responses back to the client. When clients read data from Kafka, they send fetch requests to leader brokers which handle the fetch requests and send responses back to the client. It is important to monitor the throughput and latency of these produce and fetch requests to ensure your cluster is performing optimally.

On each of the broker and topic tabs of the System Health view, you can see a summary of the produce and fetch requests.

The left side shows the produce metrics, and the right side shows the fetch metrics.

In either the Brokers or Topics tab, you can hover the mouse cursor over an individual row of the broker or topic table to overlay the request statistics for that individual broker or topic in the chart.

Hover over the top chart to see the request throughput for any given interval:

Bytes produced/fetched per sec
Total number of bytes per second of data produced to or fetched from this cluster.
Successful requests
Total number of successful requests produced to or fetched from the cluster in a one minute interval.
Failed requests
Total number of failed requests produced to or fetched from the cluster in a one minute interval.

Broker Metrics Table

ID
ID for this broker.
Throughput

Bytes in – Number of bytes per second produced to this broker.

Bytes out – Number of bytes per second fetched from this broker (does not account for internal replication traffic).

Latency (produce)
Latency of produce requests to this broker at the median, 95th, 99th, or 99.9th percentile (in milliseconds).
Latency (fetch)
Latency of fetch requests to this broker at the median, 95th, 99th, or 99.9th percentile (in milliseconds).
Partition replicas
Total number of partition replicas served by this broker.
Segment
Total size in bytes of the log segments served by this broker (excluding index size).
Rack
Rack ID for this broker.
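
To make the latency columns concrete, the sketch below derives the median, 95th, 99th, and 99.9th percentiles from a list of request latencies using the nearest-rank method. How Control Center derives these values is not described here, so the nearest-rank definition, the function names, and the dictionary keys are assumptions chosen for illustration.

```python
import math
from typing import Dict, Sequence


def nearest_rank_percentile(samples_ms: Sequence[float], percentile: float) -> float:
    """Return the smallest sample such that at least `percentile` percent
    of all samples are less than or equal to it (nearest-rank method)."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in the range (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(percentile * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Compute the four percentiles shown in the latency columns."""
    levels = (("median", 50), ("p95", 95), ("p99", 99), ("p99.9", 99.9))
    return {label: nearest_rank_percentile(samples_ms, p) for label, p in levels}
```

The higher percentiles are dominated by the slowest requests, so a broker with a healthy median can still show a poor 99.9th percentile when a small share of requests stalls.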

Topic Aggregate Metrics

Topic count
Total number of topics in the cluster. If the number of topics is less than expected, verify that clients are producing to or consuming from the topics that are missing.
In sync replicas

Total number of topic partition replicas in the cluster that are in sync with the leader. When every replica is in sync, this equals the sum over all topics of (number of partitions × replication factor).

If the number of in sync replicas is less than the number of replicas in your cluster, identify the topic with out of sync replicas from the topic table and view details of the topic to determine which brokers are out of sync. Investigate the broker logs to determine cause.

Out of sync replicas
Total number of topic partition replicas in the cluster that are out of sync with the leader. If the number of out of sync replicas is greater than 0, identify the topic with out of sync replicas from the topic table and view details to determine which brokers are out of sync. Investigate the broker logs to determine cause.

Topic Metrics Table

Name
Topic name.
Throughput

Bytes in – Number of bytes per second produced to this topic.

Bytes out – Number of bytes per second fetched from this topic (does not account for internal replication traffic).

Partition replicas

Total – Total number of partition replicas for this topic.

In sync – Total number of partition replicas that are in sync.

Out of sync – Total number of partition replicas that are out of sync.

Partitions

Total – Number of partitions for this topic.

Under replicated – Number of partitions that are under replicated (i.e., partitions with fewer in-sync replicas than the replication factor).

Segment

Count – Number of log segments for this topic across all partition leaders.

Size – Size in bytes of the log for this topic (does not include replicas).

Offset

Start – Minimum offset across all partitions for this topic.

End – Maximum offset across all partitions for this topic.
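
The per-topic columns above are aggregates over the topic's partitions. The following Python sketch shows one way to derive them from per-partition data. The `PartitionInfo` record, the `topic_row` function, and the key names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class PartitionInfo:
    """Hypothetical per-partition data for a single topic."""
    replication_factor: int
    in_sync_replicas: int
    start_offset: int
    end_offset: int


def topic_row(partitions: Sequence[PartitionInfo]) -> Dict[str, int]:
    """Aggregate partition data into the columns of one topic table row."""
    if not partitions:
        raise ValueError("a topic has at least one partition")
    replicas_total = sum(p.replication_factor for p in partitions)
    replicas_in_sync = sum(p.in_sync_replicas for p in partitions)
    return {
        "partitions_total": len(partitions),
        "partitions_under_replicated": sum(
            1 for p in partitions if p.in_sync_replicas < p.replication_factor
        ),
        "replicas_total": replicas_total,
        "replicas_in_sync": replicas_in_sync,
        "replicas_out_of_sync": replicas_total - replicas_in_sync,
        # Start is the minimum and End the maximum across all partitions.
        "offset_start": min(p.start_offset for p in partitions),
        "offset_end": max(p.end_offset for p in partitions),
    }
```

For a topic with two partitions, replication factor 3, and one lagging replica, this yields 6 total replicas, 5 in sync, 1 out of sync, and 1 under replicated partition.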