Performance and Resource Usage for Self-Balancing in Confluent Platform

Self-Balancing automatically balances partitions and data across the brokers in a Kafka cluster. Performance stress tests measure the efficiency, time, and resource usage of Self-Balancing for common use cases. The tests profile the Kafka controller node because that is the node on which Self-Balancing code runs.

High-level results of performance testing for the following common use cases are provided below:

Expand a cluster
Shrink a cluster
Failover to a new controller

Add brokers to expand a small cluster with a high partition count

Self-Balancing automatically adds new brokers and redistributes partitions across all nodes. This test starts with a five-broker cluster that has a high replica and partition count, then expands it to verify that rebalancing happens automatically.

Test description: expand cluster

The test checks the impact of generating a rebalancing plan for “add broker” operations to incrementally expand a five-broker cluster to eight brokers. After the plan is computed and submitted, Self-Balancing throttles and reassigns partitions in batches. The test validates that the reassignment completes and that it results in a balanced cluster with an evenly distributed workload on all eight brokers.

Cluster configuration: expand cluster

Five brokers, expanded to eight brokers in progressive tests on AWS
90,000 replicas (30,000 partitions)
t2.2xlarge machines
Five topics with 6,000 partitions and replication factor of 3

Performance results: expand cluster

Telemetry reporter metrics collection consumed well under 10% of CPU during rebalancing (adding brokers to expand cluster)
Self-Balancing took, on average, over 1 minute (78 seconds) to generate algorithms and compute the “add broker” plan

Test scalability of a large cluster with many partitions

Self-Balancing scales to large clusters with many partitions. This test runs Self-Balancing on a roughly 40-broker cluster, expanding and then shrinking the cluster.

Test description: scalability test

The test assesses the impact of generating a rebalancing plan for “add broker” operations to expand the cluster from 39 brokers to 48 brokers with about 21,600 replicas on them.

The test then increases replicas to 2,250 per broker (108,000 on 48 brokers) and checks the impact of generating a plan for “remove broker” operations to shrink the cluster.

Cluster configuration: scalability test

39 brokers, expanded to 48 brokers in progressive tests
~10 MB incoming traffic per broker
Started with 21,600 replicas on 39 brokers for “add broker” test and 2,250 replicas per broker (108,000 on 48 brokers) for “remove broker” test
r5.xlarge machines
Started with four topics with 1,800 partitions and replication factor of 3 (21,600 replicas)

Performance results: scalability test

Performance results for “add broker” were in line with results on the other tests, so they are not repeated here.
CPU usage for metrics collection during a “remove broker” operation averages 1% to 3%. Most of the operation completes within 60 seconds, excluding replica reassignment.
Self-Balancing metrics sampling uses under 10% CPU. Self-Balancing samples metrics on a regular interval. For a normal cluster workload, producer throughput was 365 MB per second and consumer throughput was 250 MB per second.
Self-Balancing took, on average, 10 seconds to generate algorithms and compute the “remove broker” plan.

Repeatedly bounce the controller

When a controller is lost, Kafka fails over to a new controller efficiently, and Self-Balancing does not disrupt controller startup. This test repeatedly removes the Kafka controller to measure that impact.

Test description: controller failover

The test removes the Kafka controller multiple times to determine the impact of Self-Balancing failover on controller startup.

Self-Balancing runs on same node as Kafka controller, so controller failover also triggers Self-Balancing failover, which runs Self-Balancing startup code at the same time as controller startup code.

Cluster configuration: controller failover

Eight brokers
90,000 replicas (30,000 partitions)
t2.2xlarge machines
Five topics with 6,000 partitions and replication factor of 3

Performance results: controller failover

CPU usage for controller failover averages 4.5%.
After receiving the failover event, Self-Balancing took 2 seconds to complete startup checks and get into a ready state on the new controller.
These results show that Self-Balancing failover is not overloading the controller for a high partition count setup.