Performance and Resource Usage Test Results

Several performance stress tests were run to determine the efficiency, duration, and resource usage of Self-Balancing for various use cases. The tests profile the Kafka controller node, because that is the node on which Self-Balancing runs.

High-level results of performance testing for the following common use cases are provided below:

  • Expand a cluster
  • Shrink a cluster
  • Failover to a new controller (lead broker)

Add brokers to expand a small cluster with a high partition count

The goal is to test a large number of replicas/partitions on a 5-broker cluster and then expand the cluster, verifying that Self-Balancing automatically adds the new brokers to the cluster and redistributes the partitions across all nodes.

Test Description

The test checks the impact of generating a rebalancing plan for “add broker” operations to incrementally expand a 5-broker cluster to 8 brokers. Once the plan is computed and submitted, Self-Balancing throttles and reassigns partitions in batches. The test validates that the reassignment completes and results in a balanced cluster with an evenly distributed workload on all 8 brokers.
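For reference, one quick way to spot-check that an expansion left replicas evenly spread is to count replica assignments per broker with the Kafka AdminClient. The sketch below is illustrative only (it is not part of the test harness), and the bootstrap address is a placeholder:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.Node;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class ReplicaBalanceCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap address; substitute your cluster's.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // List every topic, then describe them to get per-partition replica sets.
                Set<String> topics = admin.listTopics().names().get();
                Map<Integer, Integer> replicasPerBroker = new HashMap<>();
                for (TopicDescription desc :
                        admin.describeTopics(topics).allTopicNames().get().values()) {
                    for (TopicPartitionInfo partition : desc.partitions()) {
                        for (Node replica : partition.replicas()) {
                            replicasPerBroker.merge(replica.id(), 1, Integer::sum);
                        }
                    }
                }
                // After a successful expansion, counts should be roughly even.
                replicasPerBroker.forEach((broker, count) ->
                        System.out.printf("broker %d: %d replicas%n", broker, count));
            }
        }
    }

With 90K replicas on 8 brokers, a balanced cluster should report roughly 90,000 / 8 ≈ 11,250 replicas per broker.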

Cluster Configurations

  • 5 brokers, expanded to 8 brokers in progressive tests on AWS
  • 90K replicas (30K partitions)
  • t2.2xlarge machines
  • 5 topics with 6,000 partitions each and a replication factor of 3 (5 × 6,000 × 3 = 90,000 replicas)

Performance Results

  • Telemetry reporter metrics collection consumed well under 10% of CPU during rebalancing (adding brokers to expand the cluster)
  • Self-Balancing took, on average, just over 1 minute (78 seconds) to run its algorithms and compute the “add broker” plan

Test scalability of a large cluster with many partitions

The goal is to test the scalability of Self-Balancing on a large cluster (about 40 brokers, the high end) with up to 2,250 replicas per broker by expanding and then shrinking the cluster.

Test Description

The test assesses the impact of generating a rebalancing plan for “add broker” operations to expand the cluster from 39 to 48 brokers, starting with about 21,600 replicas.

The test then increases the replica count to 2,250 per broker (108,000 replicas across 48 brokers) and checks the impact of generating a plan for “remove broker” operations to shrink the cluster.
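Apart from the replica movement itself, completion of an operation like this can be confirmed by polling for in-flight reassignments with the AdminClient. The following is a minimal sketch with a placeholder bootstrap address; it is not the mechanism Self-Balancing uses internally:

    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.PartitionReassignment;
    import org.apache.kafka.common.TopicPartition;

    public class ReassignmentWatcher {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            try (Admin admin = Admin.create(props)) {
                while (true) {
                    // All partition reassignments currently in flight, whether
                    // started by Self-Balancing or by an operator.
                    Map<TopicPartition, PartitionReassignment> inFlight =
                            admin.listPartitionReassignments().reassignments().get();
                    if (inFlight.isEmpty()) {
                        System.out.println("No reassignments in flight; the operation is done.");
                        break;
                    }
                    System.out.printf("%d partitions still moving...%n", inFlight.size());
                    Thread.sleep(10_000); // poll every 10 seconds
                }
            }
        }
    }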

Cluster Configurations

  • 39 brokers, expanded to 48 brokers in progressive tests
  • ~10MB/s incoming traffic per broker
  • Started with 21,600 replicas on 39 brokers for the “add broker” test, and 2,250 replicas per broker on 48 brokers (108,000 replicas in total) for the “remove broker” test
  • r5.xlarge machines
  • Started with 4 topics with 1,800 partitions each and a replication factor of 3 (4 × 1,800 × 3 = 21,600 replicas)

Performance Results

  • Performance results for “add broker” were in line with the results of the other tests, so they are not repeated here.
  • CPU usage for metrics collection during the “remove broker” operation averaged between 1% and 3%, and most of the operation (with the exception of replica reassignment) completed within 60 seconds.
  • CPU usage for the periodic metrics sampling performed by Self-Balancing was optimized to under 10%. Self-Balancing samples metrics on a regular schedule, determined by how metric.sampling.interval.ms is configured (for example, every 3 minutes; see the config check sketched below). Under a normal cluster workload, producer throughput was 365MB/s and consumer throughput was 250MB/s.
  • Self-Balancing took, on average, 10 seconds to run its algorithms and compute the “remove broker” plan.
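To confirm what sampling interval a broker is actually running with, one option is to read the broker's configuration with the AdminClient. A minimal sketch: the broker id and bootstrap address are placeholders, and the property name is taken from the text above.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class SamplingIntervalCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            try (Admin admin = Admin.create(props)) {
                // Broker id "0" is an assumption; any broker in the cluster works.
                ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
                Config config = admin.describeConfigs(Collections.singleton(broker))
                        .all().get().get(broker);
                // Property name as given above; a value of 180000 ms would mean
                // one sample every 3 minutes.
                ConfigEntry entry = config.get("metric.sampling.interval.ms");
                System.out.println(entry == null ? "not set on this broker" : entry.value());
            }
        }
    }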

Repeatedly bounce the controller

The goal of this test is to verify that when the controller is lost, Kafka executes failover efficiently and assigns a new controller, with Self-Balancing causing no disruption to controller startup.

Test Description

The test removes the Kafka controller several times to determine the impact of Self-Balancing failover on controller startup.

Self-Balancing runs on the same node as the Kafka controller, so a controller failover also triggers a Self-Balancing failover, which runs the Self-Balancing startup code at the same time as the controller startup code.
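During a bounce test like this, a harness can track where the controller (and therefore Self-Balancing) lands after each failover by polling cluster metadata. A minimal sketch with a placeholder bootstrap address:

    import java.util.Properties;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.common.Node;

    public class ControllerWatcher {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            try (Admin admin = Admin.create(props)) {
                int lastController = -1;
                while (true) {
                    // The broker currently acting as controller; Self-Balancing
                    // fails over to this node as well.
                    Node controller = admin.describeCluster().controller().get();
                    if (controller != null && controller.id() != lastController) {
                        System.out.printf("controller moved to broker %d%n", controller.id());
                        lastController = controller.id();
                    }
                    Thread.sleep(1_000); // check once per second during the bounce test
                }
            }
        }
    }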

Cluster Configurations

  • 8 brokers
  • 90K replicas (30K partitions)
  • t2.2xlarge machines
  • 5 topics with 6,000 partitions each and a replication factor of 3

Performance Results

  • CPU usage for controller failover averaged 4.5%.
  • After receiving the failover event, Self-Balancing took 2 seconds to complete its startup checks and reach a ready state on the new controller.
  • These results indicate that Self-Balancing failover does not overload the controller in a high-partition-count setup.