Self-Balancing Clusters Demo (Docker)¶
This quick start demonstrates the capabilities of Self-Balancing Clusters on Confluent Server using Docker containers. Self-Balancing Clusters shift data across brokers to create even workloads according to a variety of distribution and performance goals. By the end of this demo, you will have successfully run Self-Balancing Clusters to rebalance data after adding and removing brokers.
About the Demo¶
Self-balancing simplifies the management of Kafka clusters in the following ways:
- When the load on the cluster is unevenly distributed, Self-Balancing automatically rebalances partitions to optimize performance
- When a new broker is added to the cluster, Self-Balancing automatically fills it with partitions
- When an operator wants to remove a broker, she can call a Self-Balancing API to shut down the broker and drain the partitions from it
- When a broker has been down for a certain amount of time, Self-Balancing will automatically reassign the partitions to other brokers
This quick start provides hands-on examples for each of the above scenarios.
Setup and What’s Included¶
This demo pulls the following Docker images:
- Confluent Control Center (Optional to also try out the UI in addition to command line.)
- Docker version 1.11 or later is installed and running.
- Docker Compose is installed. Docker Compose is installed by default with Docker for Mac.
- Docker memory is allocated minimally at 8 GB. When using Docker Desktop for Mac, the default Docker memory allocation is 2 GB. You can change the default allocation to 8 GB in Docker. Navigate to Preferences > Resources > Advanced.
- Internet connectivity
- Operating System currently supported by Confluent Platform
- Networking and Kafka on Docker
- Configure your hosts and ports to allow both internal and external components to the Docker network to communicate.
Start the Services¶
Clone the Confluent demo-scene repository from GitHub and work in the demo-scene/self-balancing subdirectory, which provides the sample code you will compile and run in this tutorial.
git clone example uses SSH, if your Git configuration is set for HTTPS, use
git clone https://github.com/confluentinc/demo-scene.git instead.
git clone email@example.com:confluentinc/demo-scene.git
Run the Demo Scenarios¶
The following exercises describe how to create various types of cluster imbalances modeled on typical, real-world scenarios. For each case, you will use Self-Balancing to monitor and auto balance the cluster.
Create a topic with topic data distributed unevenly across the brokers.
docker-composein the foreground to start ZooKeeper, 3 Confluent Platform brokers, and Control Center.
docker-compose -f kafka-0-1-2.yml up
Running this in the foreground allows you to monitor the logs for interesting information about the state of the cluster. Open a new command window to run the commands in the next steps.
Create a topic with explicit replica assignments that force a data distribution with no replicas on broker 2 to create an uneven load.
kafka-topics \ --bootstrap-server localhost:9092 \ --create \ --topic sbc \ --replica-assignment 0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1,0:1
Produce data at a rate of about 1MB/second per hour.
kafka-producer-perf-test \ --producer-props bootstrap.servers=localhost:9092 \ --topic sbc \ --record-size 1000 \ --throughput 1000 \ --num-records 3600000
Start watching changes in the topic.
watch kafka-topics \ --bootstrap-server localhost:9092 \ --describe \ --topic sbc
This watch will show changes in replica assignments, in-sync replicas, and relevant information such as replication throttling details.
Optionally, use Control Center to watch the same changes. Control Center is available at http://localhost:9021/ in your web browser.
Wait for Self-Balancing to start the rebalance.
Self-Balancing samples data about disk use, leader count, partition count, network throughput and other variables. It then aggregates this data to make informed decisions about how to reassign partitions.
- Self-Balancing needs between 10 and 20 minutes from startup to collect enough samples to generate a rebalancing plan (if one is needed).
- Self-Balancing invalidates previous samples when the number of partitions in the cluster changes materially since they may not accurately reflect the current state of the cluster.
While Self-Balancing is still sampling, the following message appears on the logs periodically:
INFO Skipping proposal precomputing because load monitor does not have enough snapshots.
After the sampling is completed, the rebalancing process begins.
Monitor the logs and the
watch kafka-topicscommand to observe the changes as they happen.
Add two more brokers to the cluster and watch Self-Balancing fill them with partitions.
docker-composein a new command window with the following files to add two brokers to the previous setup (do not stop any of the processes from the first part of the tutorial).
docker-compose -f kafka-0-1-2.yml -f kafka-3-4.yml up --no-recreate
Watch Self-Balancing rebalance the cluster.
Self-Balancing should be able to use the data it has already sampled, and the rebalance should kick off almost immediately.
The logs should reflect how Self-Balancing is also detecting the fact that you are rebalancing to a new broker, and not just the fact that the cluster is out of balance. This is important because Self-Balancing offers a configuration parameter called confluent.balancer.heal.uneven.load.trigger which you can set to either
EMPTY_BROKERwill only rebalance when Self-Balancing finds an empty broker (this expansion scenario)
ANY_UNEVEN_LOADwill rebalance when the load is uneven, regardless of the cause (the previous uneven load scenario).
Remove a broker from the cluster and watch Self-Balancing shut it down and drain the partitions.
Trigger a broker removal.
For this example, use REST API for Apache Kafka®.
First, the following
curlcommand will return a collection with all the clusters the REST API can address.
In this case, there is only one cluster. Note the value of
data.cluster_idas the ID of your cluster.
You have several options for triggering a broker removal.
The following command will trigger the removal of broker 3. (Replace
<cluster_id>in the example with the cluster ID from the previous step.)
curl -X DELETE localhost:8090/kafka/v3/clusters/:<cluster_id>/brokers/3
This should return
Alternatively, you can perform this step using kafka-remove-brokers, available as a part of Self-Balancing starting with Confluent Platform 6.0.0.
kafka-remove-brokers \ --bootstrap-server localhost:9092 \ --delete \ --broker-id 3
Finally, you can also remove a broker through the Confluent Control Center from the Brokers Overview page.
Watch Self-Balancing remove the broker.
Self-Balancing should be able to use the data it has already sampled, and the removal should kick off almost immediately. The removal, however, is a long-running process (because it involves data movement).
Run the following command to get a
broker_shutdown_statuswhile the removal operation is in progress. (Replace
<cluster_id>with your cluster ID.)
As an alternative monitoring option, you can plug in the
kafka-remove-brokerscommand. For example:
kafka-remove-brokers \ --bootstrap-server localhost:9092 \ --describe \ --broker-id 3
Simulate a broker failure and watch Self-Balancing address this condition.
Stop a container. For example, to stop broker 4:
docker container stop $(docker container ls -q --filter name=kafka4)
Watch Self-Balancing create new replicas.
Self-Balancing should be able to use the data it has already sampled, and the creation of new replicas should kick off almost immediately.
For this demo, confluent.balancer.heal.broker.failure.threshold.ms is set to
5000, meaning that Self-Balancing will consider the broker dead after only 5 seconds. This is not suitable for production environments, where typically timeouts should be set to between 30 minutes and to 1 hour (or whatever is most reasonable for your environment), but this is helpful here so that the demo runs faster.
When you are done working with Docker, you can stop and remove Docker containers and images as needed.
- Use docker container ls to view a list of all Docker running containers.
- Use docker container stop to stop running containers.
- Run docker images ls to list images.
- Run docker image rm to remove images you do not want to keep on your system.
You can rebuild and restart the containers at any time using the
docker-compose up -d command. (If you have deleted the images, this command will pull them down again.)
For more information, refer to the official Docker documentation.