Multiple Availability Zones Deployment of Confluent

Confluent for Kubernetes provides deployment, management, and operational support for Confluent components on Kubernetes.

Most Kubernetes offerings in a public cloud support running in a regional configuration, where the Kubernetes cluster is spread across multiple availability zones (multi-AZ).

This topic describes the behavior and practices for Confluent for Kubernetes in multi-AZ deployments.

Note that the examples in this topic do not specify a namespace. They assume that you have set your namespace as the current namespace.

In addition to deploying Confluent for Kubernetes across multiple AZs, you can configure Confluent for Kubernetes for Rack Awareness, where the racks are AZs. With Rack Awareness, Kafka partition replicas are placed across AZs to minimize data loss in the event of an AZ failure. To configure Rack Awareness, see the tutorial on configuring Rack Awareness for AZs.

Multi-AZ Kubernetes

In a multi-AZ Kubernetes cluster, the worker nodes are spread across AZs. Each node is labeled with the AZ that it is in. To check the AZ labels:

kubectl get nodes --show-labels

The output of the above command includes the AZ label, failure-domain.beta.kubernetes.io/zone. For example:

failure-domain.beta.kubernetes.io/zone=us-central1-f
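
Scanning the full label list can be tedious on a large cluster. As a minimal sketch, the -L (--label-columns) flag of kubectl get prints a label as its own column. The function name show_node_zones is hypothetical. The second label, topology.kubernetes.io/zone, is the stable name that replaced the beta label in newer Kubernetes releases, so which column is populated depends on your cluster version.

```shell
# Hypothetical helper: list nodes with their zone labels as extra columns.
# Both the beta label used in this topic and its stable successor are shown,
# because which one is set depends on the Kubernetes version.
show_node_zones() {
  kubectl get nodes \
    -L failure-domain.beta.kubernetes.io/zone \
    -L topology.kubernetes.io/zone
}
```

Run show_node_zones after defining it. Each node appears on one line with its zone in the trailing columns.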

The storage used by the worker nodes is available across all AZs.

Confluent pod scheduling

The Confluent component instance pods are scheduled across the AZs in a round-robin fashion.

For example, if you schedule 6 Kafka brokers to 3 AZs (each AZ with 2 worker nodes), each AZ ends up with 2 Kafka brokers.

If an AZ does not have enough resources or worker capacity, the pods are scheduled on any worker in a zone that has capacity.

For example, on a Kubernetes cluster with the following worker node spread:

  • 4 workers in AZ 1
  • 1 worker in AZ 2
  • 4 workers in AZ 3

If you schedule 6 Kafka brokers, the pods would be scheduled in one of the following ways:

  • 2 broker pods in AZ 1, 1 broker pod in AZ 2, 3 broker pods in AZ 3
  • 3 broker pods in AZ 1, 1 broker pod in AZ 2, 2 broker pods in AZ 3

Once deployed, you can check which nodes your pods are scheduled to:

  1. Get the node names of the pods:

    kubectl get pods -o wide
    
  2. For each node, check which AZ the node is in by looking at the failure-domain.beta.kubernetes.io/zone label of the node:

    kubectl get nodes --show-labels | grep <node_name>
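
The two steps above can be combined into one pass. The following is a rough sketch, not part of CFK: the function name pods_per_zone and the variable ZONE_LABEL_PATH are hypothetical, and the sketch assumes that nodes carry the beta zone label used in this topic. It counts every pod in the current namespace, not only Kafka brokers.

```shell
# jsonpath for the AZ label; this topic uses the beta label. Newer clusters may
# use topology.kubernetes.io/zone instead (dots in the key must stay escaped).
ZONE_LABEL_PATH='{.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone}'

# Hypothetical helper: print "<count> <zone>" for pods in the current namespace.
pods_per_zone() {
  kubectl get pods --no-headers -o custom-columns=NODE:.spec.nodeName |
    while read -r node; do
      # Pods that are not scheduled yet have no node; skip them.
      [ "$node" = "<none>" ] && continue
      # Look up the zone label of the node that runs this pod.
      kubectl get node "$node" -o jsonpath="$ZONE_LABEL_PATH" </dev/null
      echo
    done | sort | uniq -c
}
```

Calling pods_per_zone prints one line per zone, which makes an uneven spread such as the 2/1/3 case above easy to spot.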
    

Storage scheduling

Confluent for Kubernetes uses persistent volumes for each component’s storage needs. In most cloud providers, the persistent volume storage mechanism is AZ aware, and persistent volumes cannot be accessed or moved across AZs. Thus, pods that are scheduled in a specific AZ must have their persistent volume in the same AZ.

CFK requires lazily bound storage classes on Kubernetes clusters that are spread across multiple AZs. Ensure that the storage class you use has volumeBindingMode set to WaitForFirstConsumer. This delays the binding and provisioning of the PersistentVolume until a pod using the PersistentVolumeClaim is created. PersistentVolumes are then selected or provisioned to conform to the topology that is specified by the pod’s scheduling constraints.

Check the binding mode of a storage class with the following command:

kubectl describe sc <your_storage_class>

The output should have VolumeBindingMode set to WaitForFirstConsumer as shown below:

Name:                  <your_storage_class>
...
VolumeBindingMode:     WaitForFirstConsumer
...
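
For scripted checks, the binding mode can be read directly instead of being parsed from the describe output. This is a minimal sketch under the assumption that you want a pass/fail result; the function name is_lazily_bound is hypothetical.

```shell
# Hypothetical helper: succeed only if the storage class binds lazily.
# volumeBindingMode is a top-level field of the StorageClass object.
is_lazily_bound() {
  mode="$(kubectl get sc "$1" -o jsonpath='{.volumeBindingMode}')"
  [ "$mode" = "WaitForFirstConsumer" ]
}
```

For example, is_lazily_bound <your_storage_class> returns a zero exit status when the class is suitable for a multi-AZ deployment and a non-zero status otherwise.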

Once deployed, you can check to see which AZ the persistent volume is assigned to. For example:

kubectl describe pv pvc-008a81fd-cd28-4b3e-a26f-5fe9384476f9

A sample output of the above command, showing that the persistent volume is located in the us-central1-a zone:

Name:        pvc-008a81fd-cd28-4b3e-a26f-5fe9384476f9
Labels:      failure-domain.beta.kubernetes.io/region=us-central1
             failure-domain.beta.kubernetes.io/zone=us-central1-a
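
If you know the claim name rather than the volume name, the lookup can be chained. The sketch below is illustrative only: pvc_zone is a hypothetical function name, and it assumes the volume carries the beta zone label shown in the sample output. Some provisioners record the zone only in the volume's node affinity, in which case the function prints an empty line.

```shell
# Hypothetical helper: print the zone label of the volume bound to a claim.
# Assumes the volume carries the beta zone label shown in this topic.
pvc_zone() {
  # Resolve the claim to the name of its bound persistent volume.
  pv="$(kubectl get pvc "$1" -o jsonpath='{.spec.volumeName}')"
  # Read the zone label from that volume (dots in the key are escaped).
  kubectl get pv "$pv" \
    -o jsonpath='{.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone}'
  echo
}
```

For example, pvc_zone data0-kafka-0 prints the zone of the volume that backs the first broker of a Kafka cluster named kafka.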

Scenario: Worker failure

When a single Kubernetes worker node fails, the pods on that worker node disappear.

In that case, CFK will schedule the Confluent pod on another Kubernetes worker with the required resource capacity in the same AZ, and the pod will attach to the same persistent volume.

When there is no available worker node with the required resource capacity, the pod will not be scheduled, and it will be in the pending state.

When resource capacity becomes available in the zone, the pod will be scheduled.
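
To see which pods are waiting for capacity, you can filter on the pod phase. This is a small sketch, and the function name list_pending_pods is hypothetical.

```shell
# Hypothetical helper: list pods in the current namespace that are still
# waiting to be scheduled, with the wide output that includes the node column.
list_pending_pods() {
  kubectl get pods --field-selector=status.phase=Pending -o wide
}
```

For a pod in the list, kubectl describe pod <pod_name> shows the scheduler events, which typically state why no node was suitable.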

Note

Confluent for Kubernetes uses the oneReplicaPerNode feature to ensure that Kafka brokers are not co-located on the same Kubernetes worker node, and the same for ZooKeeper servers. Therefore, you need to have enough Kubernetes worker nodes to accommodate Kafka and ZooKeeper.

Scenario: AZ failure

When an entire AZ fails, all the Kubernetes worker nodes in that AZ are unavailable. The storage is resilient only within the AZ, and thus is unavailable as well.

In this case, the pods and the persistent volumes are not available, and they cannot be brought up automatically. They will be in the pending state.

There are two paths that you can take to resolve the issue: waiting for the AZ to become available or scheduling pods to a different AZ.

Wait for AZ to become available

When you wait for the AZ to be available, the specific pods in that AZ will remain in the pending state until the AZ is available. When the AZ is back up, those pods will be brought up.

In this case, make sure that there is enough broker capacity to operate while an AZ is down. The replication factor and the ISR (In-Sync Replicas) settings should work with the number of brokers available when one AZ is out.
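
The sizing rule can be checked with simple arithmetic. The sketch below rests on assumptions that are not stated in this topic: replicas of each partition are spread evenly across AZs (for example, through Rack Awareness), and the ISR setting in question is the Kafka min.insync.replicas configuration. Both function names are hypothetical.

```shell
# Hypothetical helper: worst-case number of replicas of one partition that
# survive the loss of a single AZ, assuming an even spread across AZs.
# Usage: surviving_replicas <replication_factor> <number_of_azs>
surviving_replicas() {
  rf="$1"; azs="$2"
  # The fullest AZ holds ceil(rf / azs) replicas of the partition.
  lost=$(( (rf + azs - 1) / azs ))
  echo $(( rf - lost ))
}

# Hypothetical helper: succeed if enough replicas remain with one AZ down to
# satisfy the minimum in-sync replica count.
# Usage: tolerates_az_loss <replication_factor> <number_of_azs> <min_insync_replicas>
tolerates_az_loss() {
  [ "$(surviving_replicas "$1" "$2")" -ge "$3" ]
}
```

With a replication factor of 3 across 3 AZs, two replicas survive an AZ outage, so tolerates_az_loss 3 3 2 succeeds while tolerates_az_loss 3 3 3 fails.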

Schedule Confluent pods to a different AZ

In this path, the Kafka broker pod will start up with fresh storage, and will replicate partitions from other brokers before it can start accepting read/write requests.

The example commands below use kafka as the Kafka cluster name.

To schedule pods to a different AZ:

  1. Temporarily block CFK from reconciling the Kafka resource:

    kubectl annotate kafka kafka platform.confluent.io/block-reconcile=true
    
  2. Delete the persistent volume claim (PVC).

    The PVC is named data0-<clustername>-<pod_index>, for example data0-<clustername>-0 for the first pod.

    For example, if the pod in the failed AZ is kafka-0 and the Kafka cluster name is kafka, the command would be:

    kubectl patch pvc data0-kafka-0 \
       -p '{"metadata":{"finalizers":[]}}' \
       --type=merge \
       | kubectl delete pvc data0-kafka-0
    
  3. Delete the pods that were associated with the unavailable AZ:

    kubectl delete pod kafka-0 --grace-period=0 --force
    
  4. Re-enable resource reconciliation and trigger a reconcile. This will schedule the pod on an available AZ:

    kubectl annotate kafka kafka platform.confluent.io/block-reconcile-
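
After reconciliation resumes, you may want to confirm that the broker came back. The following is a hedged sketch rather than a documented CFK procedure: the function name verify_rescheduled_broker and the 600-second timeout are assumptions, and the pod and claim names follow the example above.

```shell
# Hypothetical helper: block until the rescheduled broker pod is Ready, then
# show which node the pod landed on and the state of its new claim.
# Usage: verify_rescheduled_broker <pod_name> <pvc_name>
verify_rescheduled_broker() {
  pod="$1"; pvc="$2"
  # The timeout is an assumed value; adjust it to your recovery expectations.
  kubectl wait --for=condition=Ready "pod/$pod" --timeout=600s &&
    kubectl get pod "$pod" -o wide &&
    kubectl get pvc "$pvc"
}
```

For example, run verify_rescheduled_broker kafka-0 data0-kafka-0. If the pod has not been recreated yet, kubectl wait reports that the pod is not found, and you can run the function again. A Ready pod only indicates that the broker started. It may still be replicating partitions from the other brokers.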