Disaster Recovery using VolumeSnapshots with Confluent Manager for Apache Flink

An important part of any system is the ability to recover from a disaster. Typically, this is done by backing up and restoring data. This topic provides an overview of how to back up and restore Flink applications with Confluent Manager for Apache Flink® (CMF), based on your cloud provider.

To back up and restore CMF metadata, you use the Kubernetes Volume Snapshots feature. Google Cloud, AWS, and Azure all support Volume Snapshots.

A Volume Snapshot is a Kubernetes resource that captures the state of a persistent volume claim (PVC) at a specific point in time. For more about Volume Snapshots, see Kubernetes Volume Snapshots. To learn more about persistent volumes, see Kubernetes Persistent Volumes.

Creating a backup of CMF metadata and restoring from a backup varies by cloud provider. Following are prerequisites and details on how to configure backup and restore for each cloud provider.

Prerequisites

Note the following common terminology:

  • Custom Resource Definition (CRD) - A component that extends the Kubernetes API.
  • Container Storage Interface (CSI) - An industry standard that enables storage systems to integrate seamlessly with container orchestration platforms like Kubernetes.
  • Persistent volume claim (PVC) - A request for storage by a Kubernetes pod.
  • Volume snapshot - A Kubernetes resource that captures the state of a persistent volume claim (PVC) at a specific point in time.

Following are prerequisites based on your cloud provider:

AWS

Complete the following steps to enable Volume Snapshots on Amazon Elastic Kubernetes Service (EKS):

  1. Enable the EKS add-on for the Amazon Elastic Block Store (EBS) Container Storage Interface (CSI) driver.

  2. Configure an IAM role for the driver’s controller. Attach the AmazonEBSCSIDriverPolicy managed policy to grant the necessary permissions, such as ec2:CreateSnapshot.

  3. Use the following commands to install the Container Storage Interface (CSI) snapshot controller components and custom resource definition (CRD):

    # CRDs
    kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
    kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
    kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
    # Controllers
    kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
    kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
    
  4. Create the storage class:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: ebs-sc
    provisioner: ebs.csi.aws.com
    parameters:
      type: gp3
    reclaimPolicy: Retain
    volumeBindingMode: WaitForFirstConsumer
    
  5. Before installing CMF, ensure that the default storage class is set to ebs-sc.

    # Disable gp2 as the default storage class
    kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
    # Enable ebs-sc as the default storage class
    kubectl patch storageclass ebs-sc -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
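
    After completing these steps, you can optionally verify that the snapshot CRDs are registered and that the snapshot controller is running. The Deployment name and namespace below match the upstream external-snapshotter manifests; adjust them if your installation differs:

    # Verify that the three snapshot CRDs are registered
    kubectl get crd volumesnapshotclasses.snapshot.storage.k8s.io \
      volumesnapshotcontents.snapshot.storage.k8s.io \
      volumesnapshots.snapshot.storage.k8s.io
    # Verify that the snapshot controller Deployment is available
    kubectl get deployment snapshot-controller -n kube-system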
    

Google Cloud

  • The Google Cloud Persistent Disk CSI driver is a managed add-on and is enabled by default for Autopilot clusters. You may need to enable it for Standard clusters. For more information, see Using the Compute Engine persistent disk CSI Driver.
  • You do not need to install CRDs for volume snapshots. They are pre-installed.
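
For Standard clusters, you can enable the driver with the gcloud CLI. The cluster name and zone below are placeholders for your own values:

    gcloud container clusters update <cluster-name> \
      --update-addons=GcePersistentDiskCsiDriver=ENABLED \
      --zone=<zone>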

OpenShift on Azure

  • OpenShift provides a snapshot controller operator (csi-snapshot-controller-operator), so you do not need to install the CRDs.
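
You can confirm that the snapshot CRDs are already present before proceeding:

    oc get crd | grep snapshot.storage.k8s.io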

Create the snapshots

Use the following steps to create a snapshot.

  1. Install the VolumeSnapshotClass.

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotClass
    metadata:
      name: cmf-snapshot
    # Set driver to the CSI driver for your cloud provider:
    # ebs.csi.aws.com (AWS), pd.csi.storage.gke.io (Google Cloud), or disk.csi.azure.com (Azure)
    driver: <ebs.csi.aws.com/pd.csi.storage.gke.io/disk.csi.azure.com>
    deletionPolicy: Delete
    
  2. Create the snapshot.

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: first-snapshot
    spec:
      volumeSnapshotClassName: cmf-snapshot
      source:
        # The specified PVC should point to CMF PVC.
        persistentVolumeClaimName: confluent-manager-for-apache-flink-pvc
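
    After applying the manifest, wait for the snapshot to become ready before relying on it for recovery. The namespace below assumes CMF is installed in the confluent namespace:

    kubectl get volumesnapshot first-snapshot -n confluent \
      -o jsonpath='{.status.readyToUse}'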
    

Restore from a volume snapshot

Follow these steps to restore CMF metadata from a volume snapshot.

Note

For Azure OpenShift, you can also restore from the OpenShift console by navigating to Storage → VolumeSnapshots. Locate the snapshot, click the ellipsis (…), and select the Restore as New PVC option. After the PVC is created on the OpenShift cluster, you can continue to the next step.

  1. Install the VolumeSnapshot resource.

    Create a VolumeSnapshot resource that references the VolumeSnapshotClass and the VolumeSnapshotContent resource (created in the next step).

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: restored-snapshot
    spec:
      volumeSnapshotClassName: cmf-snapshot
      source:
        volumeSnapshotContentName: restored-vsc
    
  2. Create the VolumeSnapshotContent resource that references the snapshot handle.

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotContent
    metadata:
      name: restored-vsc
    spec:
      deletionPolicy: Retain
      # Set driver to the CSI driver for your cloud provider
      driver: <ebs.csi.aws.com/pd.csi.storage.gke.io/disk.csi.azure.com>
      source:
        snapshotHandle: <snapshot-handle>
      volumeSnapshotClassName: cmf-snapshot
      volumeSnapshotRef:
        name: restored-snapshot
        namespace: confluent
    

    Note

    The snapshotHandle value can be obtained from your cloud provider’s UI such as the AWS Console.
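
    On AWS, for example, you can also list EBS snapshot IDs with the AWS CLI:

    aws ec2 describe-snapshots --owner-ids self \
      --query 'Snapshots[*].[SnapshotId,StartTime,Description]' --output table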

  3. Create a restored PVC by using the restored volume snapshot.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-restore
      namespace: confluent
    spec:
      storageClassName: ebs-sc
      dataSource:
        name: restored-snapshot
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
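
    Because the ebs-sc storage class uses volumeBindingMode: WaitForFirstConsumer, the PVC may remain in the Pending state until a pod mounts it. You can check its status with:

    kubectl get pvc pvc-restore -n confluent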
    
  4. Restore CMF.

    To restore CMF, create a YAML file and override the following Helm values to mount the restored PVC.

    # restore-cmf.yaml
    ...
    mountedVolumes:
      volumes:
      - name: cmf-metadata
        persistentVolumeClaim:
          claimName: pvc-restore
      volumeMounts:
      - name: cmf-metadata
        mountPath: "/app/local"
    ...
    
  5. Install CMF through Helm with persistence set to false.

    helm upgrade --install cmf confluentinc/confluent-manager-for-apache-flink -n confluent -f restore-cmf.yaml --set persistence.create=false
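
    After the Helm release is installed, you can verify that the restored PVC is bound and that the CMF pod is running:

    kubectl get pvc pvc-restore -n confluent
    kubectl get pods -n confluent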