Disaster Recovery and Failover

Looking for Confluent Platform Cluster Linking docs? This page describes Cluster Linking on Confluent Cloud. If you are looking for the Confluent Platform documentation, check out Cluster Linking on Confluent Platform.

Important

This feature is available as a preview feature. A preview feature is a component of Confluent Cloud that is being introduced to gain early feedback from developers. These features can be used for evaluation and non-production testing purposes or to provide feedback to Confluent. Comments, questions, and suggestions related to preview features are encouraged and can be submitted to clusterlinking@confluent.io.

To explore this use case, you will use Cluster Linking to fail over a topic and a simple command-line consumer and producer from an original cluster to a Disaster Recovery (“DR”) cluster.

Prerequisites

  • Make sure you have followed the steps under Commands and Prerequisites in the overview.
  • Your DR cluster must be a Dedicated cluster with Public internet endpoints.
  • Your Original cluster can be a Basic, Standard, or Dedicated cluster with Public internet endpoints. If you do not have these clusters already, you can create them in the Confluent Cloud UI or in the Confluent Cloud CLI.

Tip

You can use failover between an eligible Confluent Platform cluster and an eligible Confluent Cloud cluster. You will need to use the Confluent Cloud CLI for the Confluent Cloud cluster, and the Confluent CLI for the Confluent Platform cluster.

What the Tutorial Covers

This tutorial demonstrates how to use the Confluent Cloud CLI Cluster Linking commands to set up a DR cluster and fail over to it.

A REST API for Cluster Linking commands may be available in future releases.

You will start by building a cluster link to your DR cluster and mirroring all pertinent topics, ACLs, and consumer group offsets. This is your “steady state” setup.
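The later commands in this tutorial assume a cluster link named dr-link already exists. A minimal sketch of creating it on the DR (destination) cluster follows; the flag names and config keys are assumptions based on the preview CLI, so check `ccloud kafka link create --help` for your version. Depending on your setup, the link configuration may also need credentials for the source cluster.

```shell
# Example IDs from this tutorial; substitute your own.
DR_CLUSTER_ID="lkc-r68yp"
ORIGINAL_CLUSTER_ID="lkc-xkd1g"
ORIGINAL_BOOTSTRAP="pkc-9kyp5.us-east-1.aws.confluent.cloud:9092"

# Link configuration: mirror ACLs and consumer group offsets over the link.
# Key names are assumptions; verify them against your CLI version.
cat > dr-link.config <<'EOF'
acl.sync.enable=true
consumer.offset.sync.enable=true
consumer.offset.group.filters={"groupFilters":[{"name":"*","patternType":"LITERAL","filterType":"INCLUDE"}]}
EOF

# Create the link on the DR (destination) cluster, pointing at the original
# (source) cluster. Guarded so the sketch is a no-op without an
# authenticated ccloud session.
if command -v ccloud >/dev/null; then
  ccloud kafka link create dr-link \
    --cluster "$DR_CLUSTER_ID" \
    --source-cluster-id "$ORIGINAL_CLUSTER_ID" \
    --source-bootstrap-server "$ORIGINAL_BOOTSTRAP" \
    --config-file dr-link.config
fi
```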


Then you will simulate an outage on the original cluster. During the outage, the producers, consumers, and cluster link cannot interact with the original cluster.


You will then call a failover command that converts the mirror topics on the DR cluster into regular topics.


Finally, you will move producers and consumers over to the DR cluster, and continue operations. The DR cluster is now your new source of truth.


Set up Steady State

Note down required information

For this walkthrough, you will need the following:

  • The cluster ID of your original cluster (<original-cluster-id> for purposes of this tutorial). To get cluster IDs on Confluent Cloud, type ccloud kafka cluster list at the Confluent Cloud CLI. In the examples below, the original cluster ID is lkc-xkd1g.

  • The bootstrap server for the original cluster. To get this, use the command ccloud kafka cluster describe <original-cluster-id>. Your output will resemble the following. In the example output below, the bootstrap server is the value: SASL_SSL://pkc-9kyp5.us-east-1.aws.confluent.cloud:9092. This value is referred to as <original-bootstrap-server> in this tutorial.

    +--------------+---------------------------------------------------------+
    | Id           | lkc-xkd1g                                               |
    | Name         | AWS US                                                  |
    | Type         | DEDICATED                                               |
    | Ingress      |                                                      50 |
    | Egress       |                                                     150 |
    | Storage      | Infinite                                                |
    | Provider     | aws                                                     |
    | Availability | single-zone                                             |
    | Region       | us-east-1                                               |
    | Status       | UP                                                      |
    | Endpoint     | SASL_SSL://pkc-9kyp5.us-east-1.aws.confluent.cloud:9092 |
    | ApiEndpoint  | https://pkac-rrwjk.us-east-1.aws.confluent.cloud        |
    | RestEndpoint | https://pkc-9kyp5.us-east-1.aws.confluent.cloud:443     |
    | ClusterSize  |                                                       1 |
    +--------------+---------------------------------------------------------+
    
  • The cluster ID of your DR cluster (<DR-cluster-id>). You can get this in the same way as your original cluster ID, with the command ccloud kafka cluster list.

  • The bootstrap server of your DR cluster (<DR-bootstrap-server>). You can get this in the same way as your original cluster bootstrap server, by using the describe command: ccloud kafka cluster describe <DR-cluster-id>.

Configure the source topic on the original cluster and mirror topic on the DR cluster

  1. Create a topic called dr-topic on the original cluster.

    For the sake of this demo, create this topic with only one partition (--partitions 1). Having only one partition makes it easier to see how consumer offsets are synced from the original cluster to the DR cluster.

    ccloud kafka topic create dr-topic --partitions 1 --cluster <original-cluster-id>
    

    You should see the message Created topic “dr-topic”, as shown in the example below.

    > ccloud kafka topic create dr-topic --partitions 1 --cluster lkc-xkd1g
    Created topic "dr-topic".
    
  2. Create a mirror topic of dr-topic on the DR cluster.

    ccloud kafka mirror create dr-topic --link <link-name> --cluster <DR-cluster-id>
    

    You should see the message Created mirror topic “dr-topic”, as shown in the example below.

    > ccloud kafka mirror create dr-topic --link dr-link --cluster lkc-r68yp
    Created mirror topic "dr-topic".
    

    Tip

    In the current preview release, you must create each mirror topic one-by-one with a CLI or API command. Future releases may provide a service to Cluster Linking that can automatically create mirror topics when new topics are created on your source cluster. In this case, you could filter the topics based on a prefix. That feature would be useful for automatically creating DR topics on a DR cluster.
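Until such a service exists, a small loop can create the mirror topics one at a time. A minimal sketch using the example link name and cluster ID from this tutorial; orders and payments are hypothetical topic names standing in for your own list.

```shell
# Create mirror topics one-by-one over the link "dr-link".
DR_CLUSTER_ID="lkc-r68yp"   # example ID from this tutorial

: > mirrored-topics.txt   # keep an audit trail of the topics requested
for topic in dr-topic orders payments; do
  echo "$topic" >> mirrored-topics.txt
  # Guarded so the sketch is a no-op without an authenticated ccloud session.
  if command -v ccloud >/dev/null; then
    ccloud kafka mirror create "$topic" --link dr-link --cluster "$DR_CLUSTER_ID"
  fi
done
```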

At this point, you have a topic on your original cluster that is mirroring all of its data, ACLs, and consumer group offsets to a mirror topic on your DR cluster.

Produce and consume some data on the original cluster

In this section, you will simulate an application that is producing data to and consuming data from your original cluster. You will use the Confluent Cloud CLI's produce and consume commands to do so.

  1. Create a service account to represent your CLI-based demo clients.

    ccloud service-account create CLI --description "From CLI"
    

    Your output should resemble:

    +-------------+-----------+
    | Id          |    254262 |
    | Resource ID | sa-ldr3w1 |
    | Name        | CLI       |
    | Description | From CLI  |
    +-------------+-----------+
    

    In the above example, the <cli-service-account-id> is 254262.

  2. Create an API key and secret for this service account on your original cluster, and save these as <original-CLI-api-key> and <original-CLI-api-secret>.

    ccloud api-key create --resource <original-cluster-id> --service-account <cli-service-account-id>
    
  3. Create an API key and secret for this service account on your DR cluster, and save these as <DR-CLI-api-key> and <DR-CLI-api-secret>.

    ccloud api-key create --resource <DR-cluster-id> --service-account <cli-service-account-id>
    
  4. Give your CLI service account enough ACLs to produce and consume messages on your Original cluster.

    ccloud kafka acl create --service-account <cli-service-account-id> --allow --operation READ --operation DESCRIBE --operation WRITE --topic "*" --cluster <original-cluster-id>
    ccloud kafka acl create --service-account <cli-service-account-id> --allow --operation DESCRIBE --operation READ --consumer-group "*" --cluster <original-cluster-id>
    

Now you can produce and consume some data on the original cluster.

  1. Tell your CLI to use your original API key on your original cluster.

    ccloud api-key use <original-cluster-api-key> --resource <original-cluster-id>
    
  2. Produce the numbers 1-5 to your topic.

    seq 1 5 | ccloud kafka topic produce dr-topic --cluster <original-cluster-id>
    

    You should see this output, but the command should complete without needing you to press ^C or ^D.

    Starting Kafka Producer. ^C or ^D to exit
    
  3. Start a CLI consumer to read from the dr-topic topic, and give it the name cli-consumer.

    As a part of this command, pass in the flag --from-beginning to tell the consumer to start from offset 0.

    ccloud kafka topic consume dr-topic --group cli-consumer --from-beginning --cluster <original-cluster-id>
    

    After the consumer reads all 5 messages, press Ctrl + C to quit the consumer.

  4. In order to observe how consumers pick up from the correct offset on a failover, artificially force some consumer lag on your consumer.

    Produce numbers 6-10 to your topic.

    seq 6 10 | ccloud kafka topic produce dr-topic
    

    You should see the following output, but the command should complete without needing you to press ^C or ^D.

    Starting Kafka Producer. ^C or ^D to exit
    

Now, you have produced 10 messages to your topic on your original cluster, but your cli-consumer has only consumed 5.

Monitoring mirroring lag

Because Cluster Linking is an asynchronous process, there may be mirroring lag between the source cluster and the destination cluster.

You can see what your mirroring lag is on a per-partition basis for your DR topic with this command:

ccloud kafka mirror describe dr-topic --link dr-link --cluster <DR-cluster-id>

   LinkName | MirrorTopicName | Partition | PartitionMirrorLag | SourceTopicName | MirrorStatus | StatusTimeMs
  ----------+-----------------+-----------+--------------------+-----------------+--------------+---------------
   dr-link  | dr-topic        |         0 |                  0 | dr-topic        | ACTIVE       | 1624030963587
   dr-link  | dr-topic        |         1 |                  0 | dr-topic        | ACTIVE       | 1624030963587
   dr-link  | dr-topic        |         2 |                  0 | dr-topic        | ACTIVE       | 1624030963587
   dr-link  | dr-topic        |         3 |                  0 | dr-topic        | ACTIVE       | 1624030963587
   dr-link  | dr-topic        |         4 |                  0 | dr-topic        | ACTIVE       | 1624030963587
   dr-link  | dr-topic        |         5 |                  0 | dr-topic        | ACTIVE       | 1624030963587

You can also monitor your lag and your mirroring metrics through the Confluent Cloud Metrics API. These two metrics are exposed:

  • MaxLag shows the maximum lag (in number of messages) across the partitions being mirrored. It is available on a per-topic and a per-link basis, and indicates how much data would exist only on the original cluster at the point of failover.
  • Mirroring Throughput shows, on a per-link or per-topic basis, how much data is being mirrored.
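These metrics can also be pulled programmatically with curl. The sketch below follows the general Metrics API query format; the metric descriptor name (`cluster_link_mirror_topic_offset_lag`) and the interval syntax are assumptions, so confirm them against the Metrics API reference before relying on them.

```shell
# Build a Metrics API query for mirroring lag on the DR cluster.
# The metric name below is an assumption; check the Metrics API reference.
DR_CLUSTER_ID="lkc-r68yp"   # example ID from this tutorial

cat > lag-query.json <<EOF
{
  "aggregations": [
    {"metric": "io.confluent.kafka.server/cluster_link_mirror_topic_offset_lag"}
  ],
  "filter": {"field": "resource.kafka.id", "op": "EQ", "value": "$DR_CLUSTER_ID"},
  "granularity": "PT1M",
  "intervals": ["PT1H/now"]
}
EOF

# POST the query with a Cloud API key that has metrics access.
# METRICS_API_KEY / METRICS_API_SECRET are placeholder environment variables.
if [ -n "${METRICS_API_KEY:-}" ] && [ -n "${METRICS_API_SECRET:-}" ]; then
  curl -s -u "$METRICS_API_KEY:$METRICS_API_SECRET" \
    -H 'Content-Type: application/json' \
    -d @lag-query.json \
    https://api.telemetry.confluent.cloud/v2/metrics/cloud/query
fi
```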

Simulate a failover to the DR cluster

In a disaster event, your original cluster is usually unreachable.

In this section, you will go through the steps you follow on the DR cluster in order to resume operations.

  1. Fail over the mirror topic to convert it into a regular, writable topic.

    ccloud kafka mirror failover <mirror-topic-name> --link <link-name> --cluster <DR-cluster-id>
    

    Expected output:

    MirrorTopicName | Partition | PartitionMirrorLag | ErrorMessage | ErrorCode
    ----------------+-----------+--------------------+--------------+-----------
    dr-topic        |         0 |                  0 |              |
    

    The failover command is irreversible. Once you convert your mirror topic to a regular topic, you cannot change it back to a mirror topic. If you want it to be a mirror topic once again, you will need to delete it and recreate it as a mirror topic.

  2. Now you can produce and consume data on the DR cluster.

    Set your CLI to use the DR cluster’s API key:

    ccloud api-key use <DR-CLI-api-key> --resource <DR-cluster-id>
    
  3. Produce numbers 11-15 on the topic, to show that it is a writeable topic.

    seq 11 15 | ccloud kafka topic produce dr-topic --cluster <DR-cluster-id>
    

    You should see this output, but the command should complete without needing you to press ^C or ^D.

    Starting Kafka Producer. ^C or ^D to exit
    
  4. “Move” your consumer group to the DR cluster, and consume from dr-topic on the DR cluster.

    ccloud kafka topic consume dr-topic --group cli-consumer --cluster <DR-cluster-id>
    

    You should expect the consumer to start consuming at number 6, since that’s where it left off on the original cluster. If it does, that will show that its consumer offset was correctly synced. It should consume through number 15, which is the last message you produced on the DR cluster.

    After you see number 15, hit Ctrl + C to quit your consumer.

    Expected output:

    Starting Kafka Consumer. ^C or ^D to exit
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    ^CStopping Consumer.
    

You have now failed over your CLI producer and consumer to the DR cluster, where they continued operations smoothly.
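After the failover, you can sanity-check the topic's state from the CLI. A hedged sketch; the exact columns and status strings vary by CLI version, so no output is assumed here.

```shell
# List mirror topics on the DR cluster; after a failover, dr-topic should no
# longer report an active mirror state. Guarded so the sketch is a no-op
# without an authenticated ccloud session.
DR_CLUSTER_ID="lkc-r68yp"   # example ID from this tutorial

if command -v ccloud >/dev/null; then
  ccloud kafka mirror list --cluster "$DR_CLUSTER_ID"
fi
```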

Recovery after a disaster

If your original cluster experienced a temporary failure, such as a cloud provider regional outage, it may eventually come back online.

Recovering lagged data

Because Cluster Linking is asynchronous, you may have had data on the original cluster that did not make it over to the DR cluster at the time of failure.

  1. To recover this data, you first need to find the offsets at which the mirror topics were failed over. These offsets are persisted on your DR cluster. You can call a CLI command or a REST API command to get them on a per-topic basis; the command returns the offset at which each partition was failed over. We are shipping this command to production in the coming weeks.
  2. With those offsets in hand, you can then point consumers at your original cluster, and reset their offsets to these. Your consumers can then recover the data and act on it as appropriate for their application.
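Step 2 can be sketched with the Apache Kafka kafka-consumer-groups tool pointed at the original cluster. The group name, the offset value, and the client.properties file (holding SASL credentials for the original cluster) are all hypothetical.

```shell
ORIGINAL_BOOTSTRAP="pkc-9kyp5.us-east-1.aws.confluent.cloud:9092"
FAILOVER_OFFSET=10   # hypothetical per-partition offset reported at failover

# Reset a recovery consumer group to the failover offset on partition 0 of
# dr-topic, then consume from there to pick up the lagged tail. Guarded so
# the sketch is a no-op when the Kafka CLI tools are not installed.
if command -v kafka-consumer-groups >/dev/null; then
  kafka-consumer-groups --bootstrap-server "$ORIGINAL_BOOTSTRAP" \
    --command-config client.properties \
    --group lagged-data-recovery \
    --topic dr-topic:0 \
    --reset-offsets --to-offset "$FAILOVER_OFFSET" --execute
fi
```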

Moving operations back to the original cluster

Most Confluent Cloud users choose to promote the DR cluster to their new main cluster, and keep operations running on it.

If you want to move operations back to your original cluster, here are the steps to do so:

  1. Recover lagged data, as described above.
  2. Pause your cluster link that was going to the DR cluster.
  3. Identify the topics that you wish to move from the DR cluster to the original cluster. If any of these topics still exist on the original cluster, you will need to delete them.
  4. Migrate from the DR cluster to your Original cluster as follows:
    1. Create a cluster link from the DR cluster to the original cluster. Be sure to sync consumer group offsets. Consider this a “reverse link”.
    2. Create mirror topics on the original cluster for each of these topics.
    3. Move your consumer groups from the DR cluster to the original cluster at will. Each time you move a consumer group, you should exclude it from the reverse-link’s consumer group offset sync.
    4. When you’ve moved your consumers and your mirroring lag is low, stop your producers. You can do this on a per-topic basis; you do not need to move all topics at once.
    5. When your mirroring lag is 0 for these topics, call the promote command on the mirror topics that are on the original cluster. This command checks to make sure there is no mirroring lag.
    6. Restart your producers, pointing at the topics on the Original cluster.
  5. Finally, when you’ve moved all of your topics, consumers, and producers, restore the DR relationship.
    1. Delete the reverse link.
    2. You can delete the topics on your DR cluster.
    3. Resume the original cluster link.
    4. Recreate mirror topics for the topics you wish to DR.
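The failback sequence above can be sketched end-to-end with the command patterns used earlier in this tutorial. The promote and link-delete invocations are assumptions for the preview CLI, and the DR bootstrap server is left as a placeholder.

```shell
ORIGINAL_CLUSTER_ID="lkc-xkd1g"        # example IDs from this tutorial
DR_CLUSTER_ID="lkc-r68yp"
DR_BOOTSTRAP="<DR-bootstrap-server>"   # placeholder; use your DR endpoint

# Reverse-link configuration: sync consumer group offsets back as well.
cat > reverse-link.config <<'EOF'
consumer.offset.sync.enable=true
EOF

# Guarded so the sketch is a no-op without an authenticated ccloud session.
if command -v ccloud >/dev/null; then
  # 1. Reverse link: the DR cluster is now the source, the original the
  #    destination.
  ccloud kafka link create reverse-link \
    --cluster "$ORIGINAL_CLUSTER_ID" \
    --source-cluster-id "$DR_CLUSTER_ID" \
    --source-bootstrap-server "$DR_BOOTSTRAP" \
    --config-file reverse-link.config

  # 2. Mirror the topic back onto the original cluster.
  ccloud kafka mirror create dr-topic --link reverse-link \
    --cluster "$ORIGINAL_CLUSTER_ID"

  # 3. Once mirroring lag reaches 0, promote the mirror topic. Promote
  #    checks for lag before converting it to a regular topic.
  ccloud kafka mirror promote dr-topic --link reverse-link \
    --cluster "$ORIGINAL_CLUSTER_ID"

  # 4. Tear down the reverse link once everything has moved back.
  ccloud kafka link delete reverse-link --cluster "$ORIGINAL_CLUSTER_ID"
fi
```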

Suggested Resources

  • This tutorial covered a specific use case for disaster recovery. Share Data Across Topics provides a tutorial on topic data sharing, another basic use case for Cluster Linking.
  • Mirror Topics provides a concept overview of this feature of Cluster Linking.