Apache Kafka’s MirrorMaker
MirrorMaker is a stand-alone tool for copying data between two Apache Kafka clusters. It is little more than a Kafka consumer and producer hooked together.
Data is read from topics in the origin cluster and written to a topic with the same name in the destination cluster. You can run many such mirroring processes to increase throughput and for fault tolerance (if one process dies, the others will take over the additional load).
The origin and destination clusters are completely independent entities: they can have different numbers of partitions and the offsets will not be the same. For this reason the mirror cluster is not really intended as a fault-tolerance mechanism (as the consumer position will be different). The mirror maker process will, however, retain and use the message key for partitioning so order is preserved on a per-key basis.
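The per-key ordering guarantee follows from key-based partitioning: records with the same key always map to the same partition within a cluster, whatever that cluster’s partition count. The following is a minimal sketch of that idea, not MirrorMaker’s code. It assumes a simple hash-modulo scheme and uses CRC32 as a stand-in for the hash used by Kafka’s default partitioner; `partition_for`, `route`, `values_for` and the sample records are hypothetical names chosen for illustration.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 is a stand-in, an assumption)."""
    return zlib.crc32(key) % num_partitions


def route(records, num_partitions):
    """Group records by partition, preserving arrival order inside each partition."""
    partitions = {}
    for key, value in records:
        partitions.setdefault(partition_for(key, num_partitions), []).append((key, value))
    return partitions


def values_for(partitions, key):
    """Return the values stored for one key, in the order they sit in its partition."""
    return [v for part in partitions.values() for k, v in part if k == key]


# Records in arrival order. The partition counts (4 and 6) are arbitrary examples.
records = [(b"user-1", "a"), (b"user-2", "b"), (b"user-1", "c"), (b"user-2", "d")]

origin = route(records, 4)       # origin cluster with 4 partitions
destination = route(records, 6)  # destination cluster with 6 partitions
```

The partition numbers may differ between the two clusters, but each key’s records stay together in one partition and keep their relative order.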
Comparing MirrorMaker to Confluent Replicator
Let’s start with an overview of features that exist in Replicator and MirrorMaker:
| Feature | Description | MirrorMaker | Replicator |
|---|---|---|---|
| Data replication | Real-time event streaming between Kafka clusters and data centers | Yes | Yes |
| Schema replication | Integrate with Confluent Schema Registry for multi-dc data quality and governance | Yes | Yes |
| Connect replication | Manage data integration across multiple data centers | Yes | Yes |
| Flexible topic selection | Select topics with white-lists, black-lists and regular expressions | Partial* | Yes |
| Auto-create topics | New topics are automatically detected and replicated | Partial* | Yes |
| Add new partitions | New partitions are automatically detected and replicated | No | Yes |
| Configuration replication | Identical topic configuration between the two clusters | No | Yes |
| Single Message Transformations | Filter, modify and route events on the fly | No | Yes |
| Auto-scale | Scale replication processes as Kafka traffic increases with a single configuration | No | Yes |
| Active-active replication | Redirect events to avoid infinite replication loops in active-active configurations | No | Yes |
| Aggregate cluster | One management point for replicating more than a single cluster | No | Yes |
| Control Center integration | Manage and monitor replication via Control Center UI | No | Yes |
* MirrorMaker only supports regular expressions for topic selection, not lists. MirrorMaker can auto-create topics if the destination Kafka cluster is configured for auto topic creation, but these topics will be created with default configuration - which may or may not be correct.
The list is long, but three primary areas will help you decide whether MirrorMaker or Replicator is right for you:
- Replication control
- Management and monitoring
- Replication patterns
Greater control for reliable replication
One of the first differences you’ll run into is the fundamental control of replication. MirrorMaker and Replicator both guarantee message replication to a destination datacenter, but Replicator goes beyond message delivery to fill gaps that production deployments often run into with MirrorMaker. As mentioned above, MirrorMaker by design does not propagate non-message changes to your destination datacenter. You will need to manually keep your topics, replication factors, and partition configurations in sync, which means the configuration of the two clusters can easily diverge due to human error. In contrast, Replicator keeps your data, number of partitions, and topic configuration in sync. In addition, you can configure Replicator to create new topics in the destination cluster, with the correct configuration, whenever a topic is added on the origin cluster. If you wish to modify events while replicating them (for example, to filter out certain events, remove sensitive fields, or route some events to different topics), Replicator integrates with Kafka’s Single Message Transformations and can perform those modifications.
Deploying, managing, and monitoring
The second factor to consider is how you are going to keep MirrorMaker or Replicator running. As you’ll see in the example below, MirrorMaker comes with a complex configuration spread across multiple files, requires keeping those configs in sync across instances, and is yet another service to maintain and scale. By comparison, Replicator helps avoid these issues. Scalability and fault tolerance are built in by leveraging Kafka’s Connect API, and the same API also provides a central point for configuration, so you only need to issue a configuration change once through the API and all replication tasks will immediately reflect the modification. If you are using Confluent Control Center, you can use it to configure, deploy, and monitor Multi-Datacenter Replication. You can monitor MirrorMaker with various open source and generic tools, which cover high-level metrics such as availability and replication lag. Control Center provides end-to-end latency monitoring, from the producer on the origin cluster to the consumer at the destination cluster, and everything in between.
The third factor is how easy it is to implement your chosen replication patterns. Active-active replication can be a bit tricky because bi-directional replication of topic A in one cluster to topic A on another will cause an infinite loop. Replicator supports dynamically adding topic prefixes to replicated events, so topic A in one cluster will be replicated to topic C1.A and topic A in the other cluster will be replicated to topic C2.A. This is a well-known design pattern that is built into Confluent Replicator to make implementation easier.
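The prefixing pattern can be sketched in a few lines. This is an illustration of the design pattern only, not Replicator’s implementation or configuration: the function names `destination_topic` and `should_replicate` are hypothetical, and the rule of skipping topics that already carry a cluster prefix is an assumption about how the loop is broken.

```python
def destination_topic(topic: str, origin_cluster: str) -> str:
    """Prefix a replicated topic with the name of the cluster it came from."""
    return f"{origin_cluster}.{topic}"


def should_replicate(topic: str, cluster_names: set) -> bool:
    """Skip topics that already carry a cluster prefix (assumed rule: they are replicas)."""
    prefix, sep, _ = topic.partition(".")
    return not (sep and prefix in cluster_names)


clusters = {"C1", "C2"}

# Topics present in cluster C1: its own topic A plus the replica of C2's topic A.
topics_in_c1 = ["A", "C2.A"]

# Only locally produced topics are sent on to C2, under the C1 prefix.
replicated_to_c2 = [
    destination_topic(t, "C1") for t in topics_in_c1 if should_replicate(t, clusters)
]
```

Because the replica `C2.A` is never sent back, topic A can flow in both directions without looping.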
Another example is the aggregator pattern (also called the hub-and-spoke pattern), in which you replicate from multiple clusters into one. MirrorMaker requires installing multiple instances for each origin cluster, which can quickly become challenging to maintain. Replicator can support multiple origin clusters from one Connect cluster and can be far easier to deploy, configure, and monitor.
MirrorMaker is open source under the Apache License. Confluent Replicator is proprietary and is included with the Confluent support subscription.
Example of MirrorMaker Use
Here is an example showing how to mirror a single topic (named my-topic) from two input clusters:
$ bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config consumer-1.properties \
    --consumer.config consumer-2.properties --producer.config producer.properties \
    --whitelist my-topic
Note that we specify the list of topics with the --whitelist option. This option accepts any Java-style regular expression, so you could mirror two topics named A and B using --whitelist 'A|B', or mirror all topics using --whitelist '.*'. Make sure to quote any regular expression so that the shell doesn’t try to expand it as a file path. For convenience, ‘,’ can be used instead of ‘|’ to specify a list of topics.
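To see which topic names a given whitelist would select, the matching can be approximated as below. This is a hedged sketch rather than MirrorMaker’s own logic: it uses Python’s `re` module in place of Java regular expressions (the two agree for simple patterns like these), it assumes the pattern must match the whole topic name, and `is_whitelisted` and the topic list are hypothetical.

```python
import re


def is_whitelisted(topic: str, whitelist: str) -> bool:
    """Return True if the whole topic name matches the whitelist pattern (assumed semantics)."""
    # Treat ',' as a synonym for '|', as the text above describes.
    pattern = whitelist.replace(",", "|")
    return re.fullmatch(pattern, topic) is not None


topics = ["A", "B", "AB", "my-topic"]

mirrored_ab = [t for t in topics if is_whitelisted(t, "A|B")]
mirrored_list = [t for t in topics if is_whitelisted(t, "A,B")]
mirrored_all = [t for t in topics if is_whitelisted(t, ".*")]
```

Under the whole-name assumption, 'A|B' selects topics A and B but not a topic named AB, and 'A,B' selects the same set.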
Sometimes it is easier to say what you don’t want. Instead of using --whitelist to say what you want to mirror, you can use --blacklist to say what to exclude. This also takes a regular expression argument.
Combining mirroring with the configuration auto.create.topics.enable=true makes it possible to have a replica cluster that will automatically create and replicate all data in an origin cluster even as new topics are added. These topics will be created with default configuration, which may or may not be correct.