Running ZooKeeper in Production

Apache Kafka® uses ZooKeeper to store persistent cluster metadata and is a critical component of the Confluent Platform deployment. For example, if you lost the Kafka data in ZooKeeper, the mapping of replicas to Brokers and topic configurations would be lost as well, making your Kafka cluster no longer functional and potentially resulting in total data loss.

This document provides the key considerations before making your ZooKeeper cluster live, but is not a complete guide for running ZooKeeper in production. For more detailed information, see the ZooKeeper Administrator’s Guide.

Supported version

Confluent Platform ships a stable version of ZooKeeper. You can use the ENVI “four letter word” to find the current version of a running server with nc. For example:

echo envi | nc localhost 2181

This will display all of the environment information for the ZooKeeper server, including the version.

Note

Note that the ZooKeeper start script and functionality of ZooKeeper is tested only with this version of ZooKeeper.

Hardware

A production ZooKeeper cluster can cover a wide variety of use cases. The recommendations are intended to be general guidelines for choosing proper hardware for a cluster of ZooKeeper servers. Not all use cases can be covered in this document, so consider your use case and business requirements when making a final decision.

Memory

In general, ZooKeeper is not a memory intensive application when handling only data stored by Kafka. The physical memory needs of a ZooKeeper server scale with the size of the znodes stored by the ensemble. This is because each ZooKeeper holds all znode contents in memory at any given time. For Kafka, the dominant driver of znode creation is the number of partitions in the cluster. In a typical production use case, a minimum of 4 GB of RAM should be dedicated for ZooKeeper use. Note that ZooKeeper is sensitive to swapping and any host running a ZooKeeper server should avoid swapping.

CPU

In general, ZooKeeper as a Kafka metadata store does not heavily consume CPU resources. However, if ZooKeeper is shared, or the host on which the ZooKeeper server is running is shared, CPU should be considered. ZooKeeper provides a latency sensitive function, so if it must compete for CPU with other processes, or if the ZooKeeper ensemble is serving multiple purposes, you should consider providing a dedicated CPU core to ensure context switching is not an issue.

Disks

Disk performance is vital to maintaining a healthy ZooKeeper cluster. Solid state drives (SSD) are highly recommended as ZooKeeper must have low latency disk writes in order to perform optimally. Each request to ZooKeeper must be committed to to disk on each server in the quorum before the result is available for read. A dedicated SSD of at least 64 GB in size on each ZooKeeper server is recommended for a production deployment. You can use autopurge.purgeInterval and autopurge.snapRetainCount to automatically cleanup ZooKeeper data and lower maintenance overhead.

JVM

ZooKeeper runs as a JVM. It is not notably heap intensive when running for the Kafka use case. A heap size of 1 GB is recommended for most use cases and monitoring heap usage to ensure no delays are caused by garbage collection.

Configuration Options

The ZooKeeper configuration properties file is located in /etc/kafka/zookeeper.properties.

ZooKeeper does not require configuration tuning for most deployments. Below are a few important parameters to consider. A complete list of configurations can be found in the ZooKeeper project page.

clientPort

This is the port where ZooKeeper clients will listen on. This is where the Brokers will connect to ZooKeeper. Typically this is set to 2181.

  • Type: int
  • Importance: required
dataDir

The directory where ZooKeeper in-memory database snapshots and, unless specified in dataLogDir, the transaction log of updates to the database. This location should be a dedicated disk that is ideally an SSD. For more information, see the ZooKeeper Administration Guide.

  • Type: string
  • Importance: required
dataLogDir

The location where the transaction log is written to. If you don’t specify this option, the log is written to dataDir. By specifying this option, you can use a dedicated log device, and help avoid competition between logging and snapshots. For more information, see the ZooKeeper Administration Guide.

  • Type: string
  • Importance: optional
tickTime

The unit of time for ZooKeeper translated to milliseconds. This governs all ZooKeeper time dependent operations. It is used for heartbeats and timeouts especially. Note that the minimum session timeout will be two ticks.

  • Type: int
  • Default: 3000
  • Importance: high
maxClientCnxns

The maximum allowed number of client connections for a ZooKeeper server. To avoid running out of allowed connections set this to 0 (unlimited).

  • Type: int
  • Default: 60
  • Importance: high
autopurge.snapRetainCount

When enabled, ZooKeeper auto purge feature retains the autopurge.snapRetainCount most recent snapshots and the corresponding transaction logs in the dataDir and dataLogDir respectively and deletes the rest.

  • Type: int
  • Default: 3
  • Importance: high
autopurge.purgeInterval

The time interval in hours for which the purge task has to be triggered. Set to a positive integer (1 and above) to enable the auto purging.

  • Type: int
  • Default: 0
  • Importance: high

Monitoring

ZooKeeper servers should be monitored to ensure they are functioning properly and proactively identify issues. In this section, a set of common monitoring best practices is discussed.

Operating System

The underlying OS metrics can help predict when the ZooKeeper service will start to struggle. In particular, you should monitor:

  • Number of open file handles – this should be done system wide and for the user running the ZooKeeper process. Values should be considered with respect to the maximum allowed number of open file handles. ZooKeeper opens and closes connections often, and needs an available pool of file handles to choose from.
  • Network bandwidth usage – because ZooKeeper keeps track of state, it is sensitive to timeouts caused by network latency. If the network bandwidth is saturated, you may experience hard to explain timeouts with client sessions that will make your Kafka cluster less reliable.

“Four Letter Words”

ZooKeeper responds to a set of commands, each four letters in length. The documentation gives a description of each one and what version each became available. To run the commands you must send a message (via netcat or telnet) to the ZooKeeper client port. For example echo stat | nc localhost 2181 would return the output of the STAT command to stdout.

You should monitor the STAT and MNTR four letter words. Each environment will be somewhat different, but you will be monitoring for any rapid increase or decrease in any metric reported here. The metrics reported by these commands should be consistent except for number of packets sent and received which should increase slowly over time.

JMX Monitoring

Confluent Control Center monitors the Broker to ZooKeeper connection as shown here.

The ZooKeeper server also provides a number of JMX metrics that are described in the project documentation here.

Here are a few JMX metrics that are important to monitor:

  • NumAliveConnections - make sure you are not close to maximum as set with maxClientCnxns
  • OutstandingRequests - should be below 10 in general
  • AvgRequestLatency - target below 10 ms
  • HeapMemoryUsage (Java built-in) - should be relatively flat and well below max heap size

In addition to JMX metrics ZooKeeper provides, Kafka tracks a number of relevant ZooKeeper events via the SessionExpireListener that should be monitored to ensure the health of ZooKeeper-Kafka interactions. For information about the metrics, see ZooKeeper Metrics.

  • ZooKeeperAuthFailuresPerSec (secure environments only)
  • ZooKeeperDisconnectsPerSec
  • ZooKeeperExpiresPerSec
  • ZooKeeperReadOnlyConnectsPerSec
  • ZooKeeperSaslAuthenticationsPerSec (secure environments only)
  • ZooKeeperSyncConnectsPerSec
  • SessionState

Audit Logging

ZooKeeper supports audit logging in version 3.6.0 and later. The Apache Log4j log utility provides the audit logging. You can enable audit logs for ZooKeeper by doing the following:

  1. Ensure that audit logging is enabled by adding audit.enable=true in the zookeepeer.properties file for each instance of ZooKeeper. A sample properties file can be found at $CONFLUENT_HOME\etc\kafka\. In newer versions of Confluent Platform, this entry may already be present.

  2. Add the following entries to the log4j.properties file found under $CONFLUENT_HOME\etc\kafka\. In newer versions of Confluent Platform, these entries may already be present.

    log4j.logger.org.apache.zookeeper.audit.Log4jAuditLogger=INFO, zkAuditAppender
    log4j.additivity.org.apache.zookeeper.audit.Log4jAuditLogger=false
    
    log4j.appender.zkAuditAppender=org.apache.log4j.DailyRollingFileAppender
    log4j.appender.zkAuditAppender.DatePattern='.'yyyy-MM-dd-HH
    log4j.appender.zkAuditAppender.File=${kafka.logs.dir}/zookeeper-audit.log
    log4j.appender.zkAuditAppender.layout=org.apache.log4j.PatternLayout
    log4j.appender.zkAuditAppender.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
    log4j.appender.zkAuditAppender.Threshold=INFO
    

Multi-node Setup

In a production environment, the ZooKeeper servers will be deployed on multiple nodes. This is called an ensemble. An ensemble is a set of 2n + 1 ZooKeeper servers where n is any number greater than 0. The odd number of servers allows ZooKeeper to perform majority elections for leadership. At any given time, there can be up to n failed servers in an ensemble and the ZooKeeper cluster will keep quorum. If at any time, quorum is lost, the ZooKeeper cluster will go down. A few general considerations for multi-node ZooKeeper ensembles:

  • Start with a small ensemble of three or five servers and only scale as truly necessary (or as required for fault tolerance). Each write must be propagated to a quorum of servers in the ensemble. This means that if you choose to run an ensemble of three servers, you are tolerant to one server being lost, and writes must propagate to two servers before they are committed. If you have five servers, you are tolerant to two servers being lost, but writes must propagate to three servers before they are committed.
  • Redundancy in the physical/hardware/network layout: try not to put them all in the same rack, decent (but don’t go nuts) hardware, try to keep redundant power and network paths, etc.
  • Virtualization: place servers in different availability zones to avoid correlated crashes and to make sure that the storage system available can accommodate the requirements of the transaction logging and snapshotting of ZooKeeper.
  • When it doubt, keep it simple. ZooKeeper holds important data, so prefer stability and durability over trying a new deployment model in production.

A multi-node setup does require a few additional configurations. There is a comprehensive overview of these in the project documentation. Below is a concrete example configuration to help get you started.

tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24

This example is for a 3 node ensemble. Note that this configuration file is expected to be identical across all members of the ensemble. Breaking this down, you can see tickTime, dataDir, and clientPort are all set to typical single server values. The initLimit and syncLimit are used to govern how long following ZooKeeper servers can take to initialize with the current leader and how long they can be out of sync with the leader. In this configuration, a follower can take 10000ms to initialize and may be out of sync for up to 4000ms based on the tickTime being set to 2000ms.

The server.* properties set the ensemble membership. The format is server.<myid>=<hostname>:<leaderport>:<electionport>. Some explanation:

  • myid is the server identification number. In this example, there are three servers, so each one will have a different myid with values 1, 2, and 3 respectively. The myid is set by creating a file named myid in the dataDir that contains a single integer in human readable ASCII text. This value must match one of the myid values from the configuration file. If another ensemble member has already been started with a conflicting myid value, an error will be thrown upon startup.
  • leaderport is used by followers to connect to the active leader. This port should be open between all ZooKeeper ensemble members.
  • electionport is used to perform leader elections between ensemble members. This port should be open between all ZooKeeper ensemble members.

The autopurge.snapRetainCount and autopurge.purgeInterval have been set to purge all but three snapshots every 24 hours.

Post Deployment

After you deploy your ZooKeeper cluster, ZooKeeper largely runs without much maintenance. The project documentation contains many helpful tips on operations such as managing cleanup of the dataDir, logging, and troubleshooting.

Migrating to KRaft

Starting with Confluent Platform version 7.0.0, KRaft, the planned replacement for ZooKeeper, is available in Preview. For details on how to get started, see Kafka Raft (KRaft) in Confluent Platform Preview.