.. _zk-prod-deployment:

==========================
Running |zk| in Production
==========================

|ak-tm| uses |zk| to store persistent cluster metadata, which makes |zk| a critical component of a |cp| deployment. For example, if you lost the `Kafka data in ZooKeeper `_, the mapping of replicas to brokers and the topic configurations would be lost as well, making your |ak| cluster no longer functional and potentially resulting in total data loss.

This document covers the key considerations before making your |zk| cluster live, but it is not a complete guide for running |zk| in production. For more detailed information, see the `ZooKeeper Administrator's Guide `_.

.. _kafka_zookeeper_deployment:

Supported version
~~~~~~~~~~~~~~~~~

|cp| ships a stable version of |zk|. You can use the ``envi`` "four letter word" to find the current version of a running server with ``nc``. For example:

.. codewithvars:: bash

    echo envi | nc localhost 2181

This displays all of the environment information for the |zk| server, including the version.

.. note:: The |zk| start script and the functionality of |zk| are tested only with this version of |zk|.

Hardware
~~~~~~~~

A production |zk| cluster can cover a wide variety of use cases. The recommendations below are intended as general guidelines for choosing proper hardware for a cluster of |zk| servers. Not all use cases can be covered in this document, so consider your use case and business requirements when making a final decision.

~~~~~~
Memory
~~~~~~

In general, |zk| is not a memory intensive application when handling only data stored by |ak|. The physical memory needs of a |zk| server scale with the size of the znodes stored by the ensemble. This is because each |zk| server holds all znode contents in memory at any given time. For |ak|, the dominant driver of znode creation is the number of partitions in the cluster.

In a typical production use case, a minimum of 8 GB of RAM should be dedicated to |zk| use. Note that |zk| is sensitive to swapping, and any host running a |zk| server should avoid swapping.

~~~
CPU
~~~

In general, |zk| as a |ak| metadata store does not heavily consume CPU resources. However, if |zk| is shared, or the host on which the |zk| server is running is shared, CPU should be considered. |zk| provides a latency sensitive function, so if it must compete for CPU with other processes, or if the |zk| ensemble is serving multiple purposes, you should consider providing a dedicated CPU core to ensure context switching is not an issue.

~~~~~
Disks
~~~~~

Disk performance is vital to maintaining a healthy |zk| cluster. Solid state drives (SSD) are highly recommended, because |zk| must have low latency disk writes in order to perform optimally. Each request to |zk| must be committed to disk on every server in the quorum before the result is available for read. A dedicated SSD of at least 64 GB on each |zk| server is recommended for a production deployment. You can use ``autopurge.purgeInterval`` and ``autopurge.snapRetainCount`` to automatically clean up |zk| data and lower maintenance overhead.

JVM
~~~

|zk| runs as a JVM process. It is not notably heap intensive when running for the |ak| use case. A heap size of 1 GB is recommended for most use cases, and you should monitor heap usage to ensure that garbage collection does not introduce delays.
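As a minimal sketch of applying this recommendation, the example below overrides the heap size when starting |zk| with the |cp| scripts. It assumes the ``zookeeper-server-start`` wrapper passes the ``KAFKA_HEAP_OPTS`` environment variable through to the JVM, as the |ak| run scripts do; adjust for your packaging and service manager.

.. codewithvars:: bash

    # Start ZooKeeper with a fixed 1 GB heap (assumes the start script
    # forwards KAFKA_HEAP_OPTS to the underlying JVM).
    KAFKA_HEAP_OPTS="-Xms1g -Xmx1g" \
      zookeeper-server-start /etc/kafka/zookeeper.properties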
.. _zk-prod-config:

Configuration Options
~~~~~~~~~~~~~~~~~~~~~

The |zk| configuration properties file is located in ``/etc/kafka/zookeeper.properties``. |zk| does not require configuration tuning for most deployments. Below are a few important parameters to consider. A complete list of configurations can be found in the `ZooKeeper project page `_.

``clientPort``
    The port on which |zk| listens for client connections; this is the port that brokers use to connect to |zk|. Typically this is set to 2181.

    * Type: int
    * Importance: required

``dataDir``
    The directory where |zk| stores the in-memory database snapshots and, unless specified in ``dataLogDir``, the transaction log of updates to the database. This location should be a dedicated disk, ideally an SSD. For more information, see the `ZooKeeper Administration Guide `_.

    * Type: string
    * Importance: required

``dataLogDir``
    The location where the transaction log is written. If you don't specify this option, the log is written to ``dataDir``. By specifying this option, you can use a dedicated log device and avoid competition between logging and snapshots. For more information, see the `ZooKeeper Administration Guide `_.

    * Type: string
    * Importance: optional

``tickTime``
    The basic unit of time for |zk|, in milliseconds. It governs all |zk| time dependent operations and is used especially for heartbeats and timeouts. Note that the minimum session timeout is two ticks.

    * Type: int
    * Default: 2000
    * Importance: high

``maxClientCnxns``
    The maximum allowed number of client connections for a |zk| server. To avoid running out of allowed connections, set this to 0 (unlimited).

    * Type: int
    * Default: 60
    * Importance: high

``autopurge.snapRetainCount``
    When auto purging is enabled, |zk| retains the ``autopurge.snapRetainCount`` most recent snapshots and the corresponding transaction logs in ``dataDir`` and ``dataLogDir`` respectively, and deletes the rest.

    * Type: int
    * Default: 3
    * Importance: high

``autopurge.purgeInterval``
    The time interval, in hours, at which the purge task is triggered. Set to a positive integer (1 and above) to enable auto purging.

    * Type: int
    * Default: 0
    * Importance: high

Monitoring
~~~~~~~~~~

|zk| servers should be monitored to ensure they are functioning properly and to proactively identify issues. This section discusses a set of common monitoring best practices.

~~~~~~~~~~~~~~~~
Operating System
~~~~~~~~~~~~~~~~

The underlying OS metrics can help predict when the |zk| service will start to struggle. In particular, you should monitor:

* Number of open file handles -- this should be done system wide and for the user running the |zk| process. Values should be considered with respect to the maximum allowed number of open file handles. |zk| opens and closes connections often, and needs an available pool of file handles to choose from.
* Network bandwidth usage -- because |zk| keeps track of state, it is sensitive to timeouts caused by network latency. If the network bandwidth is saturated, you may experience hard to explain timeouts with client sessions that will make your |ak| cluster less reliable.

~~~~~~~~~~~~~~~~~~~
"Four Letter Words"
~~~~~~~~~~~~~~~~~~~

|zk| responds to a set of commands, each four letters in length. The `documentation `__ gives a description of each one and the version in which it became available. To run the commands, you must send a message (via ``netcat`` or ``telnet``) to the |zk| client port. For example, ``echo stat | nc localhost 2181`` would return the output of the ``STAT`` command to stdout.

You should monitor the ``STAT`` and ``MNTR`` four letter words. Each environment will be somewhat different, but you should watch for any rapid increase or decrease in any metric reported here. The metrics reported by these commands should be consistent, except for the number of packets sent and received, which should increase slowly over time.
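For example, a simple way to spot-check both commands from a shell is shown below. On newer |zk| versions you may need to add these commands to the ``4lw.commands.whitelist`` server property before they respond; the metric names in the comments are illustrative of typical ``mntr`` output.

.. codewithvars:: bash

    # Dump server statistics: connections, latency, mode, node count
    echo stat | nc localhost 2181

    # Dump the monitoring variables, for example zk_avg_latency,
    # zk_outstanding_requests, and zk_num_alive_connections
    echo mntr | nc localhost 2181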
~~~~~~~~~~~~~~
JMX Monitoring
~~~~~~~~~~~~~~

:ref:`Confluent Control Center ` monitors the broker to |zk| connection as shown :ref:`here `. The |zk| server also provides a number of JMX metrics that are described in the project documentation `here `__.

Here are a few JMX metrics that are important to monitor:

* NumAliveConnections - make sure you are not close to the maximum as set with ``maxClientCnxns``
* OutstandingRequests - should be below 10 in general
* AvgRequestLatency - target below 10 ms
* MaxRequestLatency - target below 20 ms
* HeapMemoryUsage (Java built-in) - should be relatively flat and well below the maximum heap size

In addition to the JMX metrics |zk| provides, |ak| tracks a number of relevant |zk| events via the ``SessionExpireListener`` that should be monitored to ensure the health of |zk|-|ak| interactions:

* ZooKeeperAuthFailuresPerSec (secure environments only)
* ZooKeeperDisconnectsPerSec
* ZooKeeperExpiresPerSec
* ZooKeeperReadOnlyConnectsPerSec
* ZooKeeperSaslAuthenticationsPerSec (secure environments only)
* ZooKeeperSyncConnectsPerSec
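One way to expose and inspect the server-side metrics listed above is sketched below. It assumes the |zk| server is started with the |cp| scripts, which read the ``JMX_PORT`` environment variable, and uses the JDK's ``jconsole`` to browse the ``org.apache.ZooKeeperService`` MBeans; in production you would typically point your monitoring agent at the same JMX port instead.

.. codewithvars:: bash

    # Start ZooKeeper with remote JMX enabled on port 9999 (assumes the
    # start script reads JMX_PORT, as the Kafka run scripts do).
    JMX_PORT=9999 zookeeper-server-start /etc/kafka/zookeeper.properties

    # From another shell, browse the org.apache.ZooKeeperService MBeans
    # (NumAliveConnections, OutstandingRequests, latency metrics, and so on).
    jconsole localhost:9999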
Multi-node Setup
~~~~~~~~~~~~~~~~

In a production environment, the |zk| servers are deployed on multiple nodes. This is called an ensemble. An ensemble is a set of ``2n + 1`` |zk| servers, where ``n`` is any number greater than 0. The odd number of servers allows |zk| to perform majority elections for leadership. At any given time, there can be up to ``n`` failed servers in an ensemble and the |zk| cluster will keep ``quorum``. If ``quorum`` is lost at any time, the |zk| cluster will go down.

A few general considerations for multi-node |zk| ensembles:

* Start with a small ensemble of three or five servers and only scale as truly necessary (or as required for fault tolerance). Each write must be propagated to a ``quorum`` of servers in the ensemble. This means that if you choose to run an ensemble of three servers, you are tolerant to one server being lost, and writes must propagate to two servers before they are committed. If you have five servers, you are tolerant to two servers being lost, but writes must propagate to three servers before they are committed.
* Redundancy in the physical/hardware/network layout: try not to put all servers in the same rack, use decent (but not extravagant) hardware, and try to keep redundant power and network paths.
* Virtualization: place servers in different availability zones to avoid correlated crashes, and make sure that the available storage system can accommodate the requirements of |zk| transaction logging and snapshotting.
* When in doubt, keep it simple. |zk| holds important data, so prefer stability and durability over trying a new deployment model in production.

A multi-node setup does require a few additional configurations. There is a comprehensive overview of these in the `project documentation `__. Below is a concrete example configuration to help get you started.

.. codewithvars:: bash

    tickTime=2000
    dataDir=/var/lib/zookeeper/
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.1=zoo1:2888:3888
    server.2=zoo2:2888:3888
    server.3=zoo3:2888:3888
    autopurge.snapRetainCount=3
    autopurge.purgeInterval=24

This example is for a three node ensemble. Note that this configuration file is expected to be identical across all members of the ensemble. Breaking this down, you can see ``tickTime``, ``dataDir``, and ``clientPort`` are all set to typical single server values.

The ``initLimit`` and ``syncLimit`` govern how long following |zk| servers can take to initialize with the current leader and how long they can be out of sync with the leader. In this configuration, a follower can take 10000 ms to initialize and may be out of sync for up to 4000 ms, based on the ``tickTime`` being set to 2000 ms.

The ``server.*`` properties set the ensemble membership. The format is ``server.<myid>=<hostname>:<leaderport>:<electionport>``. Some explanation:

* ``myid`` is the server identification number. In this example, there are three servers, so each one will have a different ``myid`` with values ``1``, ``2``, and ``3`` respectively. The ``myid`` is set by creating a file named ``myid`` in the ``dataDir`` that contains a single integer in human readable ASCII text. This value must match one of the ``myid`` values from the configuration file. If another ensemble member has already been started with a conflicting ``myid`` value, an error will be thrown upon startup.
* ``leaderport`` is used by followers to connect to the active leader. This port should be open between all |zk| ensemble members.
* ``electionport`` is used to perform leader elections between ensemble members. This port should be open between all |zk| ensemble members.

The ``autopurge.snapRetainCount`` and ``autopurge.purgeInterval`` have been set to purge all but three snapshots every 24 hours.

Post Deployment
~~~~~~~~~~~~~~~

After you deploy your |zk| cluster, |zk| largely runs without much maintenance. The `project documentation `__ contains many helpful tips on operations such as managing cleanup of the ``dataDir``, logging, and troubleshooting.
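For example, if you prefer to trigger cleanup manually rather than relying on ``autopurge``, |zk| ships a ``PurgeTxnLog`` utility. The sketch below is a hedged example that assumes the |zk| classes are available on the |ak| classpath and can be invoked through ``kafka-run-class``; the directories match the example configuration above, where ``dataLogDir`` is not set separately.

.. codewithvars:: bash

    # Keep the three most recent snapshots (and their transaction logs)
    # and delete the rest. Both arguments are the same directory here
    # because dataLogDir defaults to dataDir in the example configuration.
    kafka-run-class org.apache.zookeeper.server.PurgeTxnLog \
      /var/lib/zookeeper /var/lib/zookeeper -n 3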