Optimizing for Availability

To optimize for high availability, you should tune your Kafka application to recover as quickly as possible from failure scenarios.

The values for some of the configuration parameters in this page depend on other factors, such as the average message size and number of partitions. These can differ greatly from environment to environment.

Some configuration parameters have a range of values, so benchmarking helps to validate the configuration for your application and environment.

Minimum Insync Replicas

When a producer sets acks=all or acks=-1, the configuration parameter min.insync.replicas specifies the minimum number of replicas that must acknowledge a write for the write to be considered successful. If this minimum cannot be met, then the producer raises an exception. In the case of a shrinking ISR, the higher this minimum value is, the more likely there is to be a failure on producer send, which decreases availability for the partition. On the other hand, by setting this value low (for example, min.insync.replicas=1), the system tolerates more replica failures. As long as the minimum number of replicas is met, the producer requests continue to succeed, which increases availability for the partition.

Consumer failures

Consumers in a consumer group can share processing load. If a consumer unexpectedly fails, Kafka can detect the failure and rebalance the partitions amongst the remaining consumers in the consumer group. The consumer failures can be hard failures (for example, SIGKILL) or soft failures (for example, expired session timeouts). These failures can be detected either when consumers fail to send heartbeats or when they fail to send poll() calls. The consumer liveness is maintained with a heartbeat, now in a background thread since KIP-62, and the configuration parameter session.timeout.ms dictates the timeout used to detect failed heartbeats. Increase the session timeout to take into account potential network delays and to avoid soft failures. Soft failures occur most commonly in two cases: when poll() returns a batch of messages that take too long to process or when a JVM GC pause takes too long. If you have a poll() loop that spends much time processing messages, you can do one of the following:

  • Increase the upper bound on the amount of time that a consumer can be idle before fetching more records with max.poll.interval.ms.
  • Reduce the maximum size of batches the max.poll.records configuration parameter returns.

Although higher session timeouts increase the amount of time to detect and recover from a consumer failure, failed client incidents are less likely than network issues.

Restoring Task Processing State

Finally, when rebalancing workloads by moving tasks between event streaming application instances, you can reduce the time it takes to restore task processing state before the application instance resumes processing. In Kafka Streams, you can accomplish state restoration by replaying the corresponding changelog topic to reconstruct the state store. The application can replicate local state stores to minimize changelog-based restoration time by setting the configuration parameter num.standby.replicas. Thus, when you initialize a stream task or reinitialize on the application instance, the state store is restored to the most recent snapshot accordingly:

  • If a local state store doesn’t exist (num.standby.replicas=0), then the changelog is replayed from the earliest offset.
  • If a local state store does exist (num.standby.replicas is greater than 0), the changelog is replayed from the previously checkpointed offset. This method takes less time because it is applies a smaller portion of the changelog.

Summary of Configurations for Optimizing Availability

Consumer

  • session.timeout.ms: increase (default 10000)

Streams

  • StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG: 1 or more (default 0)
  • Streams applications have embedded producers and consumers, so also check those configuration recommendations