Optimizing for Availability¶
To optimize for high availability, you should tune your Kafka application to recover as quickly as possible from failure scenarios.
Some configuration parameters in this page have a range of values. How you set them depends on your requirements and other factors, such as the average message size, number of partitions, and other differences in the environment. So, benchmarking helps to validate the configuration for your application and environment.
Minimum Insync Replicas¶
When a producer sets
acks=-1, the configuration parameter
min.insync.replicas specifies the minimum number of replicas that must
acknowledge a write for the write to be considered successful. If this minimum
cannot be met, then the producer raises an exception. In the case of a shrinking
ISR, the higher this minimum value is, the more likely there is to be a failure
on producer send, which decreases availability for the partition. On the other
hand, by setting this value low (for example,
system tolerates more replica failures. As long as the minimum number of
replicas is met, the producer requests continue to succeed, which increases
availability for the partition.
Consumers in a consumer group can share processing load. If a consumer
unexpectedly fails, Kafka can detect the failure and rebalance the partitions
amongst the remaining consumers in the consumer group. The consumer failures can
be hard failures (for example,
SIGKILL) or soft failures (for example,
expired session timeouts). These failures can be detected either when consumers
fail to send heartbeats or when they fail to send
poll() calls. The consumer
liveness is maintained with a heartbeat, now in a background thread since
and the configuration parameter
session.timeout.ms dictates the timeout used
to detect failed heartbeats. Increase the session timeout to take into account
potential network delays and to avoid soft failures. Soft failures occur most
commonly in two cases: when
poll() returns a batch of messages that take too
long to process or when a JVM GC pause takes too long. If you have a
loop that spends much time processing messages, you can do one of the following:
- Increase the upper bound on the amount of time that a consumer can be idle
before fetching more records with
- Reduce the maximum size of batches the
max.poll.recordsconfiguration parameter returns.
Although higher session timeouts increase the amount of time to detect and recover from a consumer failure, failed client incidents are less likely than network issues.
Restoring Task Processing State¶
Finally, when rebalancing workloads by moving tasks between event streaming
application instances, you can reduce the time it takes to restore task
processing state before the application instance resumes processing. In
Kafka Streams, you can accomplish state restoration by replaying the
corresponding changelog topic to reconstruct the state store. The application
can replicate local state stores to minimize changelog-based restoration time by
setting the configuration parameter
num.standby.replicas. Thus, when you
initialize a stream task or reinitialize on the application instance, the state
store is restored to the most recent snapshot accordingly:
- If a local state store doesn’t exist (
num.standby.replicas=0), then the changelog is replayed from the earliest offset.
- If a local state store does exist (
num.standby.replicasis greater than 0), the changelog is replayed from the previously checkpointed offset. This method takes less time because it is applies a smaller portion of the changelog.