Optimizing and Tuning
After you complete unit and integration testing, you can benchmark and optimize your applications based on your service goals to tune performance.
Benchmark testing is important because there is no one-size-fits-all recommendation for the configuration parameters you need when developing Kafka applications for Confluent Cloud. Proper configuration always depends on the use case, other features you have enabled, the data profile, and more. You should run benchmark tests if you plan to tune Kafka clients beyond the defaults. Regardless of your service goals, you should understand the performance profile of your application; this is especially important when you want to optimize for throughput or latency. Your benchmark tests can also feed into the calculations for determining the correct number of partitions and the number of producer and consumer processes.
First, measure your bandwidth using the Kafka performance tools.
For non-JVM clients that wrap librdkafka, you can use the rdkafka_performance interface.
This first round of results provides a performance baseline for your Confluent Cloud instance, taking application logic out of the equation.
Note that these perf tools do not support Schema Registry.
Then test your application, starting with the default Kafka configuration parameters, and familiarize yourself with the default values.
Determine the baseline input performance profile for a given producer by removing dependencies on anything upstream from the producer. Rather than receiving data from upstream sources, modify your producer to generate its own mock data at high output rates, such that the data generation is not a bottleneck. Ensure the mock data reflects the type of data used in production to produce results that more accurately reflect performance in production. Or, instead of using mock data, consider using copies of production data or cleansed production data in your benchmarking.
If you test with compression, be aware of how the mock data is generated. Sometimes mock data is unrealistic, containing repeated substrings or being padded with zeros, which may result in a better compression performance than what would be seen in production.
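The effect of unrealistic mock data on compression can be checked before a benchmark run. The following is a minimal sketch, assuming `zlib` from the Python standard library as a stand-in for whichever codec your producer uses; the helper names (`compression_ratio`, `padded_record`, `varied_record`) and the record layout are hypothetical, and the actual ratios depend on your data and codec.

```python
import random
import string
import zlib


def compression_ratio(payload: bytes) -> float:
    """Return the original size divided by the zlib-compressed size."""
    return len(payload) / len(zlib.compress(payload))


def padded_record(size: int = 1024) -> bytes:
    """Unrealistic mock record: a short field padded with zero bytes."""
    body = b'{"id": 1}'
    return body + b"\x00" * (size - len(body))


def varied_record(rng: random.Random, size: int = 1024) -> bytes:
    """Mock record whose content varies from character to character."""
    alphabet = string.ascii_letters + string.digits
    text = "".join(rng.choice(alphabet) for _ in range(size))
    return text.encode("ascii")


rng = random.Random(42)  # fixed seed so repeated runs are comparable
padded_ratio = compression_ratio(padded_record())
varied_ratio = compression_ratio(varied_record(rng))
print(f"zero-padded: {padded_ratio:.1f}x, varied: {varied_ratio:.1f}x")
```

If your mock data compresses far better than a sample of production data run through the same check, throughput results with compression enabled are likely to be optimistic.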
- Run a single producer client on a single server and measure the resulting throughput using the available JMX metrics for the Kafka producer. Repeat the producer benchmarking test, increasing the number of producer processes on the server in each iteration, to determine the number of producer processes per server that achieves the highest throughput.
- Determine the baseline output performance profile for a given consumer in a similar way. Run a single consumer client on a single server and repeat this test, increasing the number of consumer processes on the server in each iteration, to determine the number of consumer processes per server that achieves the highest throughput.
- Run benchmark tests for different permutations of configuration parameters that reflect your service goals. Focus on a subset of configuration parameters, and avoid the temptation to discover and change other parameters from their default values without understanding exactly how they impact the entire system.
Tune the settings on each iteration, run a test, observe the results, tune again, and so on, until you identify settings that work for your throughput and latency requirements.
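The tune, run, observe loop can be organized as a small harness that walks through permutations of a deliberately limited set of parameters. This is only a sketch: `config_grid`, `best_config`, and `fake_benchmark` are hypothetical names, the parameter values are placeholders rather than recommendations, and `fake_benchmark` returns a made-up score where a real harness would start a producer with the given settings and report measured throughput.

```python
import itertools
from typing import Callable, Dict, Iterable, List, Tuple

Config = Dict[str, int]


def config_grid(options: Dict[str, List[int]]) -> Iterable[Config]:
    """Yield every permutation of the given parameter values."""
    names = list(options)
    for values in itertools.product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def best_config(
    options: Dict[str, List[int]],
    run_benchmark: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Run the benchmark once per permutation and keep the highest score."""
    results = [(cfg, run_benchmark(cfg)) for cfg in config_grid(options)]
    return max(results, key=lambda item: item[1])


# A small subset of producer parameters; the values are placeholders.
options = {
    "batch.size": [16384, 65536, 131072],
    "linger.ms": [0, 10, 50],
}


def fake_benchmark(cfg: Config) -> float:
    """Stand-in for a real test run; returns a made-up throughput score."""
    return cfg["batch.size"] / 1024 + cfg["linger.ms"] * 0.5


winner, score = best_config(options, fake_benchmark)
print(winner, score)
```

Keeping the grid small is intentional: every added parameter multiplies the number of test runs, and it becomes harder to attribute a change in results to a single setting.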
Refer to this blog post when considering partition count in your benchmark tests.
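As a rough illustration of how benchmark results can feed a partition estimate, the sketch below uses a commonly cited rule of thumb: take the target throughput, divide it by the measured per-partition producer throughput and by the measured per-partition consumer throughput, and use the larger result. Treat this as an assumption rather than the method of the referenced post; the function name and the numbers are hypothetical.

```python
import math


def min_partitions(
    target_mb_s: float, producer_mb_s: float, consumer_mb_s: float
) -> int:
    """Partitions needed so both producing and consuming can reach the target.

    producer_mb_s and consumer_mb_s are per-partition throughputs
    measured in your own benchmark runs.
    """
    if min(target_mb_s, producer_mb_s, consumer_mb_s) <= 0:
        raise ValueError("throughput values must be positive")
    return max(
        math.ceil(target_mb_s / producer_mb_s),
        math.ceil(target_mb_s / consumer_mb_s),
    )


# Hypothetical measurements: 20 MB/s per partition producing, 25 MB/s consuming.
partitions = min_partitions(target_mb_s=200, producer_mb_s=20, consumer_mb_s=25)
print(partitions)
```

The estimate is a lower bound for throughput only; other considerations, such as key distribution and future growth, may call for a different count.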
Determining your Service Goals
Though it may take only a few seconds to get your Kafka client application up and running on Confluent Cloud, you should tune your application before going into production. Because different use cases have different requirements that drive different service goals, you must decide which service goal to optimize for.
You should consider the following criteria when identifying your service goals:
- The use cases your Kafka applications serve.
- Your application and business requirements: the elements that can’t fail for the use case to be satisfied.
- How Kafka fits into the pipeline of your business.
While it may be hard to answer the question of which metrics to optimize, it is important that you discuss the original business use cases and main goals with your team for the following two reasons:
- You are unable to maximize all goals at the same time. There are occasionally trade-offs between throughput, latency, durability, and availability. You may be familiar with the common trade-off in performance between throughput and latency and perhaps between durability and availability as well. As you consider the whole system, you may find that you can’t think about any of them in isolation, which is why this paper looks at all four service goals together. This doesn’t mean that optimizing one of these goals results in completely losing out on the others. It just means that they are all interconnected, and thus you can’t maximize all of them at the same time.
- You must identify the service goals you want to optimize so you can tune your Kafka configuration parameters to achieve them, and you must understand what your users expect from the system to ensure you are optimizing Kafka to meet their needs. You should take time to answer the following questions:
Do you want to optimize for high throughput, which is the rate at which data is moved from producers to brokers or brokers to consumers?
Use case: An application with millions of writes per second. Because of Kafka’s design, writing large volumes of data into it isn’t hard. It’s faster than trying to push volumes of data through a traditional database or key-value store, and it can be done with modest hardware.
Do you want to optimize for low latency, which is the time elapsed moving messages end to end (from producers to brokers to consumers)?
Use case: A chat application, where the recipient of a message needs to get the message with as little latency as possible. Other examples include interactive websites where users follow posts from friends in their network, or real-time stream processing for the Internet of Things (IoT).
Do you want to optimize for high durability, which guarantees that committed messages will not be lost?
Use case: An event streaming microservices pipeline using Kafka as the event store. Another is integration between an event streaming source and some permanent storage (for example, Amazon S3) for mission-critical business content.
Do you want to optimize for high availability, which minimizes downtime in case of unexpected failures?
Use case: An application that must be always on. Kafka should be optimized to recover from failures as quickly as possible.
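The throughput and latency definitions in the questions above can be made concrete with a small calculation over benchmark samples. This is a sketch under stated assumptions: the `Sample` record and its field names are hypothetical, timestamps are assumed to come from clocks that are synchronized between the producer and consumer hosts, and the sample data is synthetic.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List


@dataclass
class Sample:
    """One benchmark message; the field names are hypothetical."""
    produced_at: float  # seconds, when the producer sent the message
    consumed_at: float  # seconds, when the consumer received the message
    size_bytes: int


def throughput_mb_s(samples: List[Sample]) -> float:
    """Total bytes moved divided by the wall-clock span of the run."""
    start = min(s.produced_at for s in samples)
    end = max(s.consumed_at for s in samples)
    return sum(s.size_bytes for s in samples) / (end - start) / 1_000_000


def latency_ms(samples: List[Sample], percentile: int) -> float:
    """End-to-end latency at the given percentile (1 to 99), in milliseconds."""
    latencies = [(s.consumed_at - s.produced_at) * 1000 for s in samples]
    return quantiles(latencies, n=100, method="inclusive")[percentile - 1]


# Synthetic run: 100 messages of 1000 bytes, sent 10 ms apart,
# each delivered after 5 to 14 ms.
samples = [
    Sample(
        produced_at=i * 0.01,
        consumed_at=i * 0.01 + 0.005 + 0.001 * (i % 10),
        size_bytes=1000,
    )
    for i in range(100)
]
tp = throughput_mb_s(samples)
p50 = latency_ms(samples, 50)
p99 = latency_ms(samples, 99)
print(f"throughput: {tp:.4f} MB/s, p50: {p50:.1f} ms, p99: {p99:.1f} ms")
```

Reporting a high percentile alongside the median is a deliberate choice here: a configuration that improves average latency can still make tail latency worse, which matters for use cases like the chat application above.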