Tuning KafkaSpout Performance
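KafkaSpout provides two internal parameters to control performance:
offset.commit.period.ms specifies the period, in milliseconds, after which the spout commits offsets to Kafka. To set this parameter, use the KafkaSpoutConfig set method setOffsetCommitPeriodMs.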
max.uncommitted.offsets defines the maximum number of polled offsets (records) that can be pending commit before another poll can take place. When this limit is reached, no more offsets can be polled until the next successful commit brings the number of pending offsets below the threshold. To set this parameter, use the KafkaSpoutConfig set method setMaxUncommittedOffsets.
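The gating behavior described above can be sketched with a small toy model. This is not the KafkaSpout implementation: the class UncommittedOffsetGate and its methods are hypothetical names chosen for illustration, and the sketch assumes, as a simplification, that a poll is trimmed so the pending count never exceeds the limit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the max.uncommitted.offsets gate. Illustration only;
// the class and method names are hypothetical, not part of Storm.
final class UncommittedOffsetGate {
    private final int maxUncommittedOffsets;
    private final Deque<Long> pendingCommit = new ArrayDeque<>();
    private long nextOffset = 0L;

    UncommittedOffsetGate(int maxUncommittedOffsets) {
        if (maxUncommittedOffsets <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.maxUncommittedOffsets = maxUncommittedOffsets;
    }

    // Polls up to batchSize offsets. Returns 0 when the limit is reached.
    // Assumption of this sketch: the batch is trimmed to the remaining room.
    int poll(int batchSize) {
        int room = maxUncommittedOffsets - pendingCommit.size();
        int polled = Math.max(0, Math.min(batchSize, room));
        for (int i = 0; i < polled; i++) {
            pendingCommit.addLast(nextOffset++);
        }
        return polled;
    }

    // Commits up to count of the oldest pending offsets, freeing room to poll.
    int commit(int count) {
        int committed = 0;
        while (committed < count && !pendingCommit.isEmpty()) {
            pendingCommit.removeFirst();
            committed++;
        }
        return committed;
    }

    int pendingCount() {
        return pendingCommit.size();
    }

    public static void main(String[] args) {
        UncommittedOffsetGate gate = new UncommittedOffsetGate(10);
        System.out.println(gate.poll(6));   // 6 offsets polled
        System.out.println(gate.poll(6));   // 4, the limit of 10 is reached
        System.out.println(gate.poll(6));   // 0, polling is blocked
        System.out.println(gate.commit(3)); // 3 offsets committed
        System.out.println(gate.poll(6));   // 3, room freed by the commit
    }
}
```

In this model, the longer the interval between commits, the longer polling stays blocked once the limit is reached.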
Note that these two parameters trade off memory versus time:
When offset.commit.period.ms is set to a low value, the spout commits to Kafka more often. While the spout is committing to Kafka, it is neither fetching new records nor processing new tuples.
When max.uncommitted.offsets increases, the memory footprint increases. Each offset uses eight bytes of memory, which means that a value of 10000000 (10 million offsets) uses about 80MB of memory.
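The memory arithmetic above can be checked with a short sketch. It takes the eight-bytes-per-offset figure from the text at face value and does not model any additional JVM overhead; the class and method names are hypothetical.

```java
// Rough estimate of the memory used by pending offsets, using the
// eight-bytes-per-offset figure quoted in the text. Names are hypothetical.
final class OffsetMemoryEstimate {
    static final long BYTES_PER_OFFSET = 8L;

    // Returns the estimated bytes held for a given max.uncommitted.offsets.
    static long estimateBytes(long maxUncommittedOffsets) {
        return Math.multiplyExact(maxUncommittedOffsets, BYTES_PER_OFFSET);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(10_000_000L);
        System.out.println(bytes / 1_000_000L + " MB"); // prints "80 MB"
    }
}
```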
It is possible to achieve good performance with a low commit period and a small memory footprint (a small value for max.uncommitted.offsets), as well as with a larger commit period and a larger memory footprint. However, you should avoid using large values for offset.commit.period.ms with a low value for max.uncommitted.offsets.
Kafka consumer configuration parameters can also affect KafkaSpout performance. The following Kafka parameters are likely to have the strongest impact:
The Kafka consumer parameter fetch.min.bytes specifies the minimum amount of data the server returns for a fetch request. If the minimum amount is not available, the request waits until the minimum amount accumulates before answering the request.
The Kafka consumer parameter fetch.max.wait.ms specifies the maximum amount of time the server waits before answering a fetch request when there is not sufficient data to satisfy fetch.min.bytes.
For HDP 2.5.0 clusters in production use, you should override the default values of KafkaSpout
Performance also depends on the structure of your Kafka cluster, the distribution of the data, and the availability of data to poll.
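As a minimal sketch, the two Kafka consumer parameters described above can be collected in a standard java.util.Properties object. The helper name and the numeric values are placeholders for illustration, not recommended settings, and how the properties are handed to the spout depends on the KafkaSpout version in use.

```java
import java.util.Properties;

// Collects the two fetch-related consumer settings discussed in the text.
// The class name, method name, and sample values are hypothetical.
final class ConsumerFetchSettings {
    static Properties fetchTuning(int fetchMinBytes, int fetchMaxWaitMs) {
        Properties props = new Properties();
        // Minimum amount of data the server returns for a fetch request.
        props.setProperty("fetch.min.bytes", Integer.toString(fetchMinBytes));
        // Longest the server waits when fetch.min.bytes is not yet available.
        props.setProperty("fetch.max.wait.ms", Integer.toString(fetchMaxWaitMs));
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values only; tune against your own cluster and load.
        Properties props = fetchTuning(1024, 500);
        System.out.println(props.getProperty("fetch.min.bytes"));   // prints "1024"
        System.out.println(props.getProperty("fetch.max.wait.ms")); // prints "500"
    }
}
```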
Log Level Performance Impact
Storm supports several logging levels, including Trace, Debug, Info, Warn, and Error. Trace-level logging has a significant impact on performance and should be avoided in production. The number of log messages is proportional to the number of records fetched from Kafka, so a large volume of messages is written when Trace-level logging is enabled.
Trace-level logging is most useful for debugging pre-production environments under mild load. For debugging, if necessary, you can throttle how many messages are polled from Kafka by setting the parameter to a low number that is larger than the largest single message stored in Kafka.
Debug-level logging has slightly less performance impact than Trace-level logging, but it still generates a lot of messages. This setting can be useful for assessing whether the Kafka spout is properly tuned.