Using Apache Storm to Move Data

Tuning KafkaSpout Performance

KafkaSpout provides two internal parameters to control performance:

  • offset.commit.period.ms specifies the period of time (in milliseconds) after which the spout commits to Kafka. To set this parameter, use the KafkaSpoutConfig set method setOffsetCommitPeriodMs.

  • max.uncommitted.offsets defines the maximum number of polled offsets (records) that can be pending commit before another poll can take place. When this limit is reached, no more offsets can be polled until the next successful commit brings the number of pending offsets below the threshold. To set this parameter, use the KafkaSpoutConfig set method setMaxUncommittedOffsets.

Note that these two parameters trade off memory versus time:

  • When offset.commit.period.ms is set to a low value, the spout commits to Kafka more often. While the spout is committing to Kafka, it is neither fetching new records nor processing new tuples.

  • When max.uncommitted.offsets increases, the memory footprint increases. Each offset uses eight bytes of memory, so a value of 10000000 (ten million) uses about 80 MB of memory.

It is possible to achieve good performance with a low commit period and small memory footprint (a small value for max.uncommitted.offsets), as well as with a larger commit period and larger memory footprint. However, you should avoid using large values for offset.commit.period.ms with a low value for max.uncommitted.offsets.
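
For illustration, the following sketch shows how these two parameters might be set through the KafkaSpoutConfig builder. The broker address, topic name, and values are placeholders, and the static builder method shown here is the one provided by newer storm-kafka-client releases; older releases construct the Builder differently.

  import org.apache.storm.kafka.spout.KafkaSpout;
  import org.apache.storm.kafka.spout.KafkaSpoutConfig;

  // Placeholder broker and topic; tune the values to match your workload.
  KafkaSpoutConfig<String, String> spoutConfig =
      KafkaSpoutConfig.builder("kafkabroker:9092", "mytopic")
          // Commit processed offsets to Kafka every 10 seconds.
          .setOffsetCommitPeriodMs(10000)
          // Allow at most one million polled but uncommitted offsets (about 8 MB).
          .setMaxUncommittedOffsets(1000000)
          .build();

  KafkaSpout<String, String> kafkaSpout = new KafkaSpout<>(spoutConfig);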

Kafka consumer configuration parameters can also affect KafkaSpout performance. The following parameters are likely to have the strongest impact (a configuration sketch follows this list):

  • The Kafka Consumer poll timeout specifies the time (in milliseconds) spent polling if data is not available. To set this parameter, use the KafkaSpoutConfig set method setPollTimeoutMs.

  • Kafka consumer parameter fetch.min.bytes specifies the minimum amount of data the server returns for a fetch request. If this amount of data is not available, the server waits for it to accumulate before answering the request.

  • Kafka consumer parameter fetch.max.wait.ms specifies the maximum amount of time the server will wait before answering a fetch request, when there is not sufficient data to satisfy fetch.min.bytes.
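
As a sketch, the poll timeout can be set directly on the builder, while fetch.min.bytes and fetch.max.wait.ms are ordinary Kafka consumer properties. The setProp call used here is an assumption based on newer storm-kafka-client releases; in older releases these properties go into the Kafka consumer properties map handed to the Builder. The values shown are placeholders, not recommendations.

  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import org.apache.storm.kafka.spout.KafkaSpoutConfig;

  KafkaSpoutConfig<String, String> spoutConfig =
      KafkaSpoutConfig.builder("kafkabroker:9092", "mytopic")
          // Spend at most 200 ms in each poll when no records are available.
          .setPollTimeoutMs(200)
          // Ask the broker to wait for at least 1 KB of data per fetch request...
          .setProp(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024)
          // ...but to wait no longer than 500 ms before answering anyway.
          .setProp(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500)
          .build();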

Important

For HDP 2.5.0 clusters in production use, you should override the default values of the KafkaSpout parameters offset.commit.period.ms and max.uncommitted.offsets, and the Kafka consumer poll timeout (poll.timeout.ms), as follows (see the configuration sketch after this list):

  • Set poll.timeout.ms to 200.

  • Set offset.commit.period.ms to 30000 (30 seconds).

  • Set max.uncommitted.offsets to 10000000 (ten million).
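
Taken together, a production configuration following these recommendations could look like the sketch below (imports, broker, and topic as in the earlier sketches):

  KafkaSpoutConfig<String, String> spoutConfig =
      KafkaSpoutConfig.builder("kafkabroker:9092", "mytopic")
          // Kafka consumer poll timeout: 200 ms.
          .setPollTimeoutMs(200)
          // Commit offsets to Kafka every 30 seconds.
          .setOffsetCommitPeriodMs(30000)
          // Allow up to ten million uncommitted offsets (about 80 MB of memory).
          .setMaxUncommittedOffsets(10000000)
          .build();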

Performance also depends on the structure of your Kafka cluster, the distribution of the data, and the availability of data to poll.

Log Level Performance Impact

Storm supports several logging levels, including Trace, Debug, Info, Warn, and Error. Trace-level logging has a significant impact on performance and should be avoided in production. The number of log messages is proportional to the number of records fetched from Kafka, so a large volume of messages is printed when Trace-level logging is enabled.

Trace-level logging is most useful for debugging pre-production environments under mild load. For such debugging, you can throttle how many messages are polled from Kafka by setting the Kafka consumer parameter max.partition.fetch.bytes to a low number that is larger than the largest single message stored in Kafka.
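
For example, if the largest message in the topic is known to be well under 64 KB, a consumer property along the following lines limits how much data each poll can return per partition. The 65536 value and the setProp mechanism are assumptions to adapt to your environment and storm-kafka-client version:

  // Hypothetical throttle for Trace-level debugging: fetch at most 64 KB
  // per partition per poll (must be larger than the largest single message).
  KafkaSpoutConfig<String, String> debugConfig =
      KafkaSpoutConfig.builder("kafkabroker:9092", "mytopic")
          .setProp(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 65536)
          .build();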

Debug-level logging has slightly less performance impact than Trace-level logging, but it still generates a large number of messages. This setting can be useful for assessing whether the Kafka spout is properly tuned.

For general information about Apache Storm logging features, see Monitoring and Debugging an Apache Storm Topology.