Accessing Cloud Data

S3 Performance Checklist

Use this checklist to ensure optimal performance when working with data in S3.

Checklist for Data

  • [ ] The Amazon S3 bucket is in the same region as the EC2-hosted cluster.

  • [ ] The directory layout is "shallow". Directory listing is faster on shallow directory trees with many files per directory than on deep trees with only a few files per directory.

  • [ ] The "pseudo" block size set in fs.s3a.block.size is appropriate for the work to be performed on the data.

  • [ ] Any data that needs to be read repeatedly has been copied to HDFS.
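
S3 has no real blocks, so fs.s3a.block.size only controls the block size the S3A client reports, which in turn drives input-split sizing: larger values mean fewer, bigger splits for large sequential scans, while smaller values mean more splits and more parallelism. A minimal Java sketch, with a placeholder bucket and path, showing the property set programmatically (in practice it is usually set in core-site.xml):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pseudo block size reported by S3A; determines input-split sizing.
        conf.setLong("fs.s3a.block.size", 128L * 1024 * 1024); // 128 MB

        Path path = new Path("s3a://example-bucket/datasets/events.csv"); // placeholder
        FileSystem fs = path.getFileSystem(conf);
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Reported block size: " + status.getBlockSize());
      }
    }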

Checklist for Cluster Configs

  • [ ] Set yarn.scheduler.capacity.node-locality-delay to 0 to improve container launch times.

  • [ ] When copying data using DistCp, the copy options are tuned for the object store (see the sketch after this checklist).

  • [ ] If planning to use Hive with S3, review Improving Hive Performance with Cloud Object Stores.

  • [ ] If planning to use Spark with S3, review Improving Spark Performance with Cloud Object Stores.
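
One way to apply DistCp tuning programmatically, as a sketch only: it assumes the Hadoop 2.x DistCpOptions constructor (Hadoop 3 replaced it with a builder), and the paths, map count, and multipart size are illustrative placeholders. The same tuning is normally applied through hadoop distcp command-line options.

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class TunedCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Larger multipart uploads mean fewer PUT-part requests per big file.
        conf.setLong("fs.s3a.multipart.size", 128L * 1024 * 1024);

        // Hadoop 2.x API: list of sources plus a target path (placeholders).
        DistCpOptions options = new DistCpOptions(
            Collections.singletonList(new Path("hdfs://namenode:8020/data/logs")),
            new Path("s3a://example-bucket/data/logs"));
        options.setMaxMaps(64); // more maps -> more files copied in parallel

        // Runs the copy as a MapReduce job and waits for completion.
        new DistCp(conf, options).execute();
      }
    }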

Checklist for Code

  • [ ] Application avoids rename() calls. Where it does use them, it does not assume the operation is immediate.

  • [ ] Application does not assume that delete() is near-instantaneous.

  • [ ] Application uses FileSystem.listFiles(path, recursive=true) to list a directory tree (see the first sketch after this checklist).

  • [ ] Application prefers forward seeks through files, rather than full random IO.

  • [ ] If making "random" IO through seek() and read() sequences or and Hadoop's PositionedReadable API, fs.s3a.experimental.input.fadvise is set to random. Learn more
