Accessing Cloud Data

Accelerating S3 Read Performance

The most effective way to read a large file from S3 is in a single HTTPS request, reading in all data from the beginning to the end. This is exactly the read pattern used when the source data is a CSV file or files compressed with GZIP.

ORC and Parquet files benefit from random IO: readers fetch footer data, seek backwards, and skip forwards to minimize the amount of data read. This IO pattern is highly efficient for HDFS, but on object stores each seek can mean breaking one HTTP connection and making a new one, which makes the pattern very expensive.

By default, as soon as an application makes a backwards seek() in a file, the S3A connector switches into “random” IO mode, where instead of trying to read the entire file, only the amount configured in fs.s3a.readahead.range is read in. The first HTTP connection may have to be broken, but from then on reading ORC/Parquet data is efficient.
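The switch described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the S3A connector's code: the class ToyS3AStream, its fields, the 64 KiB readahead value, and the rule that a random-mode request covers the larger of the read length and the readahead range are all hypothetical choices made for illustration. It models a reader that fetches a footer, seeks backwards, and then reads a block near the start of a 100 MiB file, and records the byte range of each GET request it would issue.

```python
from dataclasses import dataclass, field


@dataclass
class ToyS3AStream:
    """Toy model of the seek policy described above; NOT the real S3A code."""
    file_size: int
    readahead_range: int = 64 * 1024   # assumed stand-in for fs.s3a.readahead.range
    mode: str = "sequential"
    pos: int = 0                       # position the application will read next
    conn_pos: int = 0                  # next byte the open connection would deliver
    conn_end: int = -1                 # exclusive end of the open GET range; -1 = none
    requests: list = field(default_factory=list)  # (start, end) of every GET issued

    def seek(self, target: int) -> None:
        # A backwards seek is the trigger for switching to random IO.
        if target < self.pos and self.mode == "sequential":
            self.mode = "random"
        self.pos = target

    def read(self, length: int) -> int:
        length = min(length, self.file_size - self.pos)
        if length <= 0:
            return 0
        end = self.pos + length
        # Simplification: reuse the connection only if it is positioned exactly
        # at pos and its range covers the read; otherwise issue a new GET.
        reusable = self.conn_pos == self.pos and self.conn_end >= end
        if not reusable:
            if self.mode == "sequential":
                req_end = self.file_size          # ask for the rest of the file
            else:
                # Assumption: request max(read length, readahead range).
                req_end = min(self.file_size,
                              self.pos + max(length, self.readahead_range))
            self.requests.append((self.pos, req_end))
            self.conn_end = req_end
        self.conn_pos = end
        self.pos = end
        return length


size = 100 * 1024 * 1024
stream = ToyS3AStream(file_size=size)

stream.seek(size - 8)        # forward seek to the footer length: still sequential
stream.read(8)
stream.seek(size - 4096)     # backwards seek: switches to random mode
stream.read(4096)
stream.seek(1024 * 1024)     # jump to a block near the start of the file
stream.read(16 * 1024)
stream.read(16 * 1024)       # served from the same bounded request

for start, end in stream.requests:
    print(f"GET bytes={start}-{end - 1} ({end - start} bytes)")
print("mode:", stream.mode)
```

In this model the read at the 1 MiB offset requests only a readahead-sized range. Had the stream stayed in sequential mode, the same read would have requested everything from that offset to the end of the file, and the next seek would have had to abandon that connection.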

See Optimizing S3A read performance for different file types.