Accessing Cloud Data

Putting it All Together: spark-defaults.conf

Combining the performance settings for ORC and Parquet input produces the following set of options, which you can set in the spark-defaults.conf file for Spark applications:

spark.hadoop.fs.s3a.experimental.input.policy random
spark.sql.orc.filterPushdown true
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema false
spark.sql.parquet.filterPushdown true
spark.sql.hive.metastorePartitionPruning true

When working with S3, the S3A Directory committer (committer name: directory) should be enabled for both performance and safety.
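The committer settings themselves are not shown in this section. As a hedged sketch, the fragment below uses the property names given in the Apache Spark cloud-integration documentation. It assumes that Spark's optional cloud-integration module (spark-hadoop-cloud), which provides the two org.apache.spark.internal.io.cloud classes, is on the classpath. Verify the names against your distribution before relying on them.

```
# Assumption: property names taken from the Apache Spark cloud-integration
# documentation, not from this section. Requires the spark-hadoop-cloud module.

# Select the S3A "directory" staging committer.
spark.hadoop.fs.s3a.committer.name directory

# Route Spark SQL output commits through the Hadoop PathOutputCommitter API.
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol

# Let Parquet output use the committer selected above.
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

These lines would be appended to the same spark-defaults.conf file as the ORC and Parquet options above.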