Accessing Cloud Data

Improving ORC and Parquet Read Performance

Minimize Read and Write Operations for ORC

For optimal performance when reading files saved in the ORC format, read and write operations must be minimized. To achieve this, set the following options:

spark.sql.orc.filterPushdown true
spark.sql.hive.metastorePartitionPruning true

The spark.sql.orc.filterPushdown option enables the ORC library to skip unneeded columns and to use index information to skip reading parts of the file where it can be determined that no rows match the predicate.

With the spark.sql.hive.metastorePartitionPruning option enabled, predicates are pushed down into the Hive metastore to eliminate unmatched partitions.
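These options can also be supplied per session instead of being edited into application code. A minimal sketch using spark-shell (assuming a standard Spark installation on the path):

```shell
# Launch an interactive shell with ORC read optimizations enabled.
spark-shell \
  --conf spark.sql.orc.filterPushdown=true \
  --conf spark.sql.hive.metastorePartitionPruning=true
```

The same `--conf` pairs work with spark-submit, or they can be added to spark-defaults.conf to apply cluster-wide.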

Minimize Read and Write Operations for Parquet

For optimal performance when reading files saved in the Parquet format, read and write operations must be minimized. This includes disabling generation of summary metadata and disabling schema merging across multiple files. In addition, the predicate pushdown option enables the Parquet library to skip reading parts of the file that cannot match the predicate, saving bandwidth. To achieve this, set the following options:

spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema false
spark.sql.parquet.filterPushdown true
spark.sql.hive.metastorePartitionPruning true
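As with the ORC settings, these can be passed on the command line at job submission time. A sketch using spark-submit (the application name `my_app.py` is illustrative):

```shell
# Submit a job with Parquet read optimizations enabled.
spark-submit \
  --conf spark.hadoop.parquet.enable.summary-metadata=false \
  --conf spark.sql.parquet.mergeSchema=false \
  --conf spark.sql.parquet.filterPushdown=true \
  --conf spark.sql.hive.metastorePartitionPruning=true \
  my_app.py
```

Settings passed this way take effect for the whole application; individual `spark.sql.*` options can also be changed at runtime with `spark.conf.set(...)` if a query needs different behavior.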