Accessing Cloud Data

Improving Hive Performance with Cloud Object Stores

Tune the following parameters to improve Hive performance when working with cloud object stores.

Table 7.1. Improving General Performance

Parameter                                     Recommended Setting
yarn.scheduler.capacity.node-locality-delay   Set this to "0".
hive.warehouse.subdir.inherit.perms           Set this to "false" to reduce the number of file permission checks.
hive.metastore.pre.event.listeners            Set this to an empty value to reduce the number of directory permission checks.

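You can set the Hive parameters in hive-site.xml. The yarn.scheduler.capacity.node-locality-delay parameter belongs to the YARN Capacity Scheduler and is normally set in capacity-scheduler.xml.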
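
As a quick way to check an existing configuration against Table 7.1, the following minimal sketch parses the text of a hive-site.xml file and reports the Hive properties that differ from the recommended values. The function names (read_properties, find_mismatches) and the sample XML are hypothetical and only for illustration. The YARN scheduler property is left out because it is typically managed with the Capacity Scheduler configuration rather than with Hive.

```python
import xml.etree.ElementTree as ET

# Recommended values for the Hive properties in Table 7.1.
RECOMMENDED = {
    "hive.warehouse.subdir.inherit.perms": "false",
    "hive.metastore.pre.event.listeners": "",  # empty value
}


def read_properties(xml_text):
    """Parse Hadoop-style <property><name/><value/></property> entries into a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is None:
            continue
        value = prop.findtext("value")
        # A missing or empty <value> element is treated as an empty string.
        props[name.strip()] = (value or "").strip()
    return props


def find_mismatches(xml_text, recommended=RECOMMENDED):
    """Return {name: (current, recommended)} for settings that differ.

    A current value of None means the property is not set in this file.
    """
    current = read_properties(xml_text)
    return {
        name: (current.get(name), want)
        for name, want in recommended.items()
        if current.get(name) != want
    }


# Hypothetical sample input, not taken from a real cluster.
SAMPLE = """<configuration>
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for name, (have, want) in find_mismatches(SAMPLE).items():
        print(f"{name}: current={have!r}, recommended={want!r}")
```

A property reported with a current value of None is simply absent from the file, so its effective value depends on the defaults of your distribution.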

Table 7.2. Accelerating ORC Reads in Hive

Parameter                             Recommended Setting
hive.orc.compute.splits.num.threads   If you use the ORC format and want to improve split computation time, set this parameter to match the number of available processors. The default is 10. This parameter controls the number of parallel threads used to compute splits. For Parquet, split computation is still single-threaded, so it can take longer with Parquet and cloud object stores.
hive.orc.splits.include.file.footer   If you use the ORC format with the ETL file split strategy, set this parameter to "true" to use existing file footer information in the split payload.

You can set these parameters using the --hiveconf option in the Hive CLI or the set command in Beeline.
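
The settings in Table 7.2 can be assembled programmatically, for example when a wrapper script launches Hive. The following is a minimal sketch under stated assumptions: the helper names (orc_read_settings, hiveconf_args, beeline_set_statements) are hypothetical, and the processor count comes from the machine that runs the script, which may not be the machine that computes splits.

```python
import os


def orc_read_settings(num_threads=None):
    """Return the Table 7.2 settings as a name-to-value dict."""
    if num_threads is None:
        # Assumption: use the local processor count; fall back to the
        # documented default of 10 if it cannot be determined.
        num_threads = os.cpu_count() or 10
    return {
        "hive.orc.compute.splits.num.threads": str(num_threads),
        "hive.orc.splits.include.file.footer": "true",
    }


def hiveconf_args(settings):
    """Build '--hiveconf name=value' argument pairs for the Hive CLI."""
    args = []
    for name, value in settings.items():
        args += ["--hiveconf", f"{name}={value}"]
    return args


def beeline_set_statements(settings):
    """Build 'set name=value;' statements for a Beeline session."""
    return [f"set {name}={value};" for name, value in settings.items()]


if __name__ == "__main__":
    settings = orc_read_settings()
    print(" ".join(["hive"] + hiveconf_args(settings)))
    print("\n".join(beeline_set_statements(settings)))
```

Passing an explicit thread count, for example orc_read_settings(16), avoids depending on the local processor count.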

Table 7.3. Accelerating ETL Jobs

Parameter                          Recommended Setting
hive.metastore.fshandler.threads   Query launches can be slightly slower when no statistics are available or when hive.stats.fetch.partition.stats=false. In such cases, Hive checks the file size of every file that it tries to access. Tuning hive.metastore.fshandler.threads helps reduce the overall time taken for this metastore operation.
fs.trash.interval                  Dropping a table can be slow in object stores such as S3 because the action involves moving files to trash (a copy + delete). To avoid this, set fs.trash.interval=0 to skip trash completely.

You can set these parameters using the --hiveconf option in the Hive CLI or the set command in Beeline.

Accelerating Inserts in Hive

When inserting data, Hive moves the data from a temporary folder to its final location. On object stores such as S3, this move is actually a copy followed by a delete, which can be expensive: the more data written to the object store, the more expensive the operation.

To accelerate the process, tune hive.mv.files.thread, which sets the number of threads used to move files, according to the size of your dataset. The default is 15. You can set it in hive-site.xml.
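
To illustrate why the number of mover threads matters, the following toy model treats a move as a copy followed by a delete for each file and runs those per-file moves in a thread pool. It is only an in-memory sketch, not Hive's implementation: ToyObjectStore, move_object, and move_prefix are hypothetical names, and the default of 15 workers simply mirrors the documented default of hive.mv.files.thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyObjectStore:
    """In-memory stand-in for an object store that has no native rename."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def put(self, key, data):
        with self._lock:
            self._objects[key] = data

    def copy(self, src, dst):
        with self._lock:
            self._objects[dst] = self._objects[src]

    def delete(self, key):
        with self._lock:
            del self._objects[key]

    def keys(self):
        with self._lock:
            return sorted(self._objects)


def move_object(store, src, dst):
    """A 'rename' on this store is a copy followed by a delete."""
    store.copy(src, dst)
    store.delete(src)


def move_prefix(store, src_prefix, dst_prefix, threads=15):
    """Move every object under src_prefix to dst_prefix using a thread pool."""
    sources = [k for k in store.keys() if k.startswith(src_prefix)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(move_object, store, src, dst_prefix + src[len(src_prefix):])
            for src in sources
        ]
        for future in futures:
            future.result()  # re-raise any error from a worker
    return len(sources)


if __name__ == "__main__":
    store = ToyObjectStore()
    for i in range(100):
        store.put(f"tmp/part-{i:05d}", b"rows")
    moved = move_prefix(store, "tmp/", "warehouse/sales/")
    print(f"moved {moved} objects")
```

In a real object store each copy and delete is a network call, so the per-file moves overlap when run on several threads. The toy model shows only the structure of the work, not the timing.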