Understanding and Administering Hive Compactions

Hive stores table data in base files, which cannot be updated in place because HDFS does not support in-place file modification. Instead, Hive creates a set of delta files for each transaction that alters a table or partition and stores them in a separate delta directory. Periodically, Hive compacts, or merges, the base and delta files. Hive performs all compactions in the background without blocking concurrent reads and writes by Hive clients. There are two types of compaction:

Table 2.3. Hive Compaction Types

Compaction Type	Description
Minor	Rewrites a set of delta files to a single delta file for a bucket.
Major	Rewrites one or more delta files and the base file as a new base file for a bucket.

Once transactions and the compactor are enabled, Hive automatically compacts delta and base files at regular intervals. Hadoop administrators can enable and tune automatic compactions, and can also perform manual compactions of base and delta files. The following configuration parameters in hive-site.xml control this behavior.

Table 2.4. Hive Transaction Configuration Parameters

Configuration Parameter	Description
`hive.txn.manager`	Specifies the class name of the transaction manager used by Hive. Set this property to `org.apache.hadoop.hive.ql.lockmgr.DbTxnManager` to enable transactions. The default value is `org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager`, which disables transactions.
`hive.compactor.initiator.on`	Specifies whether to run the initiator and cleaner threads on this Metastore instance. The default value is `false`. This property must be set to `true` on exactly one instance of the Hive Metastore service.
`hive.compactor.worker.threads`	Specifies the number of worker threads to run on this Metastore instance. The default value is 0; set it to a value greater than 0 to enable compactions. Worker threads launch MapReduce jobs to do compactions. Increasing the number of worker threads decreases the time required to compact tables after they cross a threshold that triggers compactions. However, increasing the number of worker threads also increases the background load on a Hadoop cluster.
`hive.compactor.worker.timeout`	Specifies the time period, in seconds, after which a compaction job is marked as failed and re-queued. The default value is 86400 seconds, or 24 hours.
`hive.compactor.check.interval`	Specifies the time period, in seconds, between checks to see if any partitions require compacting. The default value is 300 seconds. Decreasing this value reduces the time required to start a compaction for a table or partition. However, it also increases the background load on the NameNode since each check requires several calls to the NameNode.
`hive.compactor.delta.num.threshold`	Specifies the number of delta directories in a partition that triggers an automatic minor compaction. The default value is 10.
`hive.compactor.delta.pct.threshold`	Specifies the size of the delta files, as a fraction of the size of the corresponding base files, that triggers an automatic major compaction. The default value is 0.1, which is 10 percent.
`hive.compactor.abortedtxn.threshold`	Specifies the number of aborted transactions on a single partition that trigger an automatic major compaction.
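
As an illustration of the parameters in Table 2.4, the following hive-site.xml fragment is a minimal sketch of how transactions and compaction might be enabled on a single Metastore instance. The worker thread count of 2 is an assumed value chosen only for the example; the remaining values are the defaults listed in the table, shown explicitly so they are easy to adjust.

```xml
<configuration>
  <!-- Enable transactions by selecting the DbTxnManager class. -->
  <property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
  </property>

  <!-- Run the initiator and cleaner threads here.
       Set to true on exactly one Metastore instance. -->
  <property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
  </property>

  <!-- Must be greater than 0 to enable compactions.
       The value 2 is an assumption for this example. -->
  <property>
    <name>hive.compactor.worker.threads</name>
    <value>2</value>
  </property>

  <!-- Seconds between checks for partitions that need compacting (default). -->
  <property>
    <name>hive.compactor.check.interval</name>
    <value>300</value>
  </property>

  <!-- Number of delta directories that triggers a minor compaction (default). -->
  <property>
    <name>hive.compactor.delta.num.threshold</name>
    <value>10</value>
  </property>

  <!-- Delta-to-base size ratio that triggers a major compaction (default). -->
  <property>
    <name>hive.compactor.delta.pct.threshold</name>
    <value>0.1</value>
  </property>
</configuration>
```

If hive-site.xml already contains a `<configuration>` element, only the `<property>` entries would be merged into it. A higher worker thread count shortens the wait for compactions at the cost of more background load on the cluster, so the right value depends on the workload.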
