Statistics replication

Basic statistics such as the number of rows of a table or partition and the column statistics such as histograms (min, max, count) of a particular interesting column are important in many ways.

One of the key use cases of statistics is query optimization.

Hive supports Cost Based Optimizer (CBO) which primarily depends on column statistics to optimize the query execution plan. For example, if the min and max values of a integer type column “c” in a partition “p” is min=10 and max=50, then a query with predicate such as c < 10 or c > 50 or anything that does not fall under the range, shall avoid scanning the partition “p”. For transactional tables, column statistics are supported only if hive.txn.stats.enabled=true.

Currently, there are two modes to compute statistics:

ANALYZE command: analyze table t [partition p] compute statistics for [columns c,...];
Hive automatically computes the statistics whenhive.stats.autogather=true for basic statistics andhive.stats.column.autogather=truefor column statistics.

As statistics are key for query optimization, when table/partition data is replicated to target cluster, it is important to replicate the statistics as well which would speed up the queries running on target clusters. One of the key requirement for statistics replication to work accurately is it’s consistency with current dataset in a table or partition. Hive replication can achieve it with point-in time consistent incremental replication model. If any database/table is bootstrapped, then corresponding basic and columns statistics (if present) would also be bootstrapped.

As Hive replication takes care of replicating statistics to target cluster, it is recommended to disable hive.stats.autogatherandhive.stats.column.autogather at the target cluster. If it is enabled, it would cause additional computation cost in the target cluster.

	Important
	For existing policies, any statistics gathered after deploying DLM 1.4.0 will be replicated to the target cluster. For any new policies created using DLM 1.4.0 release onwards, statistics is replicated to the target cluster.