Basic statistics such as the number of rows of a table or partition and the column statistics such as histograms (min, max, count) of a particular interesting column are important in many ways.
One of the key use cases of statistics is query optimization.
Hive supports Cost Based Optimizer (CBO) which primarily depends on column
statistics to optimize the query execution plan. For example, if the min and max values
of a integer type column “c” in a partition “p” is min=10 and max=50, then
a query with predicate such as c < 10 or c > 50 or anything that does
not fall under the range, shall avoid scanning the partition “p”. For
transactional tables, column statistics are supported only if
Currently, there are two modes to compute statistics:
- ANALYZE command:
analyze table t [partition p] compute statistics for [columns c,...];
- Hive automatically computes the statistics
hive.stats.autogather=truefor basic statistics and
hive.stats.column.autogather=truefor column statistics.
As statistics are key for query optimization, when table/partition data is replicated to target cluster, it is important to replicate the statistics as well which would speed up the queries running on target clusters. One of the key requirement for statistics replication to work accurately is it’s consistency with current dataset in a table or partition. Hive replication can achieve it with point-in time consistent incremental replication model. If any database/table is bootstrapped, then corresponding basic and columns statistics (if present) would also be bootstrapped.
As Hive replication takes care of replicating statistics to target cluster, it
is recommended to disable
at the target cluster. If it is enabled, it would cause additional computation cost in
the target cluster.
For existing policies, any statistics gathered after deploying DLM 1.4.0 will be replicated to the target cluster. For any new policies created using DLM 1.4.0 release onwards, statistics is replicated to the target cluster.