Apache Hive Performance Tuning
Also available as:
PDF

Improving performance using partitions

Using partitions can significantly improve performance. You can design Hive table and materialized views partitions to map to physical directories on the file system. For example, a table partitioned by date-time can organize data loaded into Hive each day.

Large deployments can have tens of thousands of partitions. Partition pruning occurs indirectly when Hive discovers the partition key during query processing. For example, after joining with a dimension table, the partition key might come from the dimension table. A query filters columns by partition, pruning partition scanning to one or a few matching partitions. Partition pruning occurs directly when a partition key is present in the WHERE clause. Partitioned columns are virtual, not written into the main table because these columns are the same for the entire partition. In a SQL query, you define the partition as shown in the following example:

CREATE TABLE sale(id in, amount decimal)
PARTITIONED BY (xdate string, state string);

To insert data into this table, you specify the partition key for fast loading:

INSERT INTO sale (xdate='2016-03-08', state='CA')
SELECT * FROM staging_table
WHERE xdate='2016-03-08' AND state='CA';

You do not need to specify dynamic partition columns. Hive generates a partition specification if you enable dynamic partitions.

hive-site.xml settings for loading 1 to 9 partitions:
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;

For bulk-loading data into partitioned ORC tables, you use the following property, which optimizes the performance of data loading into 10 or more partitions.

hive-site.xml setting for loading 10 or more partitions:
hive.optimize.sort.dynamic.partition=true

Examples of HiveQL query on partitioned data

INSERT INTO sale (xdate, state)
SELECT * FROM staging_table;

Follow these best practices when you partition tables and query partitioned tables:

  • Never partition on a unique ID.
  • Size partitions so that on average they are greater than or equal to 1 GB.
  • Formulate a query so that it does not process more than 1000 partitions.