Handling bucketed tables
If you migrated data from earlier Apache Hive versions to Hive 3, you might need to handle bucketed tables that impact performance.
- As files in the directory for the table.
- As directories of partitions if the table is partitioned.
A common challenge related to using buckets is maintaining query performance while the workload or data scales up or down. For example, you could have an environment that operates smoothly using 16 buckets to support 1000 users, but a spike in the number of users to 100,000 for a day or two creates problems if you do not promptly tune the buckets and partitions. Tuning the buckets is complicated by the fact that after you have constructed a table with buckets, the entire table containing the bucketed data must be reloaded to reduce, add, or eliminate buckets.
With Tez, you only need to deal with the buckets of the biggest table. If workload demands change rapidly, the buckets of the smaller tables dynamically change to complete table JOINs.
You perform the following tasks related to buckets:
- Setting hive-site.xml to enable buckets
- Bulk-loading tables that are both partitioned and
When you load data into tables that are both partitioned and bucketed, set the following property to optimize the process:
If you have 20 buckets on user_id data, the following query returns only the data associated
with user_id = 1:
SELECT * FROM tab WHERE user_id = 1;
To best leverage the dynamic capability of table buckets on Tez, adopt the following practices:
- Use a single key for the buckets of the largest table.
- Usually, you need to bucket the main table by the biggest dimension table. For example, the sales table might be bucketed by customer and not by merchandise item or store. However, in this scenario, the sales table is sorted by item and store.
- Normally, do not bucket and sort on the same column.
A table that has more bucket files than the number of rows is an indication that you should reconsider how the table is bucketed.