Chapter 1. Optimizing an Apache Hive Data Warehouse

Using a Hive-based data warehouse requires setting up the appropriate environment for your needs. After you establish the computing paradigm and architecture, you can tune the data warehouse infrastructure, interdependent components, and your client connection parameters to improve the performance and relevance of business intelligence (BI) and other data-analytic applications.

Tuning Hive and other Apache components that run in the background to support processing of HiveQL is particularly important as the scale of your workload and database volume increases. When your applications query data sets that constitute a large-scale enterprise data warehouse (EDW), tuning the environment and optimizing Hive queries are often part of an ongoing effort by IT or DevOps teams to ensure service-level agreement (SLA) benchmarks or other performance expectations.

Increasingly, most enterprises require that Hive queries run against the data warehouse with low-latency analytical processing, which is often referred to as LLAP by Hortonworks. LLAP of real-time data can be further enhanced by integrating the EDW with the Druid business intelligence engine.

	Tip
	The best approach is to use Apache Ambari to configure and monitor applications and queries that run on a Hive data warehouse. These tips are described throughout this guide.

Hive Processing Environments

The environment that you use to process queries and return results can depend on one or more factors, such as the capacity of your system resources, how in-depth you want to analyze data, how quickly you want queries to return results, or what tradeoffs that you can accept to favor one model over another.

Overall Architecture

A brief overview of the components and architecture of systems using Hive EDW for data processing is in the Hive Architectural Overview of HDP 2.5. With a few exceptions, the architecture information there applies to both batch processing and LLAP of Hive queries. However, there are some differences in the way the components of an environment processing batch workloads operate from the functioning of the same components in a Hive LLAP environment.

Dependencies for Optimal Hive Query Processing

Increasingly, enterprises want to run SQL workloads that return faster results than batch processing can provide. Hortonworks Data Platform (HDP) supports Hive LLAP, which enables application development and IT infrastructure to run queries that return real-time or near-real-time results. Use cases for implementing this technology include environments where users of business intelligence (BI) tools or web dashboards need to accelerate analysis of data stored in a Hive EDW.

A performance benchmark that enterprises increasingly want to reach with data analytics applications is support for interactive queries. Interactive queries are queries on Hive data sets that meet low-latency benchmarks that are variably gauged but for Hive LLAP in HDP is specified as 15 seconds or less.

	Important
	Hive LLAP with Apache Tez utilizes newer technology available in Hive 2.x to be an increasingly needed alternative to other execution engines like MapReduce and earlier implementations of Hive on Tez. Tez runs in conjunction with Hive LLAP to form a newer execution engine architecture that can support faster queries.

	Important
	The Hive LLAP with Tez engine requires a different Apache Hadoop YARN configuration from the configuration required for environments where Hive on Tez is the execution engine. With Ambari 2.5.0 and later versions, you can more easily enable and configure YARN components that are the foundation of Hive LLAP than you could in previous HDP releases.

Connectivity to Admission Control (HiveServer2)

HiveServer2 is a service that enables multiple clients to simultaneously execute queries against Hive using an open API driver, such as JDBC or ODBC.

For optimal performance, use HiveServer2 as the connectivity service between your client application and the Hive EDW. HiveServer1 is deprecated because HiveServer2 has improvements for multiclient concurrency and authentication. Also, HiveServer2 is designed to provide better support for open API clients like JDBC and ODBC.

HiveServer2 is one of several architectural components for admission control, which enables optimal Hive performance when multiple user sessions generate asynchronous threads simultaneously. Admission control operates by scaling the Hive processing of concurrent queries to a workload that is suited to the system resources and to the total demand of incoming threads, while holding the other queries for later processing or cancelling the queries if conditions warrant this action. Admission control is akin to “connection pooling” in RDBMS databases.

To optimize Hive performance, you must set parameters that affect admission control according to your needs and system resources.

	Important
	HiveServer2 coordinates admission control in conjunction with YARN and Apache Tez for batch queries and with YARN and the LLAP daemons for interactive queries.

Execution Engines (Apache Tez and Hive LLAP)

Both the Hive on Tez engine for batch queries and the enhanced Tez + Hive LLAP engine run on YARN nodes.

Tez Execution on YARN

Hive on Tez is an advancement over earlier application frameworks for Hadoop data processing, such as using Hive on MapReduce2 or MapReduce1. The Tez framework is required for high-performance batch workloads. Tez is also part of the execution engine for Hive LLAP.

After query compilation, HiveServer2 generates a Tez graph that is submitted to YARN. A Tez ApplicationMaster (AM) monitors the query while it is running.

The maximum number of queries that can be run concurrently is limited by the number of ApplicationMasters.

Hive LLAP Execution Engine

The architecture of Hive LLAP is illustrated in the following diagram.

HiveServer2: provides JDBC and ODBC interface, and query compilation
Query coordinators: coordinate the execution of a single query LLAP daemon: persistent server, typically one per node. This is the main differentiating component of the architecture, which enables faster query runtimes than earlier execution engines.
Query executors: threads running inside the LLAP daemon
In-memory cache: cache inside the LLAP daemon that is shared across all users

Workload Management with Queues and Containers (Hive, YARN, and Tez)

Batch Processing

Each queue must have the capacity to support one complete Tez Application, as defined by its ApplicationMaster (AM). Consequently, the maximum number of queries that can be run concurrently is also limited by the number of Apache Tez Application Masters.

A Hive-based analytic application relies on execution resources called YARN containers. Containers are defined by the Hive configuration. The number and longevity of containers that reside in your environment depend on whether you want to run with batch workloads or enable Hive LLAP in HDP.

Interactive Workloads

Interactive workloads operate with YARN and queues differently from the way that batch workloads manage workloads.

When using the Hive LLAP on Tez engine, Admission Control is handled differently than for earlier Hive on Tez implementations. Resources are managed by Hive LLAP globally, rather than each Tez session managing its own.

Hive LLAP has its own resource scheduling and pre-emption built in that doesn't rely on YARN. As a result, a single queue is needed to manage all LLAP resources. In addition, each LLAP daemon runs as a single YARN container.

SQL Planner and Optimizer (Apache Hive and Apache Calcite)

A cost-based optimizer (CBO) generates more efficient query plans. In Hive, the CBO is enabled by default, but it requires that column statistics be generated for tables. Column statistics can be expensive to compute so they are not automated. Hive has a CBO that is based on Apache Calcite and an older physical optimizer. All of the optimizations are being migrated to the CBO. The physical optimizer performs better with statistics, but the CBO requires statistics.

Storage Formats

Hive supports various file formats. You can write your own SerDes (Serializers, Deserializers) interface to support new file formats.

	Tip
	The Optimized Row Columnar (ORC) file format for data storage is recommended because this format provides the best Hive performance overall.

Storage Layer (Example: HDFS Filesystem)

While a Hive EDW can run on one of a variety of storage layers, HDFS and Amazon S3 are the most prevalently used and known filesystems for data analytics that run in the Hadoop stack. By far, the most common filesystem used for a public cloud infrastructure is Amazon S3.

A Hive EDW can store data on other filesystems, including WASB and ADLS.

Depending on your environment, you can tune the filesystem to optimize Hive performance by configuring compression format, stripe size, partitions, and buckets. Also, you can create bloom filters for columns frequently used in point lookups.

Setting up Hive LLAP

	Important
	Using Ambari 2.5.0+ to enable Hive LLAP and configure most of its basic parameters is highly recommended for most users. Ambari not only has a GUI to ease the tasks, but also contains multiple wizards that can automatically tune interactive query property settings to suit your environment.

While most of the Hive LLAP installation and configuration steps can be completed in Ambari, you must manually configure two properties in the yarn-site.xml file before sliding the Enable Interactive Query toggle to "Yes." Then there are two paths for enabling Hive LLAP using Ambari: Typical Setup and Advanced Setup. Typical Setup is recommended for most users because it requires less decision-making and leverages more autotuning features of Ambari than the Advanced Setup.

Enabling YARN Preemption for Hive LLAP

About this Task

You must enable and configure YARN preemption, which directs the Capacity Scheduler to position a Hive LLAP queue as the top-priority workload to run among cluster node resources. See YARN Preemption for more information about how YARN preemption functions.

Steps

In Ambari, select Services > YARN > Configs tab > Settings subtab.
Set the Pre-emption slider of the YARN Features section to Enabled:

Figure 1.1. YARN Features Pane
Click the Advanced subtab.
Open the Custom yarn-site drop-down menu to reveal a group of customizable YARN properties.
Find the two following properties:

Set the two properties listed in the following table, which also provides recommendations for how to determine what values to use for the settings.

Table 1.1. Manual Configuration of Custom yarn-site Properties for Enabling Hive LLAP

Property Name Recommended Setting

Property Name	Recommended Setting
`yarn.resourcemanager.monitor.` `capacity.preemption.natural_termination_factor`	`1`
`yarn.resourcemanager.monitor.capacity.` `preemption.total_preemption_per_round`	Calculate the value by dividing 1 by the number of cluster nodes. Enter the value as a decimal. Example: If your cluster has 20 nodes, then divide 1 by 20 and enter `0.05` as the value of this property setting.

yarn.resourcemanager.monitor.

capacity.preemption.natural_termination_factor

1

yarn.resourcemanager.monitor.capacity.

preemption.total_preemption_per_round

Calculate the value by dividing 1 by the number of cluster nodes. Enter the value as a decimal.

Example: If your cluster has 20 nodes, then divide 1 by 20 and enter 0.05 as the value of this property setting.

Click Save in the upper right area of the window.

Next Step

Complete either the Enable Hive LLAP: Typical Setup task or the Enable Hive LLAP: Advanced Setup in Ambari in the following sections.

Enable Hive LLAP: Typical Setup

About this Task

Follow this procedure if you are new to Hive LLAP or prefer to let autotuning features of Ambari configure interactive queries.

Prerequisites

Installation of Ambari 2.5.x
The Hive Service and other interdependencies as prompted in Ambari must be running.
YARN preemption must be enabled and configured as documented in the Enabling YARN Preemption for Hive LLAP section above.
If enabled, you must disable maintenance mode for the Hive service and target host for HiveServer Interactive (HSI); otherwise, enabling LLAP fails to install HSI. Alternatively, you need to install HiveServer Interactive on the Ambari server as follows:curl -u admin:<password> -H "X-Requested-By:ambari" -i -X POST http://host:8080/api/v1/clusters/<cluster_name>/hosts/<host_name>/host_components/HIVE_SERVER_INTERACTIVE

Steps

Select the Hive service in the Ambari dashboard.
Click the Configs tab.
In the Settings tab, locate the Interactive Query section and set the Enable Interactive Query slider to Yes.

Figure 1.2. Enable Interactive Query Toggle on the Settings Tab
In the Select HiverServer2 Interactive host window, use the HiveServer2 Interactive field to select the server to host the HiveServer2 Interactive instance. In most cases, you can keep the default server host assignment.

Figure 1.3. Select HiveServer2 Interactive Host Window
Select the server to host the HiveServer2 Interactive instance in the HiveServer2 Interactive field. In most cases, you can keep the default server host assignment.
Click Select in the Select HiverServer2 Interactive host window.
When the Settings tab opens again, review the additional configuration fields that appear in the Interactive Query section of the window:

Figure 1.4. Enabled Interactive Query Configuration Settings

Keep llap the Interactive Query Queue drop-down menu. This setting dedicates all the LLAP daemons and all the YARN ApplicationMasters of the system to the single, specified queue.
In the Number of nodes used by Hive LLAP field, enter the number of cluster nodes on which to run Hive LLAP. LLAP automatically deploys to the nodes, and you do not need to label the nodes.
In the Maximum Total Concurrent Queries field, move the slider to the maximum number of concurrent LLAP queries to run. The maximum number of concurrent queries always is the same as the number of active ApplicationMasters in the cluster.

Review the following settings, which are autogenerated for informational purposes only. (No interactive elements allow you to directly change the values.)

Memory per Daemon: YARN container size for each daemon (MB)

In-Memory Cache per Daemon: Size of the cache in each container (MB)

Number of executors per LLAP Daemon: The number of executors per daemon: for example, the number of fragments that can execute in parallel on a daemon

Review the property settings outside the Interactive Query section of the window to learn how the Hive LLAP instance is configured. The Ambari wizard calculates appropriate values for most other properties on the Settings tab, based on the configurations in the Interactive Query section of the window.

	Important
	When enabling Hive LLAP, the Run as end user instead of Hive user slider on the Settings tab has no effect on the Hive instance. If you set the slider to `True`, this property switches from Hive user to end user only when you run Hive in batch-processing mode.

Click the Save button near the top of the Ambari window.
Click Service Actions > Restart All.

Figure 1.5. Restart All in Top Right Corner of Ambari Window

Next Steps

Connect your clients to a dedicated HiveServer2 endpoint.

If query performance is too slow, see the following chapters of this guide.

	Tip
	Hive View 2.0 in Ambari integrates with the general availability release of Hive LLAP. If you plan to use Hive View 2.0 with a Hive LLAP instance, ensure that the Use Interactive Mode property of Manage Ambari Views is set to `true`. See Settings and Cluster Configuration of the Ambari Views Guide..

Enable Hive LLAP: Advanced Setup

About this Task

If you are a more advanced user of Hive LLAP and want to use a customized query queue rather than the default llap queue, then use the following procedure to enable interactive queries.

Prerequisites

Installation of Ambari 2.5.x
The Hive Service and other interdependencies as prompted in Ambari must be running.
Your customized interactive query queue must be set up. For more information, see the Capacity Scheduler chapter of the Hortonworks YARN Resource Management Guide.
Complete the tasks in the Queues for Hive LLAP Sites section.
YARN preemption must be enabled and configured as documented in the Enabling YARN Preemption for Hive LLAP section above.

Steps

Select the Hive service in the Ambari dashboard.
Click the Configs tab.
In the Settings tab, locate the Interactive Query section and set the Enable Interactive Query slider to Yes.

Figure 1.6. Enable Interactive Query Toggle on the Settings Tab
In the Select HiverServer2 Interactive host window, use the HiveServer2 Interactive field to select the server to host the HiveServer2 Interactive instance. In most cases, you can keep the default server host assignment.

Figure 1.7. Select HiveServer2 Interactive Host Window
Select the server to host the HiveServer2 Interactive instance in the HiveServer2 Interactive field. In most cases, you can keep the default server host assignment.
Click Select in the Select HiverServer2 Interactive host window.
When the Settings tab opens again, review the additional configuration fields that appear in the Interactive Query section of the window:

Figure 1.8. Enabled Interactive Query Configuration Settings

In the Interactive Query Queue drop-down menu, select your predefined queue. This setting dedicates all the LLAP daemons and all the YARN ApplicationMasters of the system to the single, specified queue.
In the Number of nodes used by Hive LLAP field, enter the number of cluster nodes on which to run Hive LLAP. LLAP automatically deploys to the nodes, and you do not need to label the nodes.
In the Maximum Total Concurrent Queries field, move the slider to the maximum number of concurrent LLAP queries to run. The maximum number of concurrent queries always is the same as the number of active ApplicationMasters in the cluster.

Review the following settings, which are autogenerated for informational purposes only. (No interactive elements allow you to directly change the values.)

Memory per Daemon: YARN container size for each daemon (MB)

In-Memory Cache per Daemon: Size of the cache in each container (MB)

Number of executors per LLAP Daemon: The number of executors per daemon: for example, the number of fragments that can execute in parallel on a daemon

	Important
	When enabling Hive LLAP, the Run as end user instead of Hive user slider on the Settings tab has no effect on the Hive instance. If you set the slider to `True`, this property switches from Hive user to end user only when you run Hive in batch-processing mode.

Click the Save button near the top of the Ambari window.
Click Service Actions > Restart All.

Figure 1.9. Restart All in Top Right Corner of Ambari Window

Next Step

Connect your clients to a dedicated HiveServer2 endpoint.

Connect Clients to a Dedicated HiveServer2 Endpoint

About this Task

Hortonworks supports Hive JDBC drivers that enable you to connect to HiveServer2 so that you can query, analyze, and visualize data stored in the Hortonworks Data Platform. In this task, you get the autogenerated HiveServer2 JDBC URL so that you can connect your client to the Hive LLAP instance.

	Important
	Do not use Hive CLI as your JDBC client for Hive LLAP queries.

Prerequisite

Complete setup of Hive LLAP with Ambari, including restarting the Hive Service after saving the Enable Interactive Query settings.

Steps

Select the Hive service in the Ambari dashboard.
Click the Summary tab.
Use the clipboard icon to the right of the HiveServer2 JDBC URL values to copy the URLs, but paste only the second URL into a JDBC client (such as a BI tool or Beeline).

Figure 1.10. Summary Tab with the HiveServer2 JDBC URLs

Next Steps

You can run your queries in the client. Hive LLAP should be booted and ready to use.

If query performance is too slow, see the following chapters of this guide.

	Tip
	Hive View 2.0 in Ambari integrates with the general availability release of Hive LLAP. If you plan to use Hive View 2.0 with a Hive LLAP instance, ensure that the Use Interactive Mode property of Manage Ambari Views is set to `true`. See Settings and Cluster Configuration of the Ambari Views Guide..