Integrating Apache Hive with Spark and BI

Hive Warehouse Connector for accessing Apache Spark data

In HDP 3.0 and later, you need to use the Hive Warehouse Connector to access Spark tables from Hive. You can also use this connector to export tables from Spark to Hive and vice versa.

In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables on the same or different platforms. A table created by Spark resides in the Spark catalog, and a table created by Hive resides in the Hive catalog. Databases fall under the catalog namespace, similar to how tables belong to a database namespace. Although the catalogs are independent, their tables interoperate: you can see Spark tables in the Hive catalog, but only when you use the Hive Warehouse Connector.

You use the Hive Warehouse Connector API to access any type of table in the Hive catalog from Spark. When you use SparkSQL, standard Spark APIs access tables in the Spark catalog.
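
For example, a minimal sketch in the Spark shell (the same API is available from PySpark and spark-submit applications). The database and table names here are hypothetical, and the connector jar path and HiveServer2 JDBC URL depend on your cluster:

    // Launch the shell with the connector on the classpath, for example:
    //   spark-shell --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar
    // with spark.sql.hive.hiveserver2.jdbc.url set for your cluster.
    import com.hortonworks.hwc.HiveWarehouseSession

    // Build a Hive Warehouse Connector session backed by HiveServer2/LLAP.
    val hive = HiveWarehouseSession.session(spark).build()

    // Hive catalog: accessed through the connector API.
    val hiveDf = hive.executeQuery("SELECT * FROM hive_db.sales")

    // Spark catalog: accessed through standard Spark APIs.
    val sparkDf = spark.sql("SELECT * FROM spark_db.clicks")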

Using the Hive Warehouse Connector, you can export tables and extracts in either direction. To export from the Spark catalog to Hive, you read them using Spark APIs and write them to the Hive catalog using the Hive Warehouse Connector. To export from the Hive catalog to Spark, you read them using the Hive Warehouse Connector and write them to the Spark catalog using Spark APIs.
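
A sketch of both directions, assuming the import and the `hive` session built in the previous example; the table names are hypothetical:

    // Spark catalog -> Hive catalog: read with Spark APIs, write through the connector.
    spark.table("spark_db.clicks")
      .write
      .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
      .option("table", "clicks_copy")        // destination table in the Hive catalog
      .save()

    // Hive catalog -> Spark catalog: read through the connector, write with Spark APIs.
    hive.executeQuery("SELECT * FROM sales")
      .write
      .saveAsTable("spark_db.sales_copy")    // destination table in the Spark catalog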

Using the Hive Warehouse Connector, you can read and write Apache Spark DataFrames and Streaming DataFrames to and from Apache Hive using low-latency analytical processing (LLAP). From Spark, you can use the HiveWarehouseConnector to access managed (ACID) tables as well as external tables.
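
For example, reading a managed (ACID) table and an external table from Spark, again assuming the `hive` session built earlier; the table names are hypothetical:

    // Managed, transactional (ACID) table: read through the connector.
    val acidDf = hive.table("fact_sales")

    // External table: also readable through the connector API.
    val extDf = hive.executeQuery("SELECT * FROM ext_raw_logs")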

Apache Ranger and the HiveWarehouseConnector library provide fine-grained, row- and column-level access to Spark data in Hive.

The Hive Warehouse Connector supports the following applications:
  • Spark shell
  • PySpark
  • The spark-submit script
The following list describes a few of the operations supported by the Hive Warehouse Connector (a sketch of several of these operations follows the list):
  • Describing a table
  • Creating a table for ORC-formatted data
  • Selecting Hive data and retrieving a DataFrame
  • Writing a DataFrame to Hive in batch
  • Executing a Hive update statement
  • Reading table data from Hive, transforming it in Spark, and writing it to a new Hive table
  • Writing a DataFrame or Spark stream to Hive using HiveStreaming
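A sketch of several of these operations, assuming the `hive` session built earlier. The database, table, and column names are hypothetical, and the streaming options follow the connector's documented streaming API:

    hive.setDatabase("tpcds")

    // Describe a table.
    hive.describeTable("web_sales").show()

    // Create a table for ORC-formatted data.
    hive.createTable("new_web_sales")
      .ifNotExists()
      .column("ws_order_number", "bigint")
      .column("ws_quantity", "int")
      .create()

    // Select Hive data and retrieve a DataFrame, then write it back in batch.
    val sales = hive.executeQuery("SELECT ws_order_number, ws_quantity FROM web_sales")
    sales.write
      .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
      .option("table", "new_web_sales")
      .save()

    // Execute a Hive update statement.
    hive.executeUpdate("ALTER TABLE new_web_sales RENAME TO web_sales_copy")

    // Write a Spark stream to Hive using HiveStreaming (the socket source
    // and checkpoint path are placeholders).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()
    lines.writeStream
      .format(HiveWarehouseSession.STREAM_TO_STREAM)
      .option("database", "tpcds")
      .option("table", "web_sales_stream")
      .option("metastoreUri", spark.conf.get("spark.datasource.hive.warehouse.metastoreUri"))
      .option("checkpointLocation", "file:///tmp/checkpoint")
      .start()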