Integrating Apache Hive with Kafka, Spark, and BI

Apache Spark-Apache Hive connection configuration

You can configure Spark properties in Ambari to use the Hive Warehouse Connector for accessing ACID table data in Hive.

Prerequisites

You need to use the following software to connect Spark and Hive using the HiveWarehouseConnector library.

  • HDP 3.0 or later
  • Spark2
  • Hive with HiveServer Interactive (HSI)

    The Hive Warehouse Connector (HWC) and low-latency analytical processing (LLAP) are required for certain tasks, as shown in the following table:

    Table 1. Spark Compatibility
    Tasks                                 | HWC Required | LLAP Required        | Other Requirement/Comments
    ------------------------------------- | ------------ | -------------------- | ------------------------------------------------------------------
    Read Hive managed tables from Spark   | Yes          | Yes                  | Ranger ACLs enforced.
    Write Hive managed tables from Spark  | Yes          | No                   | Ranger ACLs enforced. Supports ORC only.
    Read Hive external tables from Spark  | No           | Only if HWC is used  | Table must be defined in Spark catalog. Ranger ACLs not enforced.
    Write Hive external tables from Spark | No           | No                   | Ranger ACLs enforced.
    Read Spark tables                     | Yes          | Yes                  | Ranger ACLs enforced.
    Write Spark tables                    | Yes          | No                   | Ranger ACLs enforced. Supports ORC only.

    You need low-latency analytical processing (LLAP) in HSI to read ACID or other Hive-managed tables from Spark. You do not need LLAP to write to ACID or other managed tables from Spark; the HWC library internally uses the Hive Streaming API and LOAD DATA Hive commands to write the data. You also do not need LLAP to access external tables from Spark, subject to the caveats shown in the table above. A minimal read/write sketch follows.
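
    The following sketch shows what a read and a write through HWC can look like from spark-shell, assuming the connector jar is on the Spark classpath and the properties described in the next section are set. The database and table names (sales_db.web_sales, sales_db.sales_copy) are hypothetical.

      // Build a HiveWarehouseSession from the active SparkSession (spark).
      import com.hortonworks.hwc.HiveWarehouseSession
      import com.hortonworks.hwc.HiveWarehouseSession._

      val hive = HiveWarehouseSession.session(spark).build()

      // Read a Hive managed (ACID) table; this path requires LLAP.
      val salesDF = hive.executeQuery("SELECT * FROM sales_db.web_sales")  // hypothetical table

      // Write to a Hive managed table; LLAP is not required for writes.
      salesDF.write
        .format(HIVE_WAREHOUSE_CONNECTOR)        // com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector
        .option("table", "sales_db.sales_copy")  // hypothetical target table
        .save()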

Required properties

You must add several Spark properties through spark-2-defaults in Ambari to use the Hive Warehouse Connector for accessing data in Hive. Alternatively, you can provide the configuration for each job using --conf; examples of both approaches appear after the respective lists below.
  • spark.sql.hive.hiveserver2.jdbc.url

    The URL for HiveServer2 Interactive

  • spark.datasource.hive.warehouse.metastoreUri

    The URI for the metastore

  • spark.datasource.hive.warehouse.load.staging.dir

    The HDFS temp directory for batch writes to Hive, /tmp for example

  • spark.hadoop.hive.llap.daemon.service.hosts

    The application name for the LLAP service

  • spark.hadoop.hive.zookeeper.quorum

    The ZooKeeper hosts used by LLAP
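
As a sketch, the properties might look like the following in spark-2-defaults. Every value here is a placeholder; take the real values from your cluster as described below.

  spark.sql.hive.hiveserver2.jdbc.url              jdbc:hive2://mycluster-1.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive
  spark.datasource.hive.warehouse.metastoreUri     thrift://mycluster-1.com:9083
  spark.datasource.hive.warehouse.load.staging.dir /tmp
  spark.hadoop.hive.llap.daemon.service.hosts      @llap0
  spark.hadoop.hive.zookeeper.quorum               mycluster-1.com:2181,mycluster-2.com:2181,mycluster-3.com:2181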

Set the values of these properties as follows:
  • spark.sql.hive.hiveserver2.jdbc.url

    In Ambari, copy the value from Services > Hive > Summary > HIVESERVER2 INTERACTIVE JDBC URL.

  • spark.datasource.hive.warehouse.metastoreUri

    Copy the value from hive.metastore.uris. In Hive, at the hive> prompt, enter set hive.metastore.uris and copy the output. For example, thrift://mycluster-1.com:9083.

  • spark.hadoop.hive.llap.daemon.service.hosts

    Copy the value from Advanced hive-interactive-site > hive.llap.daemon.service.hosts.

  • spark.hadoop.hive.zookeeper.quorum

    Copy the value from Advanced hive-site > hive.zookeeper.quorum.
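
If you configure each job instead of using spark-2-defaults, a spark-shell invocation can pass the same properties with --conf, as sketched below. The jar path and all values are placeholders for your cluster; the <version> portion of the assembly jar name depends on your HDP release.

  spark-shell --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar \
    --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://mycluster-1.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive" \
    --conf spark.datasource.hive.warehouse.metastoreUri="thrift://mycluster-1.com:9083" \
    --conf spark.datasource.hive.warehouse.load.staging.dir="/tmp" \
    --conf spark.hadoop.hive.llap.daemon.service.hosts="@llap0" \
    --conf spark.hadoop.hive.zookeeper.quorum="mycluster-1.com:2181,mycluster-2.com:2181,mycluster-3.com:2181"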

Other HWC properties

The spark.datasource.hive.warehouse.write.path.strictColumnNamesMapping property validates the mapping of columns against those in Hive to alert the user to input errors. Default = true.
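
For example, to relax this check for a single job, you could override the property with --conf (a sketch; weigh carefully before disabling validation):

  spark-shell --conf spark.datasource.hive.warehouse.write.path.strictColumnNamesMapping=false ...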

Spark on a Kerberized YARN cluster

In Spark client mode on a Kerberized YARN cluster, set the following property: spark.sql.hive.hiveserver2.jdbc.url.principal. This property must be equal to hive.server2.authentication.kerberos.principal. In Ambari, copy the value for this property from Services > Hive > Configs > Advanced > Advanced hive-site > hive.server2.authentication.kerberos.principal.
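
As a sketch, the extra property is passed alongside the others; the principal shown is a placeholder for the value copied from Ambari.

  --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@EXAMPLE.COM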