Chapter 3. Using Apache Hive

Hortonworks Data Platform deploys Apache Hive for your Hadoop cluster.

Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for easy data ETL, a mechanism to impose structure on the data, and the capability to query and analyze large data sets stored in Hadoop files.

Hive provides SQL on Hadoop, enabling users familiar with SQL to query the data. At the same time, Hive allows programmers familiar with the MapReduce framework to plug in custom mappers and reducers to perform more sophisticated analysis than the language's built-in capabilities support.
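As an illustration of plugging custom logic into a query, Hive's TRANSFORM clause streams rows through a user-supplied script. The table, columns, and script below are hypothetical placeholders, not names from this deployment:

```shell
# Hypothetical sketch: page_views and my_mapper.py are invented names --
# substitute your own table and transform script.
hive -e "
  ADD FILE /tmp/my_mapper.py;
  SELECT TRANSFORM (userid, url)
         USING 'python my_mapper.py'
         AS (userid, visits)
  FROM page_views;
"
```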

Hive now includes the HCatalog subproject for managing metadata services on your Hadoop cluster. See Using HDP for Metadata Services (HCatalog) for more information.

Hive Documentation

Documentation for Hive release 0.11 can be found in the Hive wiki and javadocs.

The Hive wiki contains documentation organized in these sections:

  • General Information about Hive

  • User Documentation

  • Administrator Documentation

  • Resources for Contributors

Javadocs describe the Hive API.

Hive JIRAs

Issue tracking for Hive bugs and improvements can be found here: Hive JIRAs.

Hive ODBC Driver

Hortonworks provides a Hive ODBC driver that lets popular Business Intelligence (BI) tools query, analyze, and visualize data stored in the Hortonworks Data Platform.

  • Download the Hortonworks Hive ODBC driver for Linux, Windows, or Mac OS X from the "Add-Ons" for Hortonworks Data Platform 1.3 here.

  • The instructions on installing and using this driver are available here.

Hive Metastore Scripts

Metastore database initialization and upgrade scripts for Hive 0.11 are exactly the same as those for Hive 0.10, because the schema did not change. Script names were not changed to match the new release number.

For example, the script "hive-schema-0.10.0.mysql.sql" initializes a MySQL database for the Hive 0.11 metastore. The next section contains an example of initializing a Postgres database for the Hive 0.11 metastore with a script that has "0.10.0" in the filename.
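For instance, a MySQL metastore could be initialized with the 0.10.0-named script along these lines. The database name and user are assumptions, and the script path assumes the MySQL scripts are installed under the analogous /usr/lib/hive/scripts/metastore/upgrade/mysql/ directory:

```shell
# Assumed database name (hivemetastoredb) and user (hiveuser); the script
# path assumes the HDP default install layout.
mysql -u hiveuser -p hivemetastoredb \
  < /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql
```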

Using Postgres for the Hive Metastore

If you use a PostgreSQL (Postgres) database for the Hive metastore, follow these setup guidelines:

  • Add this property to the hive-site.xml file:

    <property>
      <name>datanucleus.autoCreateSchema</name>
      <value>false</value>
    </property>
            
  • Initialize the metastore database schema using the bundled metastore schema initialization script found at /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-schema-0.10.0.postgres.sql.

  • To upgrade to Postgres 9.1 or later, add this parameter to the /var/lib/pgsql/data/postgresql.conf file:

    standard_conforming_strings = off
            

    This setting is required only when upgrading from the current deployment, which uses Postgres 8.4; the default value of standard_conforming_strings changed in Postgres 9.1.
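The schema initialization step above can be sketched as a psql invocation. The host, database name, and user are placeholders; only the script path comes from this guide:

```shell
# Placeholders: substitute your metastore host, database, and user.
psql -h metastore-host -U hiveuser -d hivemetastoredb \
  -f /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-schema-0.10.0.postgres.sql
```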

Configuring Hive for Integration with HBase

Use the following procedure to integrate Hive with HBase in non-secure mode:

  1. Create an auxlib directory in the following location:

    mkdir /usr/lib/hive/auxlib

  2. Copy the following .jar files into the new directory:

    • /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-107.jar

    • /usr/lib/zookeeper/zookeeper-3.4.5.1.3.0.0-107.jar

    • /usr/lib/hbase/hbase-0.94.6.1.3.0.0-107-security.jar

    • /usr/lib/hbase/lib/guava-11.0.2.jar

    • /usr/lib/hive/lib/protobuf-java-2.4.1.jar

    • /usr/lib/hbase/lib/hadoop-core.jar

  3. Start Hive with the following configuration properties on the command-line interface:

    hive -hiveconf hbase.zookeeper.quorum=zk1,zk2,zk3 \
         -hiveconf hive.zookeeper.client.port=2181 \
         -hiveconf zookeeper.znode.parent=/hbase-unsecure \
         -hiveconf hive.hbase.wal.enabled=false

Note

This method of Hive-HBase integration does not support secure mode.
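Steps 1-3 above can be collected into a single shell sketch. The jar versions are those shipped with this HDP 1.3 release and should be adjusted to match your installation; the ZooKeeper quorum hosts are placeholders:

```shell
# Jar names match the HDP 1.3 release described in this chapter; adjust
# version suffixes for other releases.
mkdir -p /usr/lib/hive/auxlib
cp /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-107.jar \
   /usr/lib/zookeeper/zookeeper-3.4.5.1.3.0.0-107.jar \
   /usr/lib/hbase/hbase-0.94.6.1.3.0.0-107-security.jar \
   /usr/lib/hbase/lib/guava-11.0.2.jar \
   /usr/lib/hive/lib/protobuf-java-2.4.1.jar \
   /usr/lib/hbase/lib/hadoop-core.jar \
   /usr/lib/hive/auxlib/

# zk1,zk2,zk3 are placeholder ZooKeeper quorum hosts.
hive -hiveconf hbase.zookeeper.quorum=zk1,zk2,zk3 \
     -hiveconf hive.zookeeper.client.port=2181 \
     -hiveconf zookeeper.znode.parent=/hbase-unsecure \
     -hiveconf hive.hbase.wal.enabled=false
```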

Perform the following procedure to verify your configuration:

  1. Launch the HBase shell:

    hbase shell

  2. Create an HBase table named test with a column family named mstore:

    create 'test', 'mstore'

  3. Launch Hive with the required command-line options:

    hive -hiveconf hbase.zookeeper.quorum=zk1,zk2,zk3 \
         -hiveconf hive.zookeeper.client.port=2181 \
         -hiveconf zookeeper.znode.parent=/hbase-unsecure \
         -hiveconf hive.hbase.wal.enabled=false

  4. Create a table in Hive named user:

    CREATE EXTERNAL TABLE user (
     name string,
     id bigint,
     recipient_id bigint,
     type int )
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
     STORED AS TEXTFILE
     LOCATION '/user/hive/user';

  5. Create a table in Hive named user_insert:

    CREATE EXTERNAL TABLE user_insert (
     name string,
     id bigint,
     recipient_id bigint,
     type int )
     STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
     WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,mstore:name,mstore:id,mstore:recipient_id,mstore:type")
     TBLPROPERTIES ("hbase.table.name" = "test");

  6. Execute the following Hive command to insert data from the user table in Hive into the user_insert table in HBase:

    INSERT OVERWRITE TABLE user_insert SELECT * from user;
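To confirm the insert reached HBase, the rows can be read back from either side. This is a hedged sketch: the SELECT reads through the Hive storage handler, and scan is a standard HBase shell command; the ZooKeeper hosts are the placeholders used above:

```shell
# Read the rows back through the Hive storage handler...
hive -hiveconf hbase.zookeeper.quorum=zk1,zk2,zk3 \
     -hiveconf zookeeper.znode.parent=/hbase-unsecure \
     -e "SELECT * FROM user_insert;"

# ...or inspect the underlying HBase table directly.
echo "scan 'test'" | hbase shell
```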