3. Sqoop-HCatalog Integration - Hortonworks Data Platform

3.1. HCatalog Background

HCatalog is a table and storage management service for Hadoop that enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored: RCFile format, text files, or SequenceFiles.

HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer-deserializer) has been written. By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile formats. To use a custom format, you must provide the InputFormat and OutputFormat as well as the SerDe.

The ability of HCatalog to abstract various storage formats is used in providing RCFile (and future file types) support to Sqoop.

3.2. Exposing HCatalog Tables to Sqoop

HCatalog integration with Sqoop is patterned on an existing feature set that supports Avro and Hive tables. Five new command line options are introduced, and some command line options defined for Hive are reused.

3.2.1. New Command Line Options

--hcatalog-database

Specifies the database name for the HCatalog table. If not specified, the default database name ‘default’ is used. Providing the --hcatalog-database option without --hcatalog-table is an error. This is not a required option.

--hcatalog-table

The argument value for this option is the HCatalog tablename.

The presence of the --hcatalog-table option signifies that the import or export job is done using HCatalog tables, and it is a required option for HCatalog jobs.

--hcatalog-home

The home directory for the HCatalog installation. The directory is expected to have a lib subdirectory and a share/hcatalog subdirectory with necessary HCatalog libraries. If not specified, the system environment variable HCAT_HOME will be checked and failing that, a system property hcatalog.home will be checked. If none of these are set, the default value will be used and currently the default is set to /usr/lib/hcatalog. This is not a required option.

--create-hcatalog-table

This option specifies whether an HCatalog table should be created automatically when importing data. By default, HCatalog tables are assumed to exist. The table name will be the same as the database table name translated to lower case. Further described in Automatic Table Creation below.

--hcatalog-storage-stanza

This option specifies the storage stanza to be appended to the table. Further described in Automatic Table Creation below.

3.2.2. Supported Sqoop Hive Options

The following Sqoop options are also used along with the --hcatalog-table option to provide additional input to the HCatalog jobs. Some of the existing Hive import job options are reused with HCatalog jobs instead of creating HCatalog-specific options for the same purpose.

--map-column-hive: This option maps a database column to HCatalog with a specific HCatalog type.
--hive-home: The Hive home location.
--hive-partition-key: Used for static partitioning filter. The partitioning key should be of type STRING. There can be only one static partitioning key.
--hive-partition-value: The value associated with the partition.

3.2.3. Unsupported Sqoop Options

Sqoop Hive options that are not supported with HCatalog jobs:

--hive-import
--hive-overwrite

Other Sqoop options that are not supported with HCatalog jobs:

--direct
--export-dir
--target-dir
--warehouse-dir
--append

3.2.4. Ignored Sqoop Options

All input delimiter options are ignored.

Output delimiters are generally ignored unless either --hive-drop-import-delims or --hive-delims-replacement is used. When the --hive-drop-import-delims or --hive-delims-replacement option is specified, database table columns of type STRING will be post-processed to either remove or replace the delimiters, respectively. (See Delimited Text Formats and Field and Line Delimiter Characters below.) This is only needed if the HCatalog table uses text format.

3.3. Automatic Table Creation

One of the key features of Sqoop is to manage and create the table metadata when importing into Hadoop. HCatalog import jobs also provide for this feature with the option --create-hcatalog-table. Furthermore, one of the important benefits of the HCatalog integration is to provide storage agnosticism to Sqoop data movement jobs. To provide for that feature, HCatalog import jobs provide an option that lets a user specifiy the storage format for the created table.

The option --create-hcatalog-table is used as an indicator that a table has to be created as part of the HCatalog import job.

The option --hcatalog-storage-stanza can be used to specify the storage format of the newly created table. The default value for this option is "stored as rcfile". The value specified for this option is assumed to be a valid Hive storage format expression. It will be appended to the CREATE TABLE command generated by the HCatalog import job as part of automatic table creation. Any error in the storage stanza will cause the table creation to fail and the import job will be aborted.

Any additional resources needed to support the storage format referenced in the option --hcatalog-storage-stanza should be provided to the job either by placing them in $HIVE_HOME\lib or by providing them in HADOOP_CLASSPATH and LIBJAR files.

If the option --hive-partition-key is specified, then the value of this option is used as the partitioning key for the newly created table. Only one partitioning key can be specified with this option.

Object names are mapped to the lowercase equivalents as specified below when mapped to an HCatalog table. This includes the table name (which is the same as the external store table name converted to lower case) and field names.

3.4. Delimited Text Formats and Field and Line Delimiter Characters

HCatalog supports delimited text format as one of the table storage formats. But when delimited text is used and the imported data has fields that contain those delimiters, then the data may be parsed into a different number of fields and records by Hive, thereby losing data fidelity.

For this case, one of these existing Sqoop import options can be used:

--hive-delims-replacement
--hive-drop-import-delims

If either of these options is provided on input, then any column of type STRING will be formatted with the Hive delimiter processing and then written to the HCatalog table.

3.5. HCatalog Table Requirements

The HCatalog table should be created before using it as part of a Sqoop job if the default table creation options (with optional storage stanza) are not sufficient. All storage formats supported by HCatalog can be used with the creation of the HCatalog tables. This makes this feature readily adopt new storage formats that come into the Hive project, such as ORC files.

3.6. Support for Partitioning

The Sqoop HCatalog feature supports the following table types:

Unpartitioned tables
Partitioned tables with a static partitioning key specified
Partitioned tables with dynamic partition keys from the database result set
Partitioned tables with a combination of a static key and additional dynamic partitioning keys

3.7. Schema Mapping

Sqoop currently does not support column name mapping. However, the user is allowed to override the type mapping. Type mapping loosely follows the Hive type mapping already present in Sqoop except that SQL types “FLOAT” and “REAL” are mapped to HCatalog type “float”. In the Sqoop type mapping for Hive, these two are mapped to “double”. Type mapping is primarily used for checking the column definition correctness only and can be overridden with the --map-column-hive option.

All types except binary are assignable to a String type.

Any field of number type (int, shortint, tinyint, bigint and bigdecimal, float and double) is assignable to another field of any number type during exports and imports. Depending on the precision and scale of the target type of assignment, truncations can occur.

Furthermore, date/time/timestamps are mapped to string (the full date/time/timestamp representation) or bigint (the number of milliseconds since epoch) during imports and exports.

BLOBs and CLOBs are only supported for imports. The BLOB/CLOB objects when imported are stored in a Sqoop-specific format and knowledge of this format is needed for processing these objects in a Pig/Hive job or another Map Reduce job.

Database column names are mapped to their lowercase equivalents when mapped to the HCatalog fields. Currently, case-sensitive database object names are not supported.

Projection of a set of columns from a table to an HCatalog table or loading to a column projection is allowed (subject to table constraints). The dynamic partitioning columns, if any, must be part of the projection when importing data into HCatalog tables.

Dynamic partitioning fields should be mapped to database columns that are defined with the NOT NULL attribute (although this is not validated). A null value during import for a dynamic partitioning column will abort the Sqoop job.

3.8. Support for HCatalog Data Types

All the primitive HCatalog types are supported. Currently all the complex HCatalog types are unsupported.

BLOB/CLOB database types are only supported for imports.

3.9. Providing Hive and HCatlog Libraries for the Sqoop Job

To run HCatalog jobs, you must set the environment variable HCAT_HOME to the proper HCatalog location on both the Sqoop client host and the hosts where the Map/Reduce tasks run.

3.10. Examples

Create an HCatalog table, such as:

hcat -e "create table txn(txn_date string, cust_id string, amount float, store_id int)
partitioned by (cust_id string) stored as rcfile;"

Then Sqoop import and export of the "txn" HCatalog table can be invoked as follows:

Import

$SQOOP_HOME\bin\sqoop import --connect <jdbc-url> -table <table-name> --hcatalog-table txn

Export

$SQOOP_HOME\bin\sqoop export --connect <jdbc-url> -table <table-name> --hcatalog-table txn