1. Apache Sqoop

Hortonworks Data Platform deploys Apache Sqoop for your Hadoop cluster.

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

For additional information see the Sqoop documentation, including these sections in the User Guide:

 1.1. Sqoop Connectors

Sqoop uses a connector based architecture which supports plugins that provide connectivity to new external systems. Using specialized connectors, Sqoop can connect with external systems that have optimized import and export facilities, or do not support native JDBC. Connectors are plugin components based on Sqoop’s extension framework and can be added to any existing Sqoop installation.

HDP provides the following connectors for Sqoop:

  • Sqoop connector for Teradata: This connector, based on the Teradata Connector for Hadoop, can be downloaded from here.

  • MySQL connector: This connector is provided with the Hortonworks repository and the instructions on configuring this connector are available here.

  • Oracle JDBC connector: The instructions on configuring this connector are available here.

  • Netezza connector: This connector is included in the HDP distribution and installs with Sqoop; see below for more information.

A Sqoop connector for SQL Server is available from Microsoft:

  • SQL Server R2 connector: This connector and its documentation can be downloaded from here.

 1.2. Sqoop Import Table Commands

When connecting to an Oracle database, the Sqoop import command requires case-sensitive table names and usernames (typically uppercase). Otherwise the import fails with error message "Attempted to generate class with no columns!"

Prior to the resolution of SQOOP-741, import-all-tables would fail for an Oracle database. See the JIRA for more information.

The import-all-tables command has additional restrictions. See Chapter 8 in the Sqoop User Guide.

 1.3. Netezza Connector

Netezza connector for Sqoop is an implementation of the Sqoop connector interfaces for accessing a Netezza data warehouse appliance, so that data can be exported and imported to a Hadoop environment from Netezza data warehousing environments.

The HDP Sqoop distribution includes Netezza connector software. To deploy it, the only requirement is that you acquire the JDBC jar file (named nzjdbc.jar) from IBM and copy it to the /usr/local/nz/lib directory.

 1.3.1. Extra Arguments

This table describes extra arguments supported by the Netezza connector:

Table 5.1. Supported Netezza Extra Arguments
Argument Description

--partitioned-access

Whether each mapper acts on a subset of data slices of a table or all.

Default is "false" for standard mode and "true" for direct mode.

--max-errors

Applicable only in direct mode. This option specifies the error threshold per mapper while transferring data. If the number of errors encountered exceeds this threshold, the job fails.

Default value is 1.

--log-dir

Applicable only in direct mode. Specifies the directory where Netezza external table operation logs are stored.

Default value is /tmp.

 1.3.2. Direct Mode

Netezza connector supports an optimized data transfer facility using the Netezza external tables feature. Each map task of Netezza connector’s import job works on a subset of the Netezza partitions and transparently creates and uses an external table to transport data. Similarly, export jobs use the external table to push data fast onto the NZ system. Direct mode does not support staging tables, upsert options, etc.

Direct mode is specified by the --direct Sqoop option.

Here is an example of a complete command line for import using the Netezza external table feature.

$ sqoop import \
    --direct \
    --connect jdbc:netezza://nzhost:5480/sqoop \
    --table nztable \
    --username nzuser \
    --password nzpass \
    --target-dir hdfsdir

Here is an example of a complete command line for export with tab (\t) as the field terminator character.

$ sqoop export \
    --direct \
    --connect jdbc:netezza://nzhost:5480/sqoop \
    --table nztable \
    --username nzuser \
    --password nzpass \
    --export-dir hdfsdir \
    --input-fields-terminated-by "\t"

 1.3.3. Null String Handling

In direct mode the Netezza connector supports the null-string features of Sqoop. Null string values are converted to appropriate external table options during export and import operations.

Table 5.2. Supported Export Control Arguments
Argument Description

--input-null-string <null-string>

The string to be interpreted as null for string columns.

--input-null-non-string <null-string>

The string to be interpreted as null for non-string columns.

In direct mode, both the arguments must either be left to the default values or explicitly set to the same value. Furthermore, the null string value is restricted to 0-4 UTF-8 characters.

On export, for non-string columns, if the chosen null value is a valid representation in the column domain, then the column might not be loaded as null. For example, if the null string value is specified as "1", then on export, any occurrence of "1" in the input file will be loaded as value 1 instead of NULL for int columns.

For performance and consistency, specify the null value as an empty string.

Table 5.3. Supported Import Control Arguments
Argument Description

--null-string <null-string>

The string to be interpreted as null for string columns.

--null-non-string <null-string>

The string to be interpreted as null for non-string columns.

In direct mode, both the arguments must either be left to the default values or explicitly set to the same value. Furthermore, the null string value is restricted to 0-4 UTF-8 characters.

On import, for non-string columns in the current implementation, the chosen null value representation is ignored for non-character columns. For example, if the null string value is specified as "\N", then on import, any occurrence of NULL for non-char columns in the table will be imported as an empty string instead of \N, the chosen null string representation.

For performance and consistency, specify the null value as an empty string.


loading table of contents...