Chapter 6. Using Apache Sqoop

 1. Apache Sqoop

Hortonworks Data Platform deploys Apache Sqoop for your Hadoop cluster.

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
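A typical round trip looks like the following sketch. The host, database, credentials, and table names are placeholders for illustration; only the Sqoop options themselves come from the tool.

```shell
# Import a table from MySQL into HDFS. Sqoop runs a MapReduce job,
# using four parallel map tasks by default.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/hadoop/orders

# After transforming the data in Hadoop, export the results back
# into an RDBMS table (the target table must already exist).
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username sqoop_user -P \
  --table orders_summary \
  --export-dir /user/hadoop/orders_summary
```

The `-P` option prompts for the password on the console, which avoids exposing it on the command line or in shell history.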

For additional information, see the Sqoop documentation.

 1.1. Sqoop Connectors

Sqoop uses a connector based architecture which supports plugins that provide connectivity to new external systems. Using specialized connectors, Sqoop can connect with external systems that have optimized import and export facilities, or do not support native JDBC. Connectors are plugin components based on Sqoop’s extension framework and can be added to any existing Sqoop installation.

A Sqoop connector for SQL Server is available from Microsoft:

  • SQL Server R2 connector: the connector and its documentation are available for download from Microsoft.

 1.2. Sqoop Import Table Commands

When connecting to an Oracle database, the Sqoop import command requires case-sensitive table names and usernames (typically uppercase). Otherwise, the import fails with the error message "Attempted to generate class with no columns!"
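For example, a working Oracle import might look like the following sketch. The host, service name, user, and table are placeholders; the point is that `--username` and `--table` are given in uppercase, matching how Oracle stores unquoted identifiers in its data dictionary.

```shell
# Oracle identifiers are stored in uppercase unless created with quoted
# names, so pass the username and table name in uppercase. Connection
# details below are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username SCOTT -P \
  --table EMPLOYEES \
  --target-dir /user/hadoop/employees
```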

Prior to the resolution of SQOOP-741, import-all-tables would fail for an Oracle database. See the JIRA for more information.

The import-all-tables command has additional restrictions. See Chapter 8 in the Sqoop User Guide.
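As a hedged sketch, a whole-database import might be invoked as follows; the connection details are placeholders, and the restrictions noted in the comments are summarized from the Sqoop User Guide.

```shell
# Import every table in the database into HDFS. Per the Sqoop User Guide,
# each table should have a single-column primary key (or the job must run
# with a single mapper), all columns are imported, and per-table options
# such as --target-dir, --columns, and --where are not available.
sqoop import-all-tables \
  --connect jdbc:mysql://dbhost/sales \
  --username sqoop_user -P \
  --warehouse-dir /user/hadoop/sales
```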