Data Access
Also available as:
loading table of contents...

Using Sqoop

Sqoop is a tool that enables you to bulk import and export data from a database. You can use Sqoop to import data into HDFS or directly into Hive. However, Sqoop can only import data into Hive as a text file or as a SequenceFile. To use the ORC file format, you must use a two-phase approach: first use Sqoop to move the data into HDFS, and then use Hive to convert the data into the ORC file format as described in the above Steps 3 and 4 of "Moving Data from HDFS to Hive Using an External Table."

For more information on using Sqoop, refer to Using Apache Sqoop to Transfer Bulk Data.

A detailed Sqoop user guide is also available on the Apache web site here.

The process for using Sqoop to move data into Hive is shown in the following diagram:

Figure 2.2. Using Sqoop to Move Data into Hive

Moving Data into Hive Using Sqoop

  1. Specify the source connection information.

    First, you must specify the:

    • database URI ( in the following example)

    • database name (bar)

    • connection protocol (jdbc:mysql:)

    For this example, use the following command:

    sqoop import --connect jdbc:mysql:// --table EMPLOYEES

    If the source database requires credentials, such as a username and password, you can enter the password on the command line or specify a file where the password is stored.

    For example:

    • Enter the password on the command line:

      sqoop import --connect jdbc:mysql:// --table EMPLOYEES --username <username> -P
      Enter password: (hidden)
    • Specify a file where the password is stored:

      sqoop import --connect jdbc:mysql:// --table EMPLOYEES --username <username> --password-file ${user.home}/.password

    More connection options are described in the Sqoop User Guide on the Apache web site.

  2. Specify the data and the parallelism for import:

    1. Specify the data simply.

      Sqoop provides flexibility to specify exactly the data you want to import from the source system:

      • Import an entire table:

        sqoop import --connect jdbc:mysql:// --table EMPLOYEES
      • Import a subset of the columns from a table:

        sqoop import --connect jdbc:mysql:// --table EMPLOYEES --columns "employee_id,first_name,last_name,job_title"
      • Import only the latest records by specifying them with a WHERE clause and then that they be appended to an existing table:

        sqoop import --connect jdbc:mysql:// --table EMPLOYEES  --where "start_date > '2010-01-01'"
        sqoop import --connect jdbc:mysql:// --table EMPLOYEES --where "id > 100000" --target-dir /incremental_dataset --append

      You can also use a free-form SQL statement.

    2. Specify parallelism.

      There are three options for specifying write parallelism (number of map tasks):

      • Explicitly set the number of mappers using --num-mappers. Sqoop evenly splits the primary key range of the source table:

        sqoop import --connect jdbc:mysql:// --table EMPLOYEES --num-mappers 8

        In this scenario, the source table must have a primary key.

      • Provide an alternate split key using --split-by. This evenly splits the data using the alternate split key instead of a primary key:

        sqoop import --connect jdbc:mysql:// --table EMPLOYEES --split-by dept_id

        This method is useful if primary keys are not evenly distributed.

      • When there is not split key or primary key, the data import must be sequential. Specify a single mapper by using --num-mappers 1 or --autoreset-to-one-mapper.

    3. Specify the data using a query.

      Instead of specifying a particular table or columns, you can specify the date with a query. You can use one of the following options:

      • Explicitly specify a split-by column using --split-by and put $ CONDITIONS that Sqoop replaces with range conditions based on the split-by key. This method requires a target directory:

        sqoop import --query 'SELECT a.*, b.* FROM a JOIN b on ( ==
        WHERE $CONDITIONS' --split-by --target-dir /user/foo/joinresults
      • Use sequential import if you cannot specify a split-by column:

        sqoop import --query 'SELECT a.*, b.* FROM a JOIN b on ( == 
        WHERE $CONDITIONS' -m 1 --target-dir /user/foo/joinresults

      To try a sample query without importing data, use the eval option to print the results to the command prompt:

      sqoop eval --connect jdbc:mysql:// --query "SELECT * FROM employees LIMIT 10"
  3. Specify the destination for the data: HDFS or Hive.

    Here is an example of specifying the HDFS target directory:

    sqoop import --query 'SELECT a.*, b.* FROM a JOIN b on ( == 
    WHERE $CONDITIONS' --split-by --target-dir /user/foo/joinresults

    If you can add text data into your Hive table, you can specify that the data be directly added to Hive. Using --hive-import is the primary method to add text data directly to Hive:

    sqoop import --connect jdbc:mysql:// --table EMPLOYEES --hive-import

    This method creates a metastore schema after storing the text data in HDFS.

    If you have already moved data into HDFS and want to add a schema, use the create-hive-table Sqoop command:

    sqoop create-hive-table (generic-args) (create-hive-table-args)

    Additional options for importing data into Hive with Sqoop:

    Table 2.12. Sqoop Command Options for Importing Data into Hive

    Sqoop Command Option


    --hive-home <directory>

    Overrides $HIVE_HOME.


    Imports tables into Hive using Hive's default delimiters if none are explicitly set.


    Overwrites existing data in the Hive table.


    Creates a hive table during the operation. If this option is set and the Hive table already exists, the job will fail. Set to false by default.

    --hive-table <table_name>

    Specifies the table name to use when importing data into Hive.


    Drops the delimiters \n, \r, and \01 from string fields when importing data into Hive.


    Replaces the delimiters \n, \r, and \01 from strings fields with a user-defined string when importing data into Hive.


    Specifies the name of the Hive field on which a sharded database is partitioned.

    --hive-partition-value <value>

    A string value that specifies the partition key for data imported into Hive.

    --map-column-hive <map>

    Overrides the default mapping from SQL type to Hive type for configured columns.