Migrating data

Create a Sqoop import command

Using Apache Sqoop, you create a single import command that imports data from diverse data sources, such as a relational database, into Hive.

You enter the Sqoop import command on the command line of your Hive cluster to import data from a data source into HDFS and Hive. The import command typically includes the following information (a combined sketch appears after this list):
  • Database connection information: database URI, database name, and connection protocol, such as jdbc:mysql:
  • The data to import
  • Parallel processing directives for performant data transfer
  • Destination for imported data
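  For example, the following is a minimal sketch of a complete import command that combines these elements; the database bar, table EMPLOYEES, and Hive table mydb.employees are placeholder names for your own environment:
  sqoop import --connect jdbc:mysql://db.foo.com:3306/bar \
  --username <username> \
  -P \
  --table EMPLOYEES \
  --num-mappers 8 \
  --hive-import \
  --hive-table mydb.employees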
  1. Create an import command that specifies the Sqoop connection to the RDBMS.
    • To be prompted for the data source password when you run the command, use the -P option.
    • To specify a file where the password is stored, use the --password-file option.
    Password on command line:
    sqoop import --connect jdbc:mysql://db.foo.com:3306/bar \
    <data to import> \
    --username <username> \
    -P
    Specify password file:
    sqoop import --connect jdbc:mysql://db.foo.com:3306/bar \
    --table EMPLOYEES \
    --username <username> \
    --password-file ${user.home}/.password 
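    If you use a password file, restrict access to it. A minimal sketch, assuming the password is stored in a .password file in your home directory (adjust the path to your setup; depending on configuration, Sqoop may resolve the path against HDFS rather than the local filesystem):
    # "mysecret" is a placeholder; write it without a trailing newline and restrict the file permissions
    echo -n "mysecret" > ${HOME}/.password
    chmod 400 ${HOME}/.password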
  2. Specify the data to import in the command.
    • Import an entire table.
    • Import a subset of the columns.
    • Import a subset of the rows using a WHERE clause, or use a free-form query (see the sketch after these examples).
    Entire table:
    sqoop import \
    --connect jdbc:mysql://db.foo.com:3306/bar \
    --table EMPLOYEES
    Subset of columns:
    sqoop import \
    --connect jdbc:mysql://db.foo.com:3306/bar \
    --table EMPLOYEES \
    --columns "employee_id,first_name,last_name,job_title"
    Subset of rows (WHERE clause) to import the latest data:
    sqoop import \
    --connect jdbc:mysql://db.foo.com:3306/bar \
    --table EMPLOYEES \
    --where "start_date > '2018-01-01'"
  3. Optionally, specify write parallelism in the import command to run multiple map tasks in parallel:
    • Set mappers: If the source table has a primary key, explicitly set the number of mappers using --num-mappers.
    • Split by: If primary key values are not evenly distributed, provide a split key using --split-by.
    • Sequential: If the table has no primary key or suitable split key, import the data sequentially by specifying --num-mappers 1 or --autoreset-to-one-mapper (see the sketch after this list).
    • Set mappers:
      sqoop import --connect jdbc:mysql://db.foo.com:3306/bar \
       --table EMPLOYEES \
       --num-mappers 8
    • Split by:
      sqoop import --connect jdbc:mysql://db.foo.com:3306/bar \
       --table EMPLOYEES \
       --split-by dept_id
    • Setting mappers evenly splits the primary key range of the source table.
    • Split by evenly splits the data using the split key instead of a primary key.
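    For the sequential case, a minimal sketch that imports the table with a single mapper:
      sqoop import --connect jdbc:mysql://db.foo.com:3306/bar \
       --table EMPLOYEES \
       --num-mappers 1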
  4. Specify that the data is to be imported into Hive, using the Hive default delimiters, by adding the --hive-import option.
  5. Specify the Hive destination of the data.
    • If the table does not already exist in Hive, name the table using --hive-table <db>.<table_name> and use the --create-hive-table option.
    • If you want to insert the imported data into an existing Hive external table, name the table using --hive-table <db>.<table_name>. Do not use the --create-hive-table option.
    This command imports the MySQL EMPLOYEES table into a new Hive table named mydb.newtable, stored in the default HDFS warehouse location /user/hive/warehouse.
    sqoop import --connect jdbc:mysql://db.foo.com:3306/corp \
    --table EMPLOYEES \
    --hive-import \
    --create-hive-table \
    --hive-table mydb.newtable
    This command imports the MySQL EMPLOYEES table into an existing Hive external table.
    sqoop import --connect jdbc:mysql://db.foo.com:3306/corp \
    --table EMPLOYEES \
    --hive-import \
    --hive-table mydb.myexternaltable
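    After the import completes, you can spot-check the new table from Beeline, for example (the connection URL and table name are placeholders for your environment):
    beeline -u jdbc:hive2://localhost:10000 -e "SELECT COUNT(*) FROM mydb.newtable;"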