1. Using Pig to Bulk Load Data Into HBase

Use the following instructions to bulk load data into HBase using Pig:

  1. Prepare the input file.

    For example, consider a sample tab-separated file named data.tsv with the following contents (fields must be separated by tabs, since the Pig script in a later step reads them with PigStorage('\t')):

    row1	c1	c2
    row2	c1	c2
    row3	c1	c2
    row4	c1	c2
    row5	c1	c2
    row6	c1	c2
    row7	c1	c2
    row8	c1	c2
    row9	c1	c2
    row10	c1	c2
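
If you want to reproduce this sample input locally, a small shell loop can generate it. This is just a convenience sketch: the file name data.tsv and the row1..row10/c1/c2 values match the example above; adjust them for your own data.

```shell
# Generate the ten-row sample file shown above.
# printf emits a literal tab between fields, matching PigStorage('\t').
for i in $(seq 1 10); do
    printf 'row%s\tc1\tc2\n' "$i"
done > data.tsv
```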

  2. Make the data available on the cluster. Execute the following command on your HBase Server machine:

    hadoop fs -put $filename /tmp/

    Using the previous example:

    hadoop fs -put data.tsv /tmp/
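
Before proceeding, you can confirm that the file landed in HDFS. These are standard hadoop fs subcommands, using the /tmp path from the example above:

```shell
# List the uploaded file and preview its first few rows.
hadoop fs -ls /tmp/data.tsv
hadoop fs -cat /tmp/data.tsv | head -n 3
```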

  3. Create or register the HBase table in HCatalog. Execute the following command on your HBase Server machine:

    hcat -f $HBase_Table_DDL_file

    where $HBase_Table_DDL_file is a DDL file that defines the table. For example, consider a sample simple.ddl file with the following contents:

    CREATE TABLE
    simple_hcat_load_table (id STRING, c1 STRING, c2 STRING)
    STORED BY 'org.apache.hcatalog.hbase.HBaseHCatStorageHandler'
    TBLPROPERTIES (
      'hbase.table.name' = 'simple_hcat_load_table',
      'hbase.columns.mapping' = 'd:c1,d:c2',
      'hcat.hbase.output.bulkMode' = 'true'
    );
    

    Execute the following command:

    hcat -f simple.ddl
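
As a quick sanity check (assuming the hbase shell is on your path on the HBase Server machine), you can confirm that the table from simple.ddl now exists in HBase:

```shell
# Describe the newly registered table; an error here means
# the DDL in simple.ddl did not apply.
echo "describe 'simple_hcat_load_table'" | hbase shell
```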

  4. Create the import file. For example, create a file named simple.bulkload.pig with the following contents:

    Note:

    This import file uses the data.tsv file and simple.ddl table created previously. Ensure that you modify the contents of this file according to your environment.

    A = LOAD 'hdfs:///tmp/data.tsv' USING PigStorage('\t') AS (id:chararray, c1:chararray, c2:chararray);
    -- DUMP A;
    STORE A INTO 'simple_hcat_load_table' USING org.apache.hcatalog.pig.HCatStorer();
    

  5. Use Pig to populate the HBase table via the HCatalog bulk load.

    Continuing with the previous example, execute the following command on your HBase Server machine:

    pig -useHCatalog simple.bulkload.pig
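
After the Pig job completes, you can verify the load from the hbase shell. This is a sketch; the table name and the d column family come from the simple.ddl example above:

```shell
# Scan the populated table; each row should show columns d:c1 and d:c2.
echo "scan 'simple_hcat_load_table'" | hbase shell
```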