Using Apache HiveQL

Create, use, and drop an external table

You use an external table, which is a table that Hive does not manage, to import data from a file on HDFS, or another file system, into Hive.

In this task, you need access to HDFS so that you can put a CSV file on it. If you use storage-based authorization (SBA) and no ACL is in place that grants you access to HDFS, you need to log in to a node on your cluster as the hdfs user. Alternatively, if you use Ranger, you must be authorized by a policy, such as the default HDFS all-path policy, to access HDFS.

In this task, you create an external table, load data from a file into it, store the data in Hive using a managed table, and then drop the external table. This task demonstrates the following Hive principles:

  • A major difference between an external and a managed (internal) table: whether the table data persists on the file system after a DROP TABLE statement (see the sketch after this list).
    • External table drop: Hive drops only the metadata, which consists mainly of the schema definition.
    • Managed table drop: Hive deletes the data and the metadata stored in the Hive warehouse.
  • After you drop an external table, you can make the data available again by issuing another CREATE EXTERNAL TABLE statement that points to the same location on the file system.
  • The LOCATION clause in the CREATE TABLE statement specifies the location of the external table data.
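For example, before you drop anything, you can check how Hive classifies a table. The following is a minimal sketch, assuming the names_text table that you create later in this task; the Table Type row in the output tells you whether a DROP TABLE statement removes the data or only the metadata.
  -- Minimal sketch: check how Hive classifies a table before dropping it.
  -- EXTERNAL_TABLE: DROP TABLE removes only the metadata; the files stay put.
  -- MANAGED_TABLE: DROP TABLE removes the metadata and the warehouse data.
  DESCRIBE FORMATTED names_text;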
  1. Create a text file named students.csv that contains the following lines.
    1,jane,doe,senior,mathematics
    2,john,smith,junior,engineering               
  2. As root, move the file to /home/hdfs on a node in your cluster. As the hdfs user, create a directory named andrena under /user on HDFS that allows access by all, and put students.csv in the directory.
    • On the command-line of a node on your cluster, enter the following commands:
      sudo su -
      mv students.csv /home/hdfs
      sudo su - hdfs
      hdfs dfs -mkdir /user/andrena
      hdfs dfs -chmod 777 /user/andrena
      hdfs dfs -put /home/hdfs/students.csv /user/andrena
    • Alternatively, if you have authorization to HDFS through a Ranger policy, use the command line or Ambari to create the directory and put students.csv in it. You can confirm the upload as shown below.
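    If you want to confirm that the file landed where Hive expects it, a quick optional check from the command line (assuming the /user/andrena path used above) is:
      # Optional check: list the directory and display the uploaded file
      hdfs dfs -ls /user/andrena
      hdfs dfs -cat /user/andrena/students.csv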
  3. Type hive on the command line of a node on your cluster to start the Hive shell.
  4. Log into Hive.
  5. Create an external table schema definition that specifies the text format and reads the data from students.csv in /user/andrena.
    CREATE EXTERNAL TABLE IF NOT EXISTS names_text(
      student_ID INT, FirstName STRING, LastName STRING,    
      year STRING, Major STRING)
      COMMENT 'Student Names'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/user/andrena';
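    The LOCATION clause points to the directory /user/andrena, not to the file itself; Hive treats every file in that directory as data for the table. If you want to inspect the definition that Hive stored, one optional check is:
    -- Optional: show the table definition that Hive stored
    SHOW CREATE TABLE names_text;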
  6. Verify that Hive can read the student names through the external table.
    SELECT * FROM names_text;
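    If the external table is set up correctly, the query returns the two rows from students.csv, similar to the following (exact formatting depends on your client):
    1   jane   doe     senior   mathematics
    2   john   smith   junior   engineering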
  7. Create the schema for a managed table.
    CREATE TABLE IF NOT EXISTS Names(
      student_ID INT, FirstName STRING, LastName STRING,    
      year STRING, Major STRING)
      COMMENT 'Student Names'
      STORED AS ORC;
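    Because this statement has no EXTERNAL keyword and no LOCATION clause, Hive manages the table and stores its data in the Hive warehouse. An optional check, assuming default warehouse settings:
    -- Table Type should read MANAGED_TABLE and Location should point into the warehouse
    DESCRIBE FORMATTED Names;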
  8. Copy the external table data into the managed table.
    INSERT OVERWRITE TABLE Names SELECT * FROM names_text;
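    INSERT OVERWRITE replaces any data already in Names; to append instead, you would use INSERT INTO. A quick sketch to confirm the copy (the count should be 2 for the sample file):
    SELECT COUNT(*) FROM Names;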
  9. Verify that the data from the external table resides in the managed table, drop the external table, and then verify that the data still resides in the managed table.
    SELECT * from Names; 
    DROP TABLE names_text;
    SELECT * from Names;                  
    The results from the managed table Names appear both before and after you drop the external table.
  10. Verify that the external table schema definition is lost.
    SELECT * from names_text;
    Selecting from names_text no longer works because Hive has dropped the external table schema definition from the metastore. The students.csv file on HDFS that contains the student names data remains intact.
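    Because the file is intact, you can restore access to the data at any time by issuing the same external table definition again. A minimal sketch, reusing the schema and location from step 5:
    CREATE EXTERNAL TABLE IF NOT EXISTS names_text(
      student_ID INT, FirstName STRING, LastName STRING,
      year STRING, Major STRING)
      COMMENT 'Student Names'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/user/andrena';
    SELECT * FROM names_text;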