Using Apache HiveQL

Create, use, and drop an external table

To import data into Hive from a file on HDFS, or from another file system, you must use an external table, which is a table that Hive does not manage.

This task shows how to create an external table, load data from a file into the table, store the data in a Hive managed table, and drop the external table. It demonstrates a major difference between an external and a managed (internal) table: the persistence of table data on the file system after a DROP TABLE statement.

  • External table drop: Hive drops only the metadata, which consists mainly of the schema definition.
  • Managed table drop: Hive deletes the data and the metadata stored in the Hive warehouse.

The external table data, which resides on the file system outside Hive, remains on the file system after you drop the external table. You can make the data available once again by issuing another CREATE EXTERNAL TABLE statement that loads the data from the file system, as shown in the sketch below.
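For example, the following sketch (using the same table definition and location as this task) shows how re-issuing the schema definition makes the data on the file system queryable again without copying any data:

    -- Re-create the dropped external table definition; Hive re-reads
    -- the data that still resides in /tmp/names.
    CREATE EXTERNAL TABLE IF NOT EXISTS names_text(
      student_ID INT, FirstName STRING, LastName STRING,
      year STRING, Major STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/tmp/names';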

This task also shows how to use the LOCATION clause in the CREATE TABLE statement to specify the location of an external table, as shown below; otherwise, Hive uses the location specified by hive.metastore.warehouse.dir. Specifying a database location in the CREATE DATABASE command, for example CREATE DATABASE <managed table db name> LOCATION '<path>', works for managed tables only.
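For example, the following sketch (the database name and path are hypothetical) stores managed tables created in that database under the given path instead of the hive.metastore.warehouse.dir default:

    -- Hypothetical database name and path: managed tables in this
    -- database are stored under /warehouse/school.
    CREATE DATABASE school LOCATION '/warehouse/school';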

As an administrator, you need to add ACLs that allow groups or users to create databases and tables on the file system when using Storage-Based Authorization (SBA).
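For example, the following sketch grants a hypothetical group full access to the directory used in this task (the group name hive_users is an assumption; substitute your own):

    # Grant the hive_users group read/write/execute on /tmp/names
    hdfs dfs -setfacl -m group:hive_users:rwx /tmp/names
    # Verify the ACL entries on the directory
    hdfs dfs -getfacl /tmp/names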

  1. Create a text file named students.csv that contains the following lines.
    1,jane,doe,senior,mathematics
    2,john,smith,junior,engineering
  2. Create a directory on HDFS named /tmp/names, and put students.csv in the directory.
    Enter the following Hadoop command on the command-line of a node in your cluster:
    hdfs dfs -mkdir /tmp/names
    hdfs dfs -put students.csv /tmp/names
    hdfs dfs -chmod 777 /tmp/names
  3. Type hive on the command line of a node in your cluster to start the Hive shell.
  4. Create an external table schema definition that specifies the text format, loads data from students.csv located in /tmp/names, and uses the LOCATION clause to keep the table data in /tmp/names.
    0: jdbc:hive2://c7402.ambari.apache.org:2181,> CREATE EXTERNAL TABLE IF NOT EXISTS names_text(
      student_ID INT, FirstName STRING, LastName STRING,
      year STRING, Major STRING)
      COMMENT 'Student Names'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/tmp/names';
  5. Verify that the student names are stored in the external table.
    0: jdbc:hive2://c7402.ambari.apache.org:2181,> SELECT * FROM names_text;
  6. Create the schema for a managed table.
    0: jdbc:hive2://c7402.ambari.apache.org:2181,> CREATE TABLE IF NOT EXISTS Names(
      student_ID INT, FirstName STRING, LastName STRING,
      year STRING, Major STRING)
      COMMENT 'Student Names'
      STORED AS ORC;
  7. Move the external table data to the managed table.
    0: jdbc:hive2://c7402.ambari.apache.org:2181,> INSERT OVERWRITE TABLE Names SELECT * FROM names_text;
  8. Verify that the student names from the external table now reside in the managed table.
    0: jdbc:hive2://c7402.ambari.apache.org:2181,> SELECT * FROM Names;
  9. Drop the external table and verify that the data still resides in the managed table while the external table schema definition is lost.
    0: jdbc:hive2://c7402.ambari.apache.org:2181,> DROP TABLE names_text;
    0: jdbc:hive2://c7402.ambari.apache.org:2181,> SELECT * FROM Names; -- data returned from the managed table
    0: jdbc:hive2://c7402.ambari.apache.org:2181,> SELECT * FROM names_text; -- fails: external table schema definition is lost
    The data from the managed table Names appears in the output. The query on names_text fails because the external table schema definition no longer exists. The students.csv file on HDFS that contains the student names data remains intact.
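    You can confirm from the command line of a node in your cluster that the file remains on HDFS:
    hdfs dfs -cat /tmp/names/students.csv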