Create, use, and drop an external table
You use an external table, which is a table that Hive does not manage, to import data from a file on HDFS or another file system into Hive.
In this task, you need access to HDFS to put a CSV file on HDFS. When using storage-based authorization (SBA), if no ACL is in place that allows you to access HDFS, you need to log in to a node on your cluster as the hdfs user. Alternatively, when using Ranger, you need to be authorized to access HDFS by a policy, such as the default HDFS all-path policy.
In this task, you create an external table, load data from a file into it, store the data in a Hive managed table, and then drop the external table. This task demonstrates the following Hive principles:
- A major difference between an external and a managed (internal) table: the persistence of table data on the file system after a DROP TABLE statement.
- External table drop: Hive drops only the metadata, which consists mainly of the schema definition.
- Managed table drop: Hive deletes the data and the metadata stored in the Hive warehouse.
- You can make the external table data available again after dropping the table by issuing another CREATE EXTERNAL TABLE statement to load the data from the file system.
- The LOCATION clause in the CREATE TABLE statement specifies the location of the external table data.
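As an optional check at any point in this task, you can confirm whether a given table is external or managed by looking at the Table Type field (EXTERNAL_TABLE or MANAGED_TABLE) in the output of DESCRIBE FORMATTED, for example after you create names_text below:
DESCRIBE FORMATTED names_text;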
Create a text file named students.csv that contains the student names data.
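For example, students.csv might contain comma-separated rows like the following (illustrative sample values only; any rows matching the five-column schema used later in this task work):
1,jane,doe,senior,mathematics
2,john,smith,junior,engineering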
As root, move the file to /home/hdfs on a node in your cluster, create a directory on HDFS in the /user directory called andrena that allows access by all, and put students.csv in the directory.
- On the command line of a node on your cluster, enter the following commands:
sudo su -
mv students.csv /home/hdfs
sudo su - hdfs
hdfs dfs -mkdir /user/andrena
hdfs dfs -chmod 777 /user/andrena
hdfs dfs -put /home/hdfs/students.csv /user/andrena
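Optionally, verify that the file is in place before continuing (a quick sanity check, not part of the original steps):
hdfs dfs -ls /user/andrena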
- If you are authorized to access HDFS through a Ranger policy, use the command line or Ambari to create the directory and put the students.csv file in it.
- On the command line of a node on your cluster, enter the following command to start the Hive shell:
hive
- Log in to Hive.
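Once you are in the Hive shell, an optional way to confirm the session is working is to list the tables in the current database (none of the tables for this task exist yet):
SHOW TABLES;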
Create an external table schema definition that specifies the text format and loads data from students.csv located in /user/andrena.
CREATE EXTERNAL TABLE IF NOT EXISTS names_text(
  student_ID INT,
  FirstName STRING,
  LastName STRING,
  year STRING,
  Major STRING)
COMMENT 'Student Names'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/andrena';
Verify that the external table contains the student names data.
SELECT * FROM names_text;
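If students.csv contains the illustrative rows shown earlier, this query returns those two rows.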
Create the schema for a managed table.
CREATE TABLE IF NOT EXISTS Names(
  student_ID INT,
  FirstName STRING,
  LastName STRING,
  year STRING,
  Major STRING)
COMMENT 'Student Names'
STORED AS ORC;
Move the external table data to the managed table.
INSERT OVERWRITE TABLE Names SELECT * FROM names_text;
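As an aside, you could combine the managed table creation and the data move into a single CREATE TABLE AS SELECT statement; a minimal sketch, using a hypothetical table name Names_ctas so it does not collide with the Names table used in this task:
CREATE TABLE Names_ctas STORED AS ORC AS SELECT * FROM names_text;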
Verify that the data from the external table resides in the managed table, drop the external table, and then verify that the data still resides in the managed table.
SELECT * FROM Names;
DROP TABLE names_text;
SELECT * FROM Names;
The results from the managed table Names appear.
Verify that the external table schema definition is lost.
SELECT * FROM names_text;
Selecting all from names_text returns no results because the external table schema is lost. The students.csv file on HDFS containing the student names data remains intact.
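To double-check that the underlying data file survived the drop, you can print it from outside Hive (an optional check using the same path as earlier):
hdfs dfs -cat /user/andrena/students.csv
As noted in the principles above, issuing the CREATE EXTERNAL TABLE statement again would make this data available in Hive once more.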