The Hadoop archiving tool can be invoked using the following command format:
hadoop archive -archiveName name -p <parent> <src>* <dest>
Where -archiveName is the name of the archive you would like to create. The archive name should be given a .har extension. The <parent> argument specifies the parent path: each <src> is given relative to <parent>, and the files are stored in the HAR under those relative paths. The <dest> argument is the directory in which the archive is created.
Example
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
This example creates an archive using /user/hadoop as the relative archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be archived in the /user/zoo/foo.har archive.
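To make the path handling concrete, the following shell sketch prints the paths that the example command works with. The function name har_paths is made up for this illustration and is not part of Hadoop; it only composes strings and does not contact HDFS.

```shell
#!/bin/sh
# har_paths is a hypothetical helper for illustration, not a Hadoop command.
# Usage: har_paths <archive-name> <parent> <dest> <src>...
har_paths() {
  name=$1; parent=$2; dest=$3
  shift 3
  # Each source is given relative to the parent path.
  for src in "$@"; do
    echo "source:  ${parent}/${src}"
  done
  # The archive is created under the destination directory.
  echo "archive: ${dest}/${name}"
}

# Arguments taken from the example command above.
har_paths foo.har /user/hadoop /user/zoo dir1 dir2
```

With the example arguments, the sketch lists /user/hadoop/dir1 and /user/hadoop/dir2 as sources and /user/zoo/foo.har as the archive.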
Archiving does not delete the source files. If you would like to delete the input files after creating an archive (to reduce namespace), you must manually delete the source files.
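One cautious way to handle the manual cleanup is to confirm that the archive can be listed before removing anything. The sketch below assumes the archive and source paths from the example, and that the default file system is HDFS so that the har:/// URI resolves there. The run helper and the DRY_RUN switch are hypothetical conveniences for this sketch: by default the commands are only printed, and setting DRY_RUN=0 executes them.

```shell
#!/bin/sh
# Sketch: remove the source directories only if the archive can be listed.
# ARCHIVE and the source paths come from the example; DRY_RUN and run()
# are hypothetical helpers for this sketch, not part of Hadoop.
ARCHIVE=/user/zoo/foo.har
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

cleanup_sources() {
  # List the archive contents through the har:// scheme; stop if that fails.
  run hdfs dfs -ls -R "har://${ARCHIVE}" || return 1
  # Remove the original directories now that the archive is readable.
  run hdfs dfs -rm -r /user/hadoop/dir1 /user/hadoop/dir2
}

cleanup_sources
```

Listing the archive only shows that it is readable; comparing its contents against the sources before deleting them is a stricter check.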
Although the hadoop archive command can be run from the host file system, the archive file is created in HDFS, from directories that exist in HDFS. If you reference a directory on the host file system rather than in HDFS, you will get the following error:
The resolved paths set is empty. Please check whether the srcPaths exist, where srcPaths = [</directory/path>]
To create the HDFS directories used in the preceding example, you would use the following series of commands:
hdfs dfs -mkdir /user/zoo
hdfs dfs -mkdir /user/hadoop
hdfs dfs -mkdir /user/hadoop/dir1
hdfs dfs -mkdir /user/hadoop/dir2
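To avoid the empty-paths error, it can help to confirm that each source directory exists in HDFS before archiving. The following is a minimal sketch using hdfs dfs -test -d; the function name check_sources is hypothetical, and the paths are those of the example.

```shell
#!/bin/sh
# check_sources is a hypothetical helper for illustration, not a Hadoop command.
# Usage: check_sources <parent> <src>...
# Returns non-zero if any source directory is missing from HDFS.
check_sources() {
  parent=$1
  shift
  missing=0
  for src in "$@"; do
    # "hdfs dfs -test -d" exits with 0 when the path is a directory in HDFS.
    if ! hdfs dfs -test -d "${parent}/${src}"; then
      echo "not found in HDFS: ${parent}/${src}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Archive only when the hdfs client is available and every source exists.
if command -v hdfs >/dev/null 2>&1 && check_sources /user/hadoop dir1 dir2; then
  hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
fi
```

A directory that exists only on the host file system fails this check, which is the situation that produces the error shown earlier.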