Accessing Cloud Data

Commands That May Be Slower with Cloud Object Storage

Some commands tend to be significantly slower with cloud object storage than when invoked against HDFS or other filesystems. Such commands include renaming files, listing files, find, mv, cp, and rm.

Renaming Files

Unlike in a normal filesystem, renaming a directory in an object store usually takes time at least proportional to the number of objects being manipulated. Because many of the filesystem shell operations use renaming as the final stage of an operation, skipping that stage can avoid long delays. Amazon S3's time to rename is proportional to the amount of data being renamed, so the larger the files being worked on, the longer the rename will take. This can become a significant delay.

We recommend that when using the hadoop fs -put and hadoop fs -copyFromLocal commands, you set the -d option for a direct upload. For example:

# Upload a file from the cluster filesystem
hadoop fs -put -d /datasets/example.orc s3a://bucket1/datasets/

# Upload a file from the local filesystem
hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket1/datasets/

# Create a file from stdin
echo "hello" | hadoop fs -put -d -f - s3a://bucket1/datasets/hello.txt

Listing Files

Commands which list many files may be significantly slower with object stores, especially those which scan the entire directory tree:

hadoop fs -count s3a://bucket1/
hadoop fs -du s3a://bucket1/     

Our recommendation is to use these sparingly, and to avoid them when working with buckets/containers containing many millions of entries.
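
When a listing is unavoidable, restricting the command to a specific subdirectory reduces the number of objects enumerated. The subdirectory below is a hypothetical example:

# Summarize one subdirectory rather than the whole bucket
hadoop fs -du s3a://bucket1/datasets/2017/

# A non-recursive listing of a single directory touches far fewer objects
hadoop fs -ls s3a://bucket1/datasets/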

Find

The find command can be very slow on a large store with many directories under the path supplied.

# Enumerate all files in the bucket
hadoop fs -find s3a://bucket1/ -print

# List *.txt in the bucket.
# Remember to escape the wildcard to stop the bash shell trying to expand it
hadoop fs -find s3a://bucket1/datasets/ -name \*.txt -print
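
Because the cost of the scan grows with the number of entries beneath the path supplied, point find at the narrowest prefix that covers the data of interest. The subdirectory below is a hypothetical example:

# Scan one subtree instead of the whole bucket
hadoop fs -find s3a://bucket1/datasets/2017/ -name \*.txt -print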

Rename

In Amazon S3, the time to rename a file depends on its size. The time to rename a directory depends on the number and size of all files beneath that directory. For WASB, GCS, and ADLS, the time to rename is simply proportional to the number of files. If a rename operation is interrupted, the object store may be left in an undefined state, with some of the source files renamed and others still in their original paths. There may also be duplicate copies of the data.

hadoop fs -mv s3a://bucket1/datasets s3a://bucket1/historical

Copy

The hadoop fs -cp operation reads each file and then writes it back to the object store; the time to complete depends on the amount of data to copy, and on the bandwidth between the local computer and the object store.

As an example, the following command performs the copy by downloading all the data and uploading it again:

hadoop fs -cp \
adl://alice.azuredatalakestore.net/current \
adl://alice.azuredatalakestore.net/historical
Note: The further the VMs are from the object store, the longer the copy process takes.
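
For large directory trees, it may be faster to run the copy as a distributed job with hadoop distcp, which spreads the work across the cluster's worker nodes instead of funneling every byte through a single host. A minimal sketch, reusing the paths from the example above:

# Run the copy as a MapReduce job distributed across the cluster
hadoop distcp \
adl://alice.azuredatalakestore.net/current \
adl://alice.azuredatalakestore.net/historical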