Tutorial: Working with Data on Amazon S3

This tutorial will help you get started accessing data stored on Amazon S3 from a cluster created through Hortonworks Data Cloud for AWS. The tutorial assumes no prior experience with AWS.

Overview

In this tutorial, you will:

  1. Create a directory in HDFS that the cloudbreak user can write to.
  2. Copy a file from a public Amazon S3 bucket to HDFS.
  3. Create your own S3 bucket and a folder inside it.
  4. Copy the file from HDFS to the new bucket.
  5. Clean up, so the copied files don't add to your charges.

Let's get started!

Prerequisites

Before starting this tutorial, your cloud controller needs to be running, and you must have a cluster running on AWS. To set up the cloud controller and the cluster, refer to this tutorial. The steps below copy data to and from buckets in the Oregon (us-west-2) region, so it is recommended, although not required, that your cluster runs in that region.

Accessing HDFS in HDCloud for AWS

  1. SSH to a cluster node.

    You can copy the SSH information from the cloud controller UI.
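
    For example, the command will look something like this; the key file name and host below are placeholders, so use the values shown in your cloud controller UI:

    ssh -i ~/.ssh/my-keypair.pem cloudbreak@<master-node-public-ip>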

  2. In HDCloud clusters, after you SSH to a cluster node, the default user is cloudbreak. The cloudbreak user doesn’t have write access to HDFS, so let’s create a directory to which we will copy the data, and then change its owner and permissions so that the cloudbreak user can write to it:

    sudo -u hdfs hdfs dfs -mkdir /user/cloudbreak
    sudo -u hdfs hdfs dfs -chown cloudbreak /user/cloudbreak
    sudo -u hdfs hdfs dfs -chmod 700 /user/cloudbreak

Now you will be able to copy data to the newly created directory.
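
To sanity-check before copying, you can list /user and confirm that the cloudbreak entry is owned by cloudbreak with drwx------ permissions, matching the chown and chmod 700 commands above:

    hadoop fs -ls /user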

Copying from S3 to HDFS

We will copy the scene_list.gz file from a public S3 bucket called landsat-pds to HDFS:

  1. First, let’s check if the scene_list.gz file that we are trying to copy exists in the S3 bucket:

    hadoop fs -ls s3a://landsat-pds/scene_list.gz

  2. You should see something similar to:

    -rw-rw-rw- 1 cloudbreak 33410181 2016-11-18 17:16 s3a://landsat-pds/scene_list.gz
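
    The listing works without any AWS credentials because landsat-pds is a public bucket. If you were reading from a private bucket instead, you could supply credentials on the command line. The following is only a sketch; my-private-bucket and the key values are placeholders:

    hadoop fs -D fs.s3a.access.key=YOUR_ACCESS_KEY -D fs.s3a.secret.key=YOUR_SECRET_KEY -ls s3a://my-private-bucket/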

  3. Next, let's copy scene_list.gz to your HDFS home directory using the following command:

    hadoop distcp s3a://landsat-pds/scene_list.gz .
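
    Here the "." target resolves to your HDFS home directory, so the command above is equivalent to this explicit form; either one produces the same result:

    hadoop distcp s3a://landsat-pds/scene_list.gz /user/cloudbreak/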

  4. You should see something similar to:

    [cloudbreak@ip-10-0-1-208 ~]$ hadoop distcp s3a://landsat-pds/scene_list.gz .
    16/11/18 22:00:50 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[s3a://landsat-pds/scene_list.gz], targetPath=, targetPathExists=true, filtersFile='null'}
    16/11/18 22:00:51 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
    16/11/18 22:00:51 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
    16/11/18 22:00:51 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
    16/11/18 22:00:53 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
    16/11/18 22:00:53 INFO tools.SimpleCopyListing: Build file listing completed.
    16/11/18 22:00:53 INFO tools.DistCp: Number of paths in the copy list: 1
    16/11/18 22:00:53 INFO tools.DistCp: Number of paths in the copy list: 1
    16/11/18 22:00:53 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
    16/11/18 22:00:53 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
    16/11/18 22:00:53 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
    16/11/18 22:00:53 INFO mapreduce.JobSubmitter: number of splits:1
    16/11/18 22:00:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1479498757313_0009
    16/11/18 22:00:54 INFO impl.YarnClientImpl: Submitted application application_1479498757313_0009
    16/11/18 22:00:54 INFO mapreduce.Job: The url to track the job: http://ip-10-0-1-208.ec2.internal:8088/proxy/application_1479498757313_0009/
    16/11/18 22:00:54 INFO tools.DistCp: DistCp job-id: job_1479498757313_0009
    16/11/18 22:00:54 INFO mapreduce.Job: Running job: job_1479498757313_0009
    16/11/18 22:01:01 INFO mapreduce.Job: Job job_1479498757313_0009 running in uber mode : false
    16/11/18 22:01:01 INFO mapreduce.Job: map 0% reduce 0%
    16/11/18 22:01:11 INFO mapreduce.Job: map 100% reduce 0%
    16/11/18 22:01:11 INFO mapreduce.Job: Job job_1479498757313_0009 completed successfully
    16/11/18 22:01:11 INFO mapreduce.Job: Counters: 38
    File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=145318
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=349
    HDFS: Number of bytes written=33410189
    HDFS: Number of read operations=13
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=4
    S3A: Number of bytes read=33410181
    S3A: Number of bytes written=0
    S3A: Number of read operations=3
    S3A: Number of large read operations=0
    S3A: Number of write operations=0
    Job Counters
    Launched map tasks=1
    Other local map tasks=1
    Total time spent by all maps in occupied slots (ms)=8309
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=8309
    Total vcore-milliseconds taken by all map tasks=8309
    Total megabyte-milliseconds taken by all map tasks=8508416
    Map-Reduce Framework
    Map input records=1
    Map output records=0
    Input split bytes=121
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=54
    CPU time spent (ms)=3520
    Physical memory (bytes) snapshot=281440256
    Virtual memory (bytes) snapshot=2137710592
    Total committed heap usage (bytes)=351272960
    File Input Format Counters
    Bytes Read=228
    File Output Format Counters
    Bytes Written=8
    org.apache.hadoop.tools.mapred.CopyMapper$Counter
    BYTESCOPIED=33410181
    BYTESEXPECTED=33410181
    COPY=1
    [cloudbreak@ip-10-0-1-208 ~]$
    

  5. Now let’s check if the file that we just copied is present in the cloudbreak directory:

    hadoop fs -ls
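
    With no path argument, hadoop fs -ls lists your HDFS home directory, so the command above is shorthand for:

    hadoop fs -ls /user/cloudbreak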

  6. You should see something similar to:

    -rw-r--r-- 3 cloudbreak hdfs 33410181 2016-11-18 21:30 scene_list.gz

Congratulations! You’ve successfully copied a file from an S3 bucket to HDFS!
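
The scene_list.gz file is a gzipped CSV of Landsat scenes, so you can peek at the first few rows without copying it out of HDFS. This optional check streams the file through gzip; head may close the pipe early, which can produce a harmless write error from hadoop fs -cat:

    hadoop fs -cat scene_list.gz | gzip -dc | head -n 3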

Creating an S3 Bucket

In this step, we will copy the scene_list.gz file from the cloudbreak directory to an S3 bucket. But before that, we need to create a new S3 bucket.

  1. In your browser, navigate to the S3 Dashboard at https://console.aws.amazon.com/s3/home.

  2. Click on Create Bucket and create a bucket:

    For example, here I am creating a bucket called “domitest”. Since my cluster and source data are in the Oregon region, I am creating this bucket in that region.

  3. Next, navigate to the newly created bucket, and create a folder:

    For example, here I am creating a folder called “demo”.

  4. Now, from our cluster node, let’s check if the bucket and folder that we just created exist:

    hadoop fs -ls s3a://domitest/

  5. You should see something similar to:

    Found 1 items
    drwxrwxrwx - cloudbreak 0 2016-11-18 22:17 s3a://domitest/demo

Congratulations! You’ve successfully created an Amazon S3 bucket.
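
If you prefer the command line to the console, the same bucket and folder can be created with the AWS CLI. This is a sketch that assumes the AWS CLI is installed and configured with credentials that are allowed to create buckets; domitest and demo match the names used above:

    aws s3 mb s3://domitest --region us-west-2
    aws s3api put-object --bucket domitest --key demo/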

Copying from HDFS to S3

  1. Now let’s copy the scene_list.gz file from HDFS to this newly created bucket:

    hadoop distcp /user/cloudbreak/scene_list.gz s3a://domitest/demo

  2. You should see something similar to:

    [cloudbreak@ip-10-0-1-208 ~]$ hadoop distcp /user/cloudbreak/scene_list.gz s3a://domitest/demo
    16/11/18 22:20:32 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[/user/cloudbreak/scene_list.gz], targetPath=s3a://domitest/demo, targetPathExists=true, filtersFile='null'}
    16/11/18 22:20:33 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
    16/11/18 22:20:33 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
    16/11/18 22:20:33 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
    16/11/18 22:20:34 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
    16/11/18 22:20:34 INFO tools.SimpleCopyListing: Build file listing completed.
    16/11/18 22:20:34 INFO tools.DistCp: Number of paths in the copy list: 1
    16/11/18 22:20:34 INFO tools.DistCp: Number of paths in the copy list: 1
    16/11/18 22:20:34 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-1-208.ec2.internal:8188/ws/v1/timeline/
    16/11/18 22:20:34 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-1-208.ec2.internal/10.0.1.208:8050
    16/11/18 22:20:34 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-1-208.ec2.internal/10.0.1.208:10200
    16/11/18 22:20:34 INFO mapreduce.JobSubmitter: number of splits:1
    16/11/18 22:20:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1479498757313_0010
    16/11/18 22:20:35 INFO impl.YarnClientImpl: Submitted application application_1479498757313_0010
    16/11/18 22:20:35 INFO mapreduce.Job: The url to track the job: http://ip-10-0-1-208.ec2.internal:8088/proxy/application_1479498757313_0010/
    16/11/18 22:20:35 INFO tools.DistCp: DistCp job-id: job_1479498757313_0010
    16/11/18 22:20:35 INFO mapreduce.Job: Running job: job_1479498757313_0010
    16/11/18 22:20:42 INFO mapreduce.Job: Job job_1479498757313_0010 running in uber mode : false
    16/11/18 22:20:42 INFO mapreduce.Job: map 0% reduce 0%
    16/11/18 22:20:53 INFO mapreduce.Job: map 100% reduce 0%
    16/11/18 22:21:01 INFO mapreduce.Job: Job job_1479498757313_0010 completed successfully
    16/11/18 22:21:01 INFO mapreduce.Job: Counters: 38
    File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=145251
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=33410572
    HDFS: Number of bytes written=8
    HDFS: Number of read operations=10
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
    S3A: Number of bytes read=0
    S3A: Number of bytes written=33410181
    S3A: Number of read operations=14
    S3A: Number of large read operations=0
    S3A: Number of write operations=4098
    Job Counters
    Launched map tasks=1
    Other local map tasks=1
    Total time spent by all maps in occupied slots (ms)=14695
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=14695
    Total vcore-milliseconds taken by all map tasks=14695
    Total megabyte-milliseconds taken by all map tasks=15047680
    Map-Reduce Framework
    Map input records=1
    Map output records=0
    Input split bytes=122
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=57
    CPU time spent (ms)=4860
    Physical memory (bytes) snapshot=280420352
    Virtual memory (bytes) snapshot=2136977408
    Total committed heap usage (bytes)=350748672
    File Input Format Counters
    Bytes Read=269
    File Output Format Counters
    Bytes Written=8
    org.apache.hadoop.tools.mapred.CopyMapper$Counter
    BYTESCOPIED=33410181
    BYTESEXPECTED=33410181
    COPY=1
    

  3. Next, let’s check if the file that we copied is present in the demo folder of the new bucket:

    hadoop fs -ls s3a://domitest/demo

  4. You should see something similar to:

    Found 1 items
    -rw-rw-rw- 1 cloudbreak 33410181 2016-11-18 22:20 s3a://domitest/demo/scene_list.gz

  5. You will also see the file appear on the S3 Dashboard.

Congratulations! You’ve successfully copied the file from HDFS to the S3 bucket!
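
As an optional integrity check, you can compare the file sizes on both ends; each should report 33410181 bytes, matching the BYTESCOPIED counter above:

    hadoop fs -du /user/cloudbreak/scene_list.gz
    hadoop fs -du s3a://domitest/demo/scene_list.gz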

Next Steps

  1. Try creating another bucket. Using similar syntax, you can try copying files between two S3 buckets that you created.
  2. If you want to copy more files, try adding the -D fs.s3a.fast.upload=true option and see how it accelerates the transfer (a sketch of the full command follows this list). Note that when working with S3 buckets, copying whole directories is very slow and inefficient, so make sure to copy files rather than directories.
  3. Try running more hadoop fs commands listed here.
  4. Learn more about the landsat-pds bucket at https://pages.awscloud.com/public-data-sets-landsat.html.
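
As mentioned in step 2 above, here is a sketch of the faster upload variant, reusing the file and bucket names from this tutorial:

    hadoop distcp -D fs.s3a.fast.upload=true /user/cloudbreak/scene_list.gz s3a://domitest/demo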

Cleaning Up

Any files stored on S3 or in HDFS add to your charges, so you should get into the habit of deleting files that you no longer need.

  1. To delete the scene_list.gz file from HDFS, run:
    hadoop fs -rm -skipTrash /user/cloudbreak/scene_list.gz

  2. To delete the scene_list.gz file from the S3 bucket, run:
    hadoop fs -rm -skipTrash s3a://domitest/demo/scene_list.gz
    Or, you can delete it from the S3 Dashboard.
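
If you created the domitest bucket just for this tutorial, you can also remove the bucket itself. With the AWS CLI, assuming it is installed and configured, the --force flag deletes any remaining objects before removing the bucket:

    aws s3 rb s3://domitest --force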