Using HDCloud with Amazon S3

No configuration is required to access Amazon S3 data from clusters created via HDCloud. S3 authentication is set up automatically at cluster creation time: by default, the "Instance Role" option creates a new AWS role that grants role-based access to Amazon S3. This allows you to access S3 buckets that are part of the AWS account in which HDCloud is running.

However, this option does not give you access to private buckets outside your AWS account. If you want to configure access to a private bucket that is not part of the AWS account in which HDCloud is running, use per-bucket authentication configuration.
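
For example, the S3A connector supports per-bucket credentials through properties of the form fs.s3a.bucket.<bucketname>.access.key and fs.s3a.bucket.<bucketname>.secret.key. A minimal sketch as custom core-site properties, assuming a hypothetical private bucket named "partner-bucket" and placeholder credential values:

[
  {
    "core-site" : {
      "fs.s3a.bucket.partner-bucket.access.key" : "YOUR_ACCESS_KEY_ID",
      "fs.s3a.bucket.partner-bucket.secret.key" : "YOUR_SECRET_ACCESS_KEY"
    }
  }
]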

No credentials are required to read public S3 buckets; Hadoop will attempt to read them using any configured credentials. The S3A connector can also be configured to explicitly request anonymous access, in which case no credentials need to be supplied.
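
For example, anonymous access can be requested explicitly by setting the S3A credentials provider to org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider. A sketch, passing the property on the command line for a hypothetical public bucket:

# Read a public bucket without supplying any credentials:
hadoop fs -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider -ls s3a://some-public-bucket/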

Sample Dataset

If you are looking for data sets to experiment with, you can use the Landsat 8 data sets made available by AWS in a public Amazon S3 bucket called "landsat-pds". For more information, refer to Landsat on AWS.
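
For example, with credentials configured (or anonymous access, as sketched above), you can browse the bucket directly:

# List the contents of the public Landsat bucket:
hadoop fs -ls s3a://landsat-pds/
# The bucket also publishes a compressed scene list at its root:
hadoop fs -ls s3a://landsat-pds/scene_list.gz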

Referencing S3 in URLs

Regardless of which Hadoop ecosystem application you are using, you can access data stored in Amazon S3 using a URL that starts with the s3a:// prefix, followed by the bucket name and the path to a file or directory.

The URL structure is:

s3a://<bucket>/<dir>/<file>

For example, to access a file called "mytestfile" in a directory called "mytestdir", which is stored in a bucket called "mytestbucket", the URL is:

s3a://mytestbucket/mytestdir/mytestfile

The following FileSystem shell commands demonstrate access to a bucket named "mytestbucket":

hadoop fs -ls s3a://mytestbucket/

hadoop fs -mkdir s3a://mytestbucket/testDir

hadoop fs -put testFile s3a://mytestbucket/testFile

hadoop fs -cat s3a://mytestbucket/testFile
test file content

Configuring S3Guard (Technical Preview)

S3Guard is available in HDP 2.6.1 and later.

Amazon S3 is designed for eventual consistency. This means that after data has been written, updated, or deleted in S3 buckets, there is no guarantee that the changes will be visible right away: there may be a delay in the appearance of newly created objects, updated objects may appear in their previous state, and objects that have been deleted may still appear to exist. Listing files is also unreliable until a consistent state is reached.

This eventually consistent behavior of S3 can cause seemingly unpredictable query results, limiting the practical utility of the S3A connector for use cases where data is modified.

S3Guard mitigates the issues related to S3's eventual consistency model by using a DynamoDB table as a consistent metadata store. This guarantees a consistent view of data stored in S3. In addition, S3Guard improves query performance by reducing the number of times S3 needs to be contacted, which significantly reduces the split computation time of jobs in a Hadoop cluster.
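
Depending on your Hadoop build, the metadata store can also be created and managed manually with the hadoop s3guard command-line tool; whether this CLI is shipped in a given HDP 2.6 release is version-dependent, so treat the following as a sketch:

# Create the DynamoDB metadata store table manually (table name and region are examples):
hadoop s3guard init -meta dynamodb://my-table -region us-west-2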

To configure S3Guard, perform the following tasks:

  1. Create DynamoDB Access Policy in the IAM console on your AWS account.
  2. Configure S3Guard Using Custom Properties when creating a cluster.

Create DynamoDB Access Policy

First, you must provide read and write permissions for the DynamoDB table that S3Guard will create and use. To do this, find the S3Access IAM role that HDCloud created for you (or your own custom role, if you used one) and add a DynamoDB access policy to it using the following steps:

  1. Log in to your AWS account and navigate to the Identity and Access Management (IAM) console.
  2. In the IAM console, select Roles from the left pane.
  3. Search for the IAM role.

  4. Click on the role.

  5. In the Permissions tab, click Create Role Policy.

  6. Click Select next to the Policy Generator.

  7. Enter the following values:

    Parameter                     Value
    Effect                        Allow
    AWS Service                   Amazon DynamoDB
    Actions                       All Actions
    Amazon Resource Name (ARN)    *


  8. Click Add Statement.

  9. Click Next Step.
  10. On the "Review Policy" page, review your new policy and then click Apply Policy.

Now the policy will be attached to your IAM role and your cluster will be able to talk to DynamoDB, including creating a table for S3 metadata when S3Guard is configured.
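
For reference, the policy generated by the settings above should be roughly equivalent to the following JSON (a sketch; the Policy Generator may add a statement ID and format it differently):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:*"],
      "Resource": "*"
    }
  ]
}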

Configure S3Guard Using Custom Properties

You can configure S3Guard when creating a cluster by setting the S3Guard configuration parameters in the GENERAL CONFIGURATION > Custom Properties section.

Set these configuration parameters for each bucket that you want to "guard". To configure S3Guard for a specific bucket, replace the fs.s3a. prefix with fs.s3a.bucket.<bucketname>., where <bucketname> is the name of your bucket. For example, for a bucket named "mytestbucket", fs.s3a.metadatastore.impl becomes fs.s3a.bucket.mytestbucket.metadatastore.impl.

The configuration parameters that you must set to enable S3Guard are:

fs.s3a.metadatastore.impl
    Default value: org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore
    Set this to "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore" to use DynamoDB as the metadata store.

fs.s3a.s3guard.ddb.table.create
    Default value: false
    Set this to "true" to automatically create the DynamoDB table.

fs.s3a.s3guard.ddb.table
    Default value: (empty)
    Enter a name for the table that will be created in DynamoDB for S3Guard. If you leave this blank while setting fs.s3a.s3guard.ddb.table.create to "true", a separate DynamoDB table will be created for each accessed bucket, with each S3 bucket name used as its DynamoDB table name. This may incur additional costs.

fs.s3a.s3guard.ddb.region
    Default value: (empty)
    Set this to one of the AWS region identifiers (the value in the "Region" column of the AWS regions documentation). If you leave this blank, the region in which the S3 bucket resides is used.

fs.s3a.s3guard.ddb.table.capacity.read
    Default value: 500
    Specify the provisioned read capacity for the DynamoDB table, or use the default.

fs.s3a.s3guard.ddb.table.capacity.write
    Default value: 100
    Specify the provisioned write capacity for the DynamoDB table, or use the default.

The last two parameters are optional. You can monitor the DynamoDB workload in the DynamoDB console in the AWS portal and adjust the read and write capacities on the fly based on workload requirements.

Example

Adding the following custom properties will create a DynamoDB table called "my-table" in the "us-west-2" region (where the "test" bucket is located). The configuration applies only to the bucket called "test", so "my-table" will only be used for storing metadata related to this bucket.

[
  {
    "core-site" : {
      "fs.s3a.bucket.test.metadatastore.impl" : "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore",
      "fs.s3a.bucket.test.s3guard.ddb.table" : "my-table",
      "fs.s3a.bucket.test.s3guard.ddb.table.create" : true,
      "fs.s3a.bucket.test.s3guard.ddb.region" : "us-west-2"
    }
  }
]
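
Once the cluster is running, you can sanity-check the setup: any S3A operation against the guarded bucket should create the table (assuming the "test" bucket is accessible to the cluster's role):

# Trigger S3Guard table creation (bucket and table names from the example above):
hadoop fs -ls s3a://test/
# Then confirm in the AWS DynamoDB console that a table named "my-table" exists in us-west-2.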

More Documentation

For more information about Amazon S3, refer to the Cloud Data Access documentation.
That guide covers configuring, securing, tuning performance, and troubleshooting access to Amazon S3, as well as using Amazon S3 with Hive and Spark and performing related tasks, such as copying data with DistCp.