Cloud Data Access

Chapter 3. Getting Started with Amazon S3

The following table provides an overview of tasks related to configuring and using HDP with S3. Click on the linked topics to get more information about specific tasks.

Note

If you are looking for data sets to experiment with, you can use the Landsat 8 data sets that AWS makes available in a public Amazon S3 bucket called "landsat-pds". For more information, refer to Landsat on AWS.

Task | Description
Meet the prerequisites

To use S3 storage, you must have:

  1. An AWS account.

  2. One or more S3 buckets in your AWS account. For instructions on how to create a bucket on S3, refer to the AWS documentation.

Configure authentication

In order for Hadoop applications to access data stored in your private S3 buckets, you must configure authentication with your Amazon S3 account.
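
As a minimal sketch of what such a configuration can look like, the snippet below renders a core-site.xml fragment for access-key authentication. It assumes the S3A connector properties `fs.s3a.access.key` and `fs.s3a.secret.key` from Apache Hadoop, which this page does not name; the helper `s3a_auth_fragment` and the placeholder credentials are hypothetical.

```python
import xml.etree.ElementTree as ET


def s3a_auth_fragment(access_key, secret_key):
    """Render S3A credential properties as a Hadoop <configuration> fragment.

    Hypothetical helper for illustration; the property names are assumed
    from the Apache Hadoop S3A connector.
    """
    config = ET.Element("configuration")
    for name, value in (
        ("fs.s3a.access.key", access_key),
        ("fs.s3a.secret.key", secret_key),
    ):
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(config, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values only; never commit real credentials.
    print(s3a_auth_fragment("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"))
```

Storing secrets in plain-text configuration is only one option and may not suit a production cluster; the linked authentication topic covers the methods that HDP supports.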

Configure optional features

You can optionally configure additional features, such as bucket-specific settings and S3Guard (Technical Preview), which mitigates the side effects of S3 eventual consistency.
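
To illustrate bucket-specific settings, the sketch below derives a per-bucket property name from a generic one. It assumes the S3A naming pattern `fs.s3a.bucket.<bucket>.<option>`, which overrides `fs.s3a.<option>` for a single bucket; the pattern comes from Apache Hadoop rather than from this page, and the helper name is hypothetical.

```python
def per_bucket_property(bucket, base_property):
    """Map a generic fs.s3a.* property name to its per-bucket form.

    Hypothetical helper; assumes the S3A per-bucket naming pattern
    fs.s3a.bucket.<bucket>.<option>.
    """
    prefix = "fs.s3a."
    if not base_property.startswith(prefix):
        raise ValueError(f"expected a property starting with {prefix!r}")
    option = base_property[len(prefix):]
    return f"{prefix}bucket.{bucket}.{option}"


if __name__ == "__main__":
    # Example: an endpoint override that applies to one bucket only.
    print(per_bucket_property("landsat-pds", "fs.s3a.endpoint"))
```

With this pattern, one cluster can use different endpoints or credentials for different buckets without changing the global defaults.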

Work with S3 data

Once you've configured authentication with your S3 bucket(s), you can access S3 data from Hive (via external tables) and Spark, and perform related tasks such as copying data between HDFS and S3 when needed.
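
The two tasks named above can be sketched as follows: a Hive external table whose location is an S3 path, and a DistCp copy from HDFS to S3. The bucket `example-bucket`, the table and column names, and both helper functions are hypothetical; the `s3a://` URI scheme is assumed from the S3A connector.

```python
import shlex


def hive_external_table_ddl(table, columns, s3a_location):
    """Build a Hive CREATE EXTERNAL TABLE statement that points at S3."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        f"LOCATION '{s3a_location}'"
    )


def distcp_command(source, destination):
    """Build a `hadoop distcp` invocation as an argument list."""
    return ["hadoop", "distcp", source, destination]


if __name__ == "__main__":
    # Hypothetical table and bucket names, for illustration only.
    print(hive_external_table_ddl(
        "scenes",
        [("entity_id", "STRING"), ("cloud_cover", "DOUBLE")],
        "s3a://example-bucket/scenes/",
    ))
    print(shlex.join(distcp_command(
        "hdfs:///data/scenes", "s3a://example-bucket/scenes/",
    )))
```

The functions only build strings; running the statement in Hive or the command on a cluster node is left to the reader.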

Configure server-side encryption

You can optionally work with S3 data that is protected with server-side encryption: SSE-S3, SSE-KMS, or SSE-C.
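
As a rough sketch, the three encryption modes could map to S3A settings as shown below. The property names `fs.s3a.server-side-encryption-algorithm` and `fs.s3a.server-side-encryption.key`, and the algorithm values, are assumptions taken from the Apache Hadoop S3A connector rather than from this page; verify them against the documentation for your HDP version. The helper `sse_properties` is hypothetical.

```python
# Assumed mapping from the encryption modes named above to the value of
# fs.s3a.server-side-encryption-algorithm (not stated on this page).
SSE_ALGORITHMS = {"SSE-S3": "AES256", "SSE-KMS": "SSE-KMS", "SSE-C": "SSE-C"}


def sse_properties(mode, key=None):
    """Return the S3A properties for one server-side encryption mode.

    `key` is a KMS key ARN for SSE-KMS (optional) or a base64-encoded
    customer key for SSE-C (required). Hypothetical helper.
    """
    if mode not in SSE_ALGORITHMS:
        raise ValueError(f"unknown encryption mode: {mode}")
    if mode == "SSE-C" and key is None:
        raise ValueError("SSE-C requires a customer-provided key")
    props = {"fs.s3a.server-side-encryption-algorithm": SSE_ALGORITHMS[mode]}
    # SSE-S3 keys are managed by S3, so no key property is emitted for it.
    if key is not None and mode != "SSE-S3":
        props["fs.s3a.server-side-encryption.key"] = key
    return props


if __name__ == "__main__":
    print(sse_properties("SSE-S3"))
```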

Improve performance

You can optionally configure and fine-tune performance-related features to optimize HDP performance for specific tasks, including accessing S3 data from Hive and Spark and copying data with DistCp.
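
One way such tuning can be applied to a Spark job is to pass S3A properties through Spark's `spark.hadoop.` configuration prefix. The sketch below assumes the S3A property `fs.s3a.experimental.input.fadvise` with the value `random` as an example setting; neither appears on this page, the helper name is hypothetical, and whether the setting helps depends on the workload.

```python
def spark_conf_args(hadoop_properties):
    """Turn Hadoop properties into spark-submit --conf arguments.

    Each property is prefixed with "spark.hadoop." so that Spark copies
    it into the Hadoop configuration of the job. Hypothetical helper.
    """
    args = []
    for name, value in sorted(hadoop_properties.items()):
        args += ["--conf", f"spark.hadoop.{name}={value}"]
    return args


if __name__ == "__main__":
    # Assumed example setting; check it against your HDP version.
    tuning = {"fs.s3a.experimental.input.fadvise": "random"}
    print(" ".join(["spark-submit"] + spark_conf_args(tuning)))
```
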

Troubleshoot

Refer to this section if you experience issues while configuring or using S3 with HDP.