Cloud Data Access

Configuring S3Guard (Technical Preview)

Amazon S3 is designed for eventual consistency. This means that after data has been written to, updated in, or deleted from an S3 bucket, there is no guarantee that the changes will be visible right away: newly created objects may be slow to appear, updated objects may appear in their previous state, and deleted objects may still appear to exist. File listings are also unreliable until a consistent state is reached.

This may affect the following operations on S3 data:

  • When listing files, newly created objects may not be listed immediately and deleted objects may continue to be listed, which means that your input for data processing may be incorrect. In Hive, Spark, or MapReduce, this could lead to erroneous results. In the worst case, it could lead to data loss when data is moved.

  • During an ETL workflow, in a sequence of multiple jobs that form the workflow, the next job is launched soon after the previous job has been completed. Applications such as Oozie rely on marker files to trigger the subsequent workflows. Any delay in the visibility of these files can lead to delays in the subsequent workflows.

  • During existence-guarded path operations, if a deleted file with the same name as the target path still appears in a listing, some actions may fail unexpectedly because the target path seems to be present, even though the file has already been deleted.

This eventually consistent behavior of S3 can cause seemingly unpredictable results from queries made against it, limiting the practical utility of the S3A connector for use cases where data gets modified.

S3Guard mitigates the issues related to the eventual consistency model by using a table on a DynamoDB instance as a consistent metadata store. This guarantees a consistent view of data stored in S3. In addition, S3Guard improves query performance by reducing the number of times S3 needs to be contacted, which significantly reduces the split computation time of a job in a Hadoop cluster.
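The idea of pairing an eventually consistent store with a consistent metadata table can be illustrated with a small, self-contained Python sketch. This is a toy model only, not the S3Guard implementation: the class names `EventuallyConsistentStore` and `GuardedStore`, the `settle()` method, and the in-memory dictionaries standing in for S3 and the DynamoDB table are all hypothetical and exist purely for illustration.

```python
class EventuallyConsistentStore:
    """Toy stand-in for S3 (hypothetical): listings lag behind writes and deletes."""

    def __init__(self):
        self._objects = {}     # key -> data: the true state of the store
        self._listed = set()   # keys that a listing currently reports

    def put(self, key, data):
        self._objects[key] = data        # the listing is not updated yet

    def delete(self, key):
        self._objects.pop(key, None)     # the listing may still show the key

    def list_keys(self):
        return set(self._listed)         # possibly stale view

    def settle(self):
        """Simulate the store reaching a consistent state."""
        self._listed = set(self._objects)


class GuardedStore:
    """Toy wrapper (hypothetical): records every change in a consistent table."""

    def __init__(self, store):
        self._store = store
        self._metadata = {}    # key -> True (exists) or False (deleted marker)

    def put(self, key, data):
        self._store.put(key, data)
        self._metadata[key] = True

    def delete(self, key):
        self._store.delete(key)
        self._metadata[key] = False

    def list_keys(self):
        # Start from the possibly stale listing, then correct it using
        # the consistent metadata table.
        keys = self._store.list_keys()
        for key, exists in self._metadata.items():
            if exists:
                keys.add(key)
            else:
                keys.discard(key)
        return keys


raw = EventuallyConsistentStore()
guarded = GuardedStore(raw)

guarded.put("data/part-0000", b"a")
raw.settle()                          # the first write becomes visible
guarded.put("data/_SUCCESS", b"")     # marker file, not yet listed by the store
guarded.delete("data/part-0000")      # deleted, but still listed by the store

stale = raw.list_keys()
fresh = guarded.list_keys()
print(sorted(stale))   # ['data/part-0000']
print(sorted(fresh))   # ['data/_SUCCESS']
```

In this sketch the raw listing still reports the deleted file and misses the new marker file, while the guarded listing reflects both changes immediately. Once the underlying store settles, the two views agree.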

To configure S3Guard, perform the following tasks:

  1. Create a DynamoDB access policy in the IAM console on your AWS account.

  2. Configure S3Guard in the Ambari web UI.