Accessing Cloud Data

Limitations of Amazon S3

Although Hadoop's S3A client can make an S3 bucket appear to be a Hadoop-compatible filesystem, S3 is still an object store and has some limitations when acting as one. The key things to be aware of are:

  • Operations on directories are potentially slow and non-atomic. A directory rename, for example, is mimicked by copying and then deleting each object under the source path, so it can be slow and can fail partway through.

  • Not all file operations are supported. In particular, some operations that Apache HBase depends on, such as append, are not available, so HBase cannot be run on top of Amazon S3.

  • Data is not visible in the object store until the entire output stream has been written and closed.

  • Amazon S3 is eventually consistent. Objects are replicated across servers for availability, but changes to a replica take time to propagate to the other replicas, and the object store is inconsistent during that window. The inconsistency surfaces when listing, reading, updating, or deleting files. To mitigate it, you can configure S3Guard, as shown in the sketch after this list. To learn more, refer to Using S3Guard for Consistent S3 Metadata.

  • Neither the per-file and per-directory permissions of HDFS nor its more sophisticated ACL mechanism is supported.

  • Bandwidth between your workload clusters and Amazon S3 is limited and can vary significantly depending on network and VM load.
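
As a minimal sketch of the S3Guard mitigation mentioned above, the following core-site.xml fragment enables the DynamoDB metadata store for S3A. The table name and region shown are placeholder values, not defaults; substitute your own:

  <property>
    <name>fs.s3a.metadatastore.impl</name>
    <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
  </property>
  <property>
    <!-- Example table name; S3Guard keeps its S3 metadata here -->
    <name>fs.s3a.s3guard.ddb.table</name>
    <value>my-s3guard-table</value>
  </property>
  <property>
    <!-- Create the DynamoDB table automatically if it does not exist -->
    <name>fs.s3a.s3guard.ddb.table.create</name>
    <value>true</value>
  </property>
  <property>
    <!-- Example region; use the region hosting your bucket -->
    <name>fs.s3a.s3guard.ddb.region</name>
    <value>us-west-2</value>
  </property>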

For these reasons, while Amazon S3 can be used as the source and store for persistent data, it cannot be used as a direct replacement for a cluster-wide filesystem such as HDFS, nor can it be set as the cluster's default filesystem (fs.defaultFS).
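
For illustration, a minimal sketch of this arrangement, assuming a conventional NameNode address, keeps fs.defaultFS pointed at HDFS in core-site.xml:

  <property>
    <!-- Example NameNode host and port; HDFS remains the default filesystem -->
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>

Jobs then address S3 data with explicit s3a:// URIs, for example copying a directory to an example bucket:

  hadoop distcp hdfs://namenode.example.com:8020/source s3a://mybucket/backup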