Accessing Cloud Data

Using DistCp with S3

When using DistCp with data in S3, consider the following limitations:

  • The -append option is not supported.

  • The -diff option is not supported.

  • The -atomic option causes a rename of the temporary data, which significantly increases the time to commit work at the end of the operation. Furthermore, because S3A does not offer atomic renames of directories, the -atomic option does not actually deliver the atomicity it promises. Avoid using this option.

  • All -p options, including those to preserve permissions, user and group information, attributes, checksums, and replication, are ignored.

  • CRC checking between HDFS and S3 will not be performed. We still recommend using the -skipcrccheck option, both to make clear that this is taking place and so that, if etag checksums are enabled on S3A through the property fs.s3a.etag.checksum.enabled, DistCp between HDFS and S3 will not trigger checksum-mismatch errors.
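
The limitations above can be turned into a small pre-flight check that runs before a DistCp job is submitted. The following is a minimal sketch, not part of Hadoop or DistCp: the function name check_distcp_args and the example paths are hypothetical, and it assumes that preserve flags are written as -p followed by optional letters and that paths carry explicit hdfs:// and s3a:// schemes.

```python
# Hypothetical pre-flight check; not part of Hadoop or DistCp.
# It only inspects the argument list and never runs a copy.

NOT_SUPPORTED = ("-append", "-diff")


def check_distcp_args(args):
    """Return warnings for a DistCp argument list that touches s3a:// paths."""
    # Apply the S3 rules only when at least one path uses the s3a scheme.
    if not any(arg.startswith("s3a://") for arg in args):
        return []

    warnings = []
    for arg in args:
        if arg in NOT_SUPPORTED:
            warnings.append(f"{arg}: not supported with S3")
        elif arg == "-atomic":
            warnings.append("-atomic: slow commit, no atomic directory rename on S3A")
        elif arg.startswith("-p"):
            # Assumption: preserve flags look like -p, -pugp, and so on.
            warnings.append(f"{arg}: preserve options are ignored with S3")

    # Suggest -skipcrccheck only for copies that also involve HDFS.
    involves_hdfs = any(arg.startswith("hdfs://") for arg in args)
    if involves_hdfs and "-skipcrccheck" not in args:
        warnings.append("-skipcrccheck: recommended between HDFS and S3")
    return warnings


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    example = ["-update", "-atomic", "-pugp",
               "hdfs://namenode/data", "s3a://example-bucket/data"]
    for line in check_distcp_args(example):
        print(line)
```

An argument list that produces no warnings here would contain none of the flagged options and would include -skipcrccheck, for example -update -skipcrccheck followed by the source and destination paths.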