Accessing Cloud Data
Also available as:
PDF
loading table of contents...

Limitations of the S3A Committers

Custom file output formats and their committers

Output formats which implement their own committers do not automatically switch to the new committers. If such a custom committer relies on renaming files to commit output, then it will depend on S3Guard for a consistent view of the object store, and take time to commit output proportional to the amount of data and the number of files.

To determine if this is the case, find the subclass of org.apache.hadoop.mapreduce.lib.output.FileOutputFormat which implements the custom format, to see if it subclasses thegetOutputCommitter() to return its own committer, or has a custom subclass oforg.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.

It may be possible to migrate such a committer to support store-specific committers, as was done for Apache Parquet support in Spark. Here a subclass of Parquet's ParquetOutputCommitter was implemented to delegates all operations to the real committer.

MapReduce V1 API Output Format and Committers

Only the MapReduce V2 APIs underorg.apache.hadoop.mapreduce support the new committer binding mechanism. The V1 APIs under the org.apache.hadoop.mapred package only bind to the file committer and subclasses. The v1 APIs date from Hadoop 1.0 and should be considered obsolete. Please migrate to the v2 APIs, not just for the new committers, but because the V2 APIs are still being actively developed and maintained.

No Hive Support

There is currently no Hive support for the S3A committers. To safely use S3 as a destination of Hive work, you must use S3Guard.