The problem:
While a Hadoop job is writing output, each task writes to its own temporary directory:
Task1 -> /unique/temp/directory/task1/file.tmp
Task2 -> /unique/temp/directory/task2/file.tmp
When a task finishes execution, it moves (commits) its temporary file to the final location.
This scheme is what makes it possible for Hadoop to support the speculative execution feature.
Moving the task output to its final destination (the commit) involves a rename operation. On a normal filesystem, a rename is just a pointer change in the filesystem metadata.
S3, however, is an object store, not a filesystem, so rename operations are more costly: the client emulates each rename with a copy (PUT) followed by a DELETE.
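To make the cost concrete, here is a minimal sketch of what the commit boils down to through Hadoop's generic FileSystem API (the bucket and paths are hypothetical):

// Sketch: the commit step is essentially a rename. On HDFS this is a
// metadata-only pointer change in the NameNode; the S3 connectors have no
// real rename, so the client copies the whole object and then deletes the
// source.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitRenameSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical bucket and paths, mirroring the example above.
        Path tmp = new Path("s3n://my-bucket/unique/temp/directory/task1/file.tmp");
        Path dst = new Path("s3n://my-bucket/output/part-00000");

        FileSystem fs = tmp.getFileSystem(conf);
        // Cheap on HDFS; on S3 this triggers a full copy (PUT) + DELETE.
        boolean committed = fs.rename(tmp, dst);
        System.out.println("Committed: " + committed);
    }
}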
The solution:
In MapReduce (this behavior can differ for other applications), we can avoid these expensive operations by setting the “mapred.output.committer.class” property in mapred-site.xml to “org.apache.hadoop.mapred.DirectFileOutputCommitter”, so that each task writes its output directly to its final destination:
<property>
  <name>mapred.output.committer.class</name>
  <value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
</property>
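The same setting can also be applied per job instead of cluster-wide. A minimal sketch, assuming the EMR-provided DirectFileOutputCommitter class is available on the classpath:

import org.apache.hadoop.mapred.JobConf;

public class DirectCommitJobSetup {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Same effect as the mapred-site.xml entry above, scoped to this job:
        // tasks write output straight to the final destination, skipping the
        // temporary-directory rename.
        conf.set("mapred.output.committer.class",
                 "org.apache.hadoop.mapred.DirectFileOutputCommitter");
    }
}

Keep the trade-off in mind: because tasks write directly to the final location, the output loses the isolation the temporary-directory scheme provides, so this committer does not play well with speculative execution or task retries.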
For this and other useful considerations on parallel processing with S3, please have a look here:
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-errors-io.html#emr-troubleshoot-errors-io-1
https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html