S3 and Parallel Processing – DirectFileOutputCommitter


The problem:

While a Hadoop Job is writing output, it will write to a temporary directory:
Task1 -> /unique/temp/directory/task1/file.tmp
Task2 -> /unique/temp/directory/task2/file.tmp

When a task finishes execution, its output is moved (committed) from the temporary file to its final location.
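
Conceptually, the commit step is just a filesystem rename. Here is a minimal sketch of that step using the Hadoop FileSystem API; the paths are placeholders mirroring the example above, not the committer's real internal layout:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical paths: the task wrote to a unique temporary directory...
    Path tmp = new Path("/unique/temp/directory/task1/file.tmp");
    // ...and the commit renames it to the job's final output location.
    Path dst = new Path("/final/output/part-00000");

    // On HDFS this is a cheap metadata-only operation: no bytes are copied.
    fs.rename(tmp, dst);
  }
}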

This scheme is what makes Hadoop's speculative execution feature possible: duplicate attempts of the same task each write to their own temporary directory, and only one attempt's output is committed.

Moving the task output to its final destination (the commit) involves a rename operation. On a normal filesystem, a rename is just a pointer change in the filesystem metadata.

Now, as S3 is an object store and not a filesystem, rename operations are far more costly: each one is implemented as a copy (PUT) followed by a delete, so the cost grows with the size of the data.
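
To make that cost concrete, here is roughly what a “rename” amounts to against S3 using the AWS SDK for Java; the bucket and key names are placeholders:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3RenameSketch {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Hypothetical bucket and keys for illustration.
    String bucket = "my-bucket";
    String src = "unique/temp/directory/task1/file.tmp";
    String dst = "final/output/part-00000";

    // S3 has no rename primitive: the object's bytes are copied
    // server-side (a PUT)...
    s3.copyObject(bucket, src, bucket, dst);
    // ...and then the original object is deleted.
    s3.deleteObject(bucket, src);
  }
}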

The solution:

In MapReduce (this behavior can differ for other applications), we can avoid these expensive operations by setting the “mapred.output.committer.class” property in mapred-site.xml to “org.apache.hadoop.mapred.DirectFileOutputCommitter”, so that each task writes its output directly to its final destination.

<property>
  <name>mapred.output.committer.class</name>
  <value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
</property>
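
The same property can also be set per job. Here is a sketch using the old mapred API; note that DirectFileOutputCommitter ships with EMR, not with vanilla Apache Hadoop, despite its package name, so this assumes an EMR classpath:

import org.apache.hadoop.mapred.JobConf;

public class DirectCommitJob {
  public static void main(String[] args) {
    JobConf conf = new JobConf(DirectCommitJob.class);

    // Equivalent to the mapred-site.xml snippet above.
    conf.set("mapred.output.committer.class",
        "org.apache.hadoop.mapred.DirectFileOutputCommitter");

    // ... configure input/output paths, mapper and reducer, then submit.
  }
}

One caveat: because tasks write straight to the final destination, failed or speculative task attempts can leave partial or duplicate output behind, so it is usually advisable to disable speculative execution (set mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution to false) when using a direct committer.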

For this and other useful S3 considerations for parallel processing, have a look at:

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-errors-io.html#emr-troubleshoot-errors-io-1
https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
