S3 and Parallel Processing – DirectFileOutputCommitter


The problem:

While a Hadoop job is writing its output, each task writes to its own temporary directory:
Task1 –> /unique/temp/directory/task1/file.tmp
Task2 –> /unique/temp/directory/task2/file.tmp

When a task finishes its execution, it moves (commits) the temporary file to its final location.

This scheme is what makes Hadoop's speculative execution feature possible.

Moving the task output to its final destination (the commit) involves a rename operation. On a normal filesystem, a rename is just a pointer change in the filesystem metadata.

Now, as S3 is not a filesystem, rename operations are much more costly: each one involves a copy (PUT) plus a DELETE.
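To make that cost concrete, here is a minimal sketch of what a "rename" on S3 boils down to, written in Scala against the AWS SDK for Java; the bucket and key names are just hypothetical placeholders:

import com.amazonaws.services.s3.AmazonS3ClientBuilder

// On S3 a "rename" is really a full object copy followed by a delete.
val s3 = AmazonS3ClientBuilder.defaultClient()
val bucket   = "mybucket"                                // hypothetical bucket
val tempKey  = "unique/temp/directory/task1/file.tmp"    // task's temporary output
val finalKey = "final/destination/file"                  // committed location

s3.copyObject(bucket, tempKey, bucket, finalKey)  // PUT: copies all of the data again
s3.deleteObject(bucket, tempKey)                  // DELETE: removes the temporary object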

The solution:

In MapReduce (this behavior can differ for other applications), we can avoid these expensive operations by setting the “mapred.output.committer.class” property in the mapred-site.xml file to “org.apache.hadoop.mapred.DirectFileOutputCommitter”, so that each task writes its output directly to its final destination.

<property>
  <name>mapred.output.committer.class</name>
  <value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
</property>
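If you prefer not to touch mapred-site.xml cluster-wide, the same property can be set per job. A minimal sketch in Scala using the old mapred API, assuming you are on an EMR cluster where the DirectFileOutputCommitter class is available on the classpath:

import org.apache.hadoop.mapred.JobConf

// Override the output committer for this job only, instead of cluster-wide.
val conf = new JobConf()
conf.set("mapred.output.committer.class",
         "org.apache.hadoop.mapred.DirectFileOutputCommitter")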

For this and other useful S3 parallel-processing considerations, have a look at:

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-errors-io.html#emr-troubleshoot-errors-io-1
https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

Copy Data with Hive and Spark


These are two examples of how to copy data from one S3 location to another. The same operation can be done from S3 to HDFS and vice versa.

I’m assuming that you are able to launch the Hive client or the spark-shell client.

Hive:

Using the MapReduce engine or the Tez engine:

set hive.execution.engine=mr; 

or

set hive.execution.engine=tez; 
-- External table pointing at the existing CSV data in S3
CREATE EXTERNAL TABLE source_table(a_col string, b_col string, c_col string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/hive/csv/';

-- Table backed by the destination S3 location
CREATE TABLE destination_table(a_col string, b_col string, c_col string)
LOCATION 's3://mybucket/output-hive/csv_1/';

-- Copy the data: every row read from the source table is written to the destination
INSERT OVERWRITE TABLE destination_table SELECT * FROM source_table;
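As a side note, once those tables exist in the Hive metastore, the same copy can also be run from spark-shell through Spark SQL. A minimal sketch, assuming a Spark 2.x spark-shell session with Hive support enabled (on Spark 1.x, use sqlContext.sql instead of spark.sql):

// Runs the same INSERT OVERWRITE against the Hive tables created above.
spark.sql("INSERT OVERWRITE TABLE destination_table SELECT * FROM source_table")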

Spark:

sc.textFile("s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/segments/1387346051826/warc/").saveAsTextFile("s3://mybucket/spark/bigfiles")

 

If you want to copy data to HDFS, you can also explore s3-dist-cp:

S3DistCp:

s3-dist-cp --src s3://mybucket/hive/csv/ --dest=hdfs:///output-hive/csv_10/

 

Buñuelos Valencianos (pumpkin fritters)


Ingredients:

  • 1 medium pumpkin (approx. 800 g)
  • 500 g flour
  • 100 g fresh yeast
  • 1/2 glass of soda water
  • Water
  • Oil for frying (sunflower/corn/olive)

Steps:

Peel the pumpkin, remove the seeds, and boil it until you can mash it into a smooth purée. Reserve half of the water in which the pumpkin was boiled.

Mix the flour with the yeast (if the yeast is dried, first dissolve it in warm water with a tablespoon of sugar and let it ferment for about 10 minutes), then add the pumpkin purée and the reserved cooking water. About 3 cups should be enough to reach the right consistency. Add the soda water. Knead by hand until you get a soft, smooth dough.

(Photo: the dough at the right consistency)

 

Let the dough rest for about 20 minutes so it doubles in size. The buñuelos should have a small hole in the middle, which you can make simply by pressing your thumb into the centre of the dough.
Fry the buñuelos a few at a time in a pan of hot oil until golden. Keep the oil temperature moderate so they don't end up raw inside!

(Photo: tasting the buñuelos)