Oozie 4.2 now supports spark-action.
Example job.properties file (configuration tested on EMR 4.2.0):
nameNode=hdfs://172.31.25.17:8020
jobTracker=172.31.25.17:8032
master=local[*]
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark
(Use the master node's internal IP address, not localhost, for the nameNode and jobTracker values.)
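If you script your cluster setup, the file above can be generated with a heredoc. Note the escaped `\${...}` occurrences, which keep the Oozie EL variables literal instead of letting the shell expand them (MASTER_IP is a placeholder; substitute your EMR master node's internal IP):

```shell
# Write job.properties; MASTER_IP is a placeholder -- substitute your
# EMR master node's internal IP. The \${...} escapes keep Oozie EL
# variables literal rather than being expanded by the shell.
MASTER_IP=172.31.25.17
cat > job.properties <<EOF
nameNode=hdfs://${MASTER_IP}:8020
jobTracker=${MASTER_IP}:8032
master=local[*]
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=\${nameNode}/user/\${user.name}/\${examplesRoot}/apps/spark
EOF
cat job.properties
```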
Validate the Oozie workflow XML file:
oozie validate workflow.xml
Example workflow.xml file:
<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
    <start to='spark-node' />
    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="/user/${wf:user()}/${examplesRoot}/output-data/spark"/>
            </prepare>
            <master>${master}</master>
            <name>Spark-FileCopy</name>
            <class>org.apache.oozie.example.SparkFileCopy</class>
            <jar>/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
            <arg>/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
            <arg>/user/${wf:user()}/${examplesRoot}/output-data/spark</arg>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>
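The spark-action:0.1 schema also accepts an optional <spark-opts> element for passing flags through to spark-submit (see the Oozie Spark action documentation referenced below). A sketch of where it fits; the option values here are examples only, not tested settings:

<spark xmlns="uri:oozie:spark-action:0.1">
    ...
    <jar>...</jar>
    <spark-opts>--executor-memory 1G --driver-memory 512M</spark-opts>
    <arg>...</arg>
</spark>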
Create the defined structure in HDFS and copy the proper files:
hadoop fs -mkdir -p /user/hadoop/examples/apps/spark/lib
hadoop fs -put workflow.xml /user/hadoop/examples/apps/spark/
hadoop fs -put /usr/share/doc/oozie-4.2.0/examples/apps/spark/lib/oozie-examples.jar /user/hadoop/examples/apps/spark/lib
hadoop fs -mkdir -p /user/hadoop/examples/input-data/text
hadoop fs -mkdir -p /user/hadoop/examples/output-data/spark
hadoop fs -put /usr/share/doc/oozie-4.2.0/examples/input-data/text/data.txt /user/hadoop/examples/input-data/text/

Verify the application directory:

$ hadoop fs -ls /user/hadoop/examples/apps/spark/
Found 3 items
drwxr-xr-x   - hadoop hadoop       0 2015-12-18 08:13 /user/hadoop/examples/apps/spark/lib
-rw-r--r--   1 hadoop hadoop    1920 2015-12-18 08:08 /user/hadoop/examples/apps/spark/workflow.xml
Run your Oozie job:
oozie job -oozie http://localhost:11000/oozie -config ./job.properties -run
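A successful -run prints the new workflow id as `job: <id>`. If you are scripting, you can capture that id for the status checks below; a sketch, using a sample output string so the snippet is self-contained (in practice, assign OUT from the oozie command itself):

```shell
# In practice: OUT=$(oozie job -oozie http://localhost:11000/oozie -config ./job.properties -run)
# A sample of the command's output is used here so the snippet runs standalone.
OUT='job: 0000004-151203092421374-oozie-oozi-W'
JOB_ID=${OUT#job: }   # strip the "job: " prefix
echo "$JOB_ID"
```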
Check the Oozie job status:
oozie job -info 0000004-151203092421374-oozie-oozi-W
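The -info output includes a `Status` line (e.g. RUNNING, SUCCEEDED, KILLED), which is the field you usually want when scripting. A sketch for extracting it, shown on a sample fragment of the output so the snippet is self-contained; in practice, pipe the real command through the same awk filter:

```shell
# Extract the Status field from `oozie job -info` output.
# A sample output fragment is used here; in practice:
#   oozie job -info "$JOB_ID" -oozie http://localhost:11000/oozie | awk ...
INFO='Job ID : 0000004-151203092421374-oozie-oozi-W
Workflow Name : SparkFileCopy
Status        : SUCCEEDED'
STATUS=$(printf '%s\n' "$INFO" | awk -F':' '/^Status/ {gsub(/ /, "", $2); print $2; exit}')
echo "$STATUS"
```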
Check available sharelib:
$ oozie admin -shareliblist -oozie http://localhost:11000/oozie
[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig
References:
https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html