Running Spark with Oozie


Oozie 4.2 adds support for the Spark action (spark-action).

Example job.properties file (configuration tested on EMR 4.2.0):

nameNode=hdfs://172.31.25.17:8020 
jobTracker=172.31.25.17:8032 
master=local[*] 
queueName=default 
examplesRoot=examples 
oozie.use.system.libpath=true 
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark

(Use the master node's internal IP instead of localhost for nameNode and jobTracker.)
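At submission time Oozie substitutes the ${...} placeholders in properties like oozie.wf.application.path with values from job.properties (plus the submitting user as user.name). A minimal sketch of that substitution, in plain Python for illustration only (this is not Oozie code):

```python
import re

# Properties as defined in the job.properties above
props = {
    "nameNode": "hdfs://172.31.25.17:8020",
    "jobTracker": "172.31.25.17:8032",
    "examplesRoot": "examples",
    "oozie.wf.application.path": "${nameNode}/user/${user.name}/${examplesRoot}/apps/spark",
}
props["user.name"] = "hadoop"  # normally supplied by Oozie from the submitting user

def resolve(value, props):
    """Replace each ${key} with its value from the property map."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: props[m.group(1)], value)

print(resolve(props["oozie.wf.application.path"], props))
# hdfs://172.31.25.17:8020/user/hadoop/examples/apps/spark
```

The resolved path is where Oozie expects to find workflow.xml, which is why the HDFS layout created below must match it.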

Validate the Oozie workflow XML file:

oozie validate workflow.xml

Example workflow.xml file:

<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
    <start to='spark-node'/>
    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="/user/${wf:user()}/${examplesRoot}/output-data/spark"/>
            </prepare>
            <master>${master}</master>
            <name>Spark-FileCopy</name>
            <class>org.apache.oozie.example.SparkFileCopy</class>
            <jar>/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
            <arg>/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
            <arg>/user/${wf:user()}/${examplesRoot}/output-data/spark</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end'/>
</workflow-app>
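`oozie validate` checks the workflow against its XML schema. Before uploading, a quick local sanity check of well-formedness and the spark-action's mandatory elements (master, name, jar) can be sketched with Python's standard library; this is only an illustration, not a replacement for the Oozie validator:

```python
import xml.etree.ElementTree as ET

# A trimmed-down copy of the workflow above, inlined for the example
workflow = """<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
  <start to='spark-node'/>
  <action name='spark-node'>
    <spark xmlns='uri:oozie:spark-action:0.1'>
      <master>local[*]</master>
      <name>Spark-FileCopy</name>
      <jar>/user/hadoop/examples/apps/spark/lib/oozie-examples.jar</jar>
    </spark>
    <ok to='end'/>
    <error to='fail'/>
  </action>
  <kill name='fail'><message>failed</message></kill>
  <end name='end'/>
</workflow-app>"""

root = ET.fromstring(workflow)  # raises ParseError if not well-formed
ns = {"s": "uri:oozie:spark-action:0.1"}
spark = root.find(".//s:spark", ns)
# master, name and jar are mandatory in a spark-action
for tag in ("master", "name", "jar"):
    assert spark.find(f"s:{tag}", ns) is not None, f"missing <{tag}>"
print("workflow is well-formed and the spark-action has its required elements")
```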

Create the directory structure in HDFS and copy the required files:

hadoop fs -mkdir -p /user/hadoop/examples/apps/spark/lib
hadoop fs -mkdir -p /user/hadoop/examples/input-data/text
hadoop fs -mkdir -p /user/hadoop/examples/output-data/spark

hadoop fs -put workflow.xml /user/hadoop/examples/apps/spark/

hadoop fs -put /usr/share/doc/oozie-4.2.0/examples/apps/spark/lib/oozie-examples.jar /user/hadoop/examples/apps/spark/lib

hadoop fs -put /usr/share/doc/oozie-4.2.0/examples/input-data/text/data.txt /user/hadoop/examples/input-data/text/

Verify the layout:

hadoop fs -ls /user/hadoop/examples/apps/spark/
Found 2 items
drwxr-xr-x - hadoop hadoop 0 2015-12-18 08:13 /user/hadoop/examples/apps/spark/lib
-rw-r--r-- 1 hadoop hadoop 1920 2015-12-18 08:08 /user/hadoop/examples/apps/spark/workflow.xml

Run the Oozie job:

oozie job -oozie http://localhost:11000/oozie -config ./job.properties -run

Check the job status:

oozie job -info 0000004-151203092421374-oozie-oozi-W
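The same job information is also available over Oozie's web services API at /oozie/v2/job/&lt;job-id&gt;?show=info, which returns JSON. A hedged sketch (host, port and job id are taken from the examples above; the live HTTP call is left commented out):

```python
import json
from urllib.request import urlopen

OOZIE_URL = "http://localhost:11000/oozie"
job_id = "0000004-151203092421374-oozie-oozi-W"

def job_info_url(base, job_id):
    """Build the v2 web-services URL for a job's info."""
    return f"{base}/v2/job/{job_id}?show=info"

url = job_info_url(OOZIE_URL, job_id)
print(url)

# On a live cluster this returns the job status as JSON, e.g.:
# info = json.load(urlopen(url))
# print(info["status"])   # RUNNING / SUCCEEDED / KILLED / ...
```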

Check the available sharelibs (spark must be listed):

$ oozie admin -shareliblist -oozie http://localhost:11000/oozie
[Available ShareLib] 
oozie 
hive 
distcp 
hcatalog 
sqoop 
mapreduce-streaming 
spark 
hive2 
pig


References:

https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html

