YARN: change configuration and restart the NodeManager on a live cluster


This procedure changes the YARN configuration on a live cluster, propagates the changes to all the nodes, and restarts the YARN NodeManager.

Both commands list all the nodes in the cluster and then filter the DNS names to execute a remote command via SSH. You can customize the sed filter to suit your own needs; the one below matches DNS names in the Elastic MapReduce format (ip-xx-xx-xx-xx.eu-west-1.compute.internal).
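For instance, here is what the filter does to one (made-up) line of yarn node -list output:

echo "ip-10-0-0-1.eu-west-1.compute.internal:9103 RUNNING ip-10-0-0-1.eu-west-1.compute.internal:9035 2" | sed -n "s/^\(ip[^:]*\):.*/\1/p"
# prints: ip-10-0-0-1.eu-west-1.compute.internal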

1. Upload the private key (.pem) file you use to access the master node to the cluster. Restrict the private key permissions to 600 (i.e. chmod 600 MyKeyName.pem).
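One hypothetical way to do this from your local machine (the master node DNS name below is a placeholder):

scp -i ~/MyKeyName.pem ~/MyKeyName.pem hadoop@ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com:~/
ssh -i ~/MyKeyName.pem hadoop@ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com "chmod 600 ~/MyKeyName.pem"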

2. Edit ~/conf/yarn-site.xml with your configuration change.
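For illustration, a hypothetical change to the NodeManager memory allocation could look like this (the property name is real; the value is only an example):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>12288</value>
</property>

Then use a command like the following to propagate the change across the cluster: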

yarn node -list | sed -n "s/^\(ip[^:]*\):.*/\1/p" | xargs -t -I{} -P10 scp -o StrictHostKeyChecking=no -i ~/MyKeyName.pem ~/conf/yarn-site.xml hadoop@{}:/home/hadoop/conf/

3. This command will restart the YARN NodeManager on all the nodes (the stop is issued on each node; on EMR the service nanny automatically brings the daemon back up).

yarn node -list | sed -n "s/^\(ip[^:]*\):.*/\1/p" | xargs -t -I{} -P10 ssh -o StrictHostKeyChecking=no -i ~/MyKeyName.pem hadoop@{} "yarn nodemanager stop"
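After the NodeManagers come back up, you can verify that all the nodes re-registered (each one should report a RUNNING state):

yarn node -list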

 

Hadoop 1 vs Hadoop 2 – How many slots do I have per node?


This is a topic that always raises discussion…

In Hadoop 1, the number of tasks launched per node was specified via the settings mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.

These settings are ignored in Hadoop 2.

In Hadoop 2 with YARN, we can determine how many concurrent tasks are launched per node by dividing the resources allocated to YARN by the resources requested by each MapReduce task, and taking the minimum of the two ratios (memory and CPU).

This approach is an improvement over that of Hadoop 1, because the administrator no longer has to bundle CPU and memory into a Hadoop-specific concept of a “slot”.

The number of tasks that will be spawned per node:

min(
    yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb
    ,
    yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores
    )
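
For example, with hypothetical values yarn.nodemanager.resource.memory-mb = 12288, mapreduce.map.memory.mb = 1536, yarn.nodemanager.resource.cpu-vcores = 8 and mapreduce.map.cpu.vcores = 1:

min(12288 / 1536, 8 / 1) = min(8, 8) = 8 concurrent map tasks per node

If mapreduce.map.memory.mb were raised to 3072, memory becomes the limiting resource: min(4, 8) = 4 tasks per node.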

The obtained value can then be set via the 'mapreduce.job.maps' property in the 'mapred-site.xml' file.

Of course, YARN is more dynamic than that, and each job can have unique resource requirements — so in a multitenant cluster with different types of jobs running, the calculation isn’t as straightforward.

More information:
http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/

Useful Hadoop commands


– Copy a file from/to S3 (copyToLocal/copyFromLocal):

$ bin/hadoop fs -copyToLocal s3://my-bucket/myfile.rb /home/hadoop/myfile.rb
$ bin/hadoop fs -copyFromLocal job5.avro s3://my-bucket/input

– Merge all the files from one folder into one single file. s3distcp concatenates files whose groupBy capturing group matches the same text, so a group that captures a constant string (here the folder name, a hypothetical choice) merges everything:

$ hadoop jar ~/lib/emr-s3distcp-1.0.jar --src s3://my-bucket/my-folder/ --dest s3://my-bucket/logs/all-the-files-merged.log --groupBy '.*(my-folder).*' --outputCodec none

– Create directory on HDFS:

$ bin/hadoop fs -mkdir -p /user/ubuntu

– List HDFS directory:

$ bin/hadoop fs -ls /

– Put a file in HDFS:

$ bin/hadoop dfs -put localfile.txt /user/hadoop/hadoopfile

– Check HDFS filesystem utilization:

$ bin/hadoop dfsadmin -report

– Cat of file on HDFS:

$ bin/hadoop dfs -cat /user/ubuntu/RESULTS/part-00000

More commands:

http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html