This is a topic that always sparks discussion…
In Hadoop 1, the number of tasks launched per node was specified via the settings mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.
These settings are ignored in Hadoop 2.
In Hadoop 2 with YARN, we can determine how many concurrent tasks are launched per node by dividing the resources allocated to YARN on that node by the resources requested by each MapReduce task, and taking the minimum of the two resulting values (one for memory, one for CPU).
This approach is an improvement over that of Hadoop 1, because the administrator no longer has to bundle CPU and memory into a Hadoop-specific concept of a “slot”.
The number of tasks that will be spawned per node:
min( yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb ,
     yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores )
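For example, assume a node that hands YARN 40960 MB of memory and 16 vcores, with each map task requesting 2048 MB and 1 vcore (all values hypothetical):

min( 40960 / 2048 , 16 / 1 ) = min( 20 , 16 ) = 16 concurrent map tasks

Memory alone would allow 20 containers, but only 16 vcores are available, so CPU is the limiting resource on this node.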
The resulting value can then be set in the ‘mapreduce.job.maps‘ property in the ‘mapred-site.xml‘ file.
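As a sketch, the hypothetical values from the example above would be configured like this (‘yarn-site.xml‘ holds the resources the NodeManager offers, ‘mapred-site.xml‘ holds the per-task requests):

<!-- yarn-site.xml: resources this node offers to YARN (hypothetical values) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>
</property>

<!-- mapred-site.xml: resources each map task requests (hypothetical values) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>1</value>
</property>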
Of course, YARN is more dynamic than that, and each job can have unique resource requirements — so in a multitenant cluster with different types of jobs running, the calculation isn’t as straightforward.
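To illustrate that last point, a job can carry its own resource requests at submission time. Here is a minimal sketch in Java using the standard MapReduce ‘Job‘ API; the values and job name are hypothetical:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PerJobResources {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Hypothetical per-job override: this job's map tasks request
        // 4096 MB and 2 vcores instead of the cluster-wide defaults,
        // so fewer of them fit concurrently on each node.
        conf.setInt("mapreduce.map.memory.mb", 4096);
        conf.setInt("mapreduce.map.cpu.vcores", 2);
        Job job = Job.getInstance(conf, "memory-hungry-job");
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}

With jobs like this in the mix, the per-node concurrency varies with whichever containers the scheduler happens to pack onto each node.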