How to Plan and Configure YARN and MapReduce 2 in HDP 2.0

As part of HDP 2.0 Beta, YARN takes the resource management capabilities that were embedded in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best: process data. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management layer.

In this blog post we’ll walk through how to plan for and configure processing capacity in your enterprise HDP 2.0 cluster deployment, covering YARN and MapReduce 2. We’ll use an example physical cluster of slave nodes, each with 48 GB of RAM, 12 disks, and two hex-core CPUs (12 cores total).

YARN takes into account all the available compute resources on each machine in the cluster. Based on the available resources, YARN negotiates resource requests from applications (such as MapReduce) running in the cluster. YARN then provides processing capacity to each application by allocating Containers. A Container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements (memory, CPU, etc.).

Configuring YARN

In a Hadoop cluster, it’s vital to balance the usage of RAM, CPU and disk so that processing is not constrained by any one of these cluster resources. As a general recommendation, we’ve found that allowing for 1-2 Containers per disk and per core gives the best balance for cluster utilization. So with our example cluster node with 12 disks and 12 cores, we will allow a maximum of 20 Containers to be allocated to each node.
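If your Hadoop version supports CPU scheduling, you can also tell YARN how many virtual cores each node has available. This is not part of the original sizing exercise; a minimal sketch, assuming the standard yarn.nodemanager.resource.cpu-vcores property available in Hadoop 2.x:

[code]
<!-- In yarn-site.xml: advertise one vcore per physical core on our 12-core nodes -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>12</value>
</property>
[/code]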

Each machine in our cluster has 48 GB of RAM. Some of this RAM should be reserved for Operating System usage. On each node, we’ll assign 40 GB RAM for YARN to use and keep 8 GB for the Operating System. The following property sets the maximum memory YARN can utilize on the node:

In yarn-site.xml:

[code]
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value>
</property>
[/code]

The next step is to provide YARN guidance on how to break up the total resources available into Containers. You do this by specifying the minimum unit of RAM to allocate for a Container. We want to allow for a maximum of 20 Containers, and thus need (40 GB total RAM) / (20 Containers) = 2 GB minimum per Container:

In yarn-site.xml:

[code]
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>
[/code]

YARN will never allocate a Container with less RAM than yarn.scheduler.minimum-allocation-mb; requests for more memory are rounded up to the next multiple of this value.
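There is also an upper bound on how large any single Container request can be. The post does not cover it, but a minimal sketch, assuming the standard yarn.scheduler.maximum-allocation-mb property, would cap a single Container at the full 40 GB available on the node:

[code]
<!-- In yarn-site.xml: the largest Container a single request may be granted -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>40960</value>
</property>
[/code]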

Configuring MapReduce 2

MapReduce 2 runs on top of YARN and utilizes YARN Containers to schedule and execute its map and reduce tasks.

When configuring MapReduce 2 resource utilization on YARN, there are three aspects to consider:

  1. The physical RAM limit for each Map and Reduce task
  2. The JVM heap size limit for each task
  3. The amount of virtual memory each task will get

You can define how much maximum memory each Map and Reduce task will take. Since each Map and each Reduce will run in a separate Container, these maximum memory settings should be at least equal to the YARN minimum Container allocation.

For our example cluster, we have the minimum RAM for a Container (yarn.scheduler.minimum-allocation-mb) = 2 GB. We’ll thus assign 4 GB for Map task Containers, and 8 GB for Reduce task Containers.

In mapred-site.xml:

[code]
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>
</property>
[/code]
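Note that the MapReduce ApplicationMaster that coordinates each job also runs in its own Container, with its own memory settings. This is not covered above; a minimal sketch, assuming the standard yarn.app.mapreduce.am.resource.mb and yarn.app.mapreduce.am.command-opts properties, and sizing the AM at the 2 GB Container minimum:

[code]
<!-- In mapred-site.xml: Container size and JVM heap for the MapReduce ApplicationMaster -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx1536m</value>
</property>
[/code]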

Each Container will run JVMs for the Map and Reduce tasks. The JVM heap size should be set lower than the Map and Reduce memory defined above, so that the tasks stay within the bounds of the Container memory allocated by YARN.

In mapred-site.xml:

[code]
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3072m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx6144m</value>
</property>
[/code]

The above settings configure the upper limit of the physical RAM that Map and Reduce tasks will use. The virtual memory (physical + paged memory) upper limit for each Map and Reduce task is determined by the virtual memory ratio each YARN Container is allowed. This is set by the following configuration, and the default value is 2.1:

In yarn-site.xml:

[code]
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
[/code]
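The NodeManager enforces this virtual memory limit and will kill Containers that exceed it. If legitimate tasks are being killed this way, the check itself can be relaxed; a minimal sketch, assuming the standard yarn.nodemanager.vmem-check-enabled property (true by default):

[code]
<!-- In yarn-site.xml: turn off virtual-memory enforcement entirely -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
[/code]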

Thus, with the above settings on our example cluster, each Map task will get the following memory allocations:

  • Total physical RAM allocated = 4 GB
  • JVM heap space upper limit within the Map task Container = 3 GB
  • Virtual memory upper limit = 4*2.1 = 8.4 GB
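
By the same arithmetic, each Reduce task will get:

  • Total physical RAM allocated = 8 GB
  • JVM heap space upper limit within the Reduce task Container = 6 GB
  • Virtual memory upper limit = 8*2.1 = 16.8 GB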

With YARN and MapReduce 2, there are no longer pre-configured static slots for Map and Reduce tasks. The entire cluster is available for dynamic resource allocation of Maps and Reduces as needed by the job. In our example cluster, with the above configurations, YARN will be able to allocate on each node up to 10 mappers (40/4), or 5 reducers (40/8), or any combination in between that fits within the 40 GB limit (for example, 6 mappers and 2 reducers: 6*4 GB + 2*8 GB = 40 GB).
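These settings are cluster-wide defaults, and they can be overridden per job at submission time. As a hypothetical example (the jar name and input/output paths are placeholders, not from the original post), a memory-light job could request smaller Containers:

[code]
# Override Map memory for this job only; YARN rounds the request up to a
# multiple of the 2 GB scheduler minimum
hadoop jar hadoop-mapreduce-examples.jar wordcount \
  -Dmapreduce.map.memory.mb=2048 \
  -Dmapreduce.map.java.opts=-Xmx1536m \
  /user/test/input /user/test/output
[/code]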

Next Steps

With HDP 2.0 Beta, you can use Apache Ambari to configure YARN and MapReduce 2. Download HDP 2.0 Beta and deploy today!


Comments

Brahma Reddy B | December 3, 2013 at 2:22 am

“As a general recommendation, we’ve found that allowing for 1-2 Containers per disk and per core gives the best balance for cluster utilization. So with our example cluster node with 12 disks and 12 cores, we will allow for 20 maximum Containers to be allocated to each node.”

Yes, This I had also seen in my cluster like allowing for 1-2 Containers per disk and per core gives the best balance for cluster utilization..Same can be done for MapReduce 2 also right..Why we need to run 10 Mappers and 5 reducers..? I mean here also we can run 1-2 containers per disk and per core right…Is there any reason for following..?

“In our example cluster, with the above configurations, YARN will be able to allocate on each node up to 10 mappers (40/4) or 5 reducers (40/8) or a permutation within that.”


sachin | October 15, 2014 at 5:35 am

can you please tell me what is the relation between the number of cores of node machine and the number of mappers that can run on that machine
in your case you have 12 core machine and the maximum number of mappers as you told is 10
so i am a little bit confused
please help

Alex McLintock | December 18, 2014 at 7:54 am

How does this page change if you are running Tez instead of Mapreduce?

Priyadarshini | April 12, 2015 at 5:09 am

Hi Brahma

I think as per the example configuration map task memory is 4GB (mapreduce.map.memory.mb = 4096 ) and reduce task physical memory is 8GB (mapreduce.reduce.memory.mb = 8192 ). Node Manager’s physical memory is 40GB and that’s the reason there will be a maximum of 10 mappers((40/4) and 5 reducers(40/8). Hope it clears the doubt.

Gaurav | September 11, 2015 at 7:44 am

If we update the below-mentioned value, what impact will it have on my cluster?

yarn.nodemanager.vmem-pmem-ratio
10:1

And in what situation should we change these values?

Rahul | September 18, 2015 at 9:30 pm

“With YARN and MapReduce 2, there are no longer pre-configured static slots for Map and Reduce tasks. The entire cluster is available for dynamic resource allocation of Maps and Reduces as needed by the job.”

How can I limit this feature so that I can run multiple resource-intensive jobs without them waiting for a previously running job to finish?

Yogesh | October 22, 2015 at 2:19 pm

Virtual memory upper limit = 4*2.1 = 8.2 GB

is it not 8.4GB ?
Virtual memory upper limit = 4*2.1 = 8.4 GB

    Rahul | December 1, 2015 at 10:42 pm

    Good catch

Venkat | November 28, 2015 at 6:09 am

How we can allocate max 10 mappers and 5 reducers of total 40 GB.

in the 40GB quota containers will be allocated 2 GB of RAM right?? in that case it wont allocate 10 mappers and 5 reducers right??

Could you please explain?

    Avinash Chandru | January 27, 2016 at 8:50 pm

    Hello Venkat. I believe it is referred to as that we can allocate 10 mappers OR 5 reducers for a total of 40GB RAM.

      chandra | January 31, 2016 at 2:03 pm

        Hi Avinash,
        Each data node has 40 GB for Containers running Map/Reduce tasks, right? If a Container is allocated 2 GB, is that the lower limit for a Mapper/Reducer, given they have limits of 4 GB/8 GB respectively? Or are Containers not the Mappers/Reducers?

        Avinash Chandru | February 2, 2016 at 1:34 am

        First let’s calculate the maximum Map or Reduce task that can run concurrently in a node:
        In the example, they have allocated 2 GB as minimum for a map and reduce task can take. So 40 GB / 2 GB = 20 containers of map tasks or reduce tasks can be run maximum in a node.

        Now let’s calculate minimum Map or Reduce task that can run concurrently in a node:
        (NOTE: Maximum memory that can be used by Container hosting map tasks differs from reduce task as reducers need more memory for aggregation and stuff)
        * Now if we set the maximum memory for map task as 4 GB, No of containers= 40 GB / 4 GB = 10 map tasks.
        * If we set the maximum memory for reduce task as 8 GB. No of containers = 40 GB / 8 GB = 5 reduce tasks.

        Hope this helps; however, I am going to ask for a discussion on this topic in the newly updated community.

