This is the third post in a series that explores the theme of enabling diverse workloads in YARN. Our introductory post to understand the context around all the new features for diverse workloads as part of YARN in HDP 2.2, and a related post on CPU scheduling.
One of the core responsibilities of YARN is monitoring and limiting resource usage of application containers. When it comes to resource management there are two parts:
Before HDP 2.2, memory has been the only supported resource for applications on YARN for a long time. This means that applications can request for a certain amount of memory resources, the ResourceManager (RM) then finds nodes with memory enough to satisfy resource-requests and schedules containers on those nodes. The NM then brings up the containers on the node, monitors the memory usage and if the memory usage of any container exceeds its allocation, it kills that container. This ensures that containers are always strictly limited to the memory requested.
With HDP 2.2, as we discussed before in a previous post on CPU scheduling, YARN now also supports another type of resource – CPU, represented as vcores. The allocation behavior with respect to CPU is the similar to memory – nodes inform the RM how many vcores they have and the scheduler uses that number to schedule application containers. When it comes to resource isolation and enforcement, the story extends itself a bit – how do we ensure that containers don’t exceed their vcore allocation? What’s stopping an errant container from spawning a bunch of threads and consume all the CPU on the node?
CGroups is a mechanism for aggregating/partitioning sets of processes (tasks in Linux’s nomenclature), and all their children, into hierarchical groups with specialized behaviour. CGroups is a Linux kernel feature and was merged into kernel version 2.6.24. CGroups allows control of various system resources such as memory, CPU, blockio, etc. In addition, CGroups also take care of ensuring that the resource usage of child processes are limited as per the limit set for their parent. From YARN’s perspective, this allows the NM to limit the CPU usage of containers and fits in perfectly with the current YARN’s model of resource isolation and enforcement.
The end-to-end control-flow works as follows:
To enable CGroups in YARN, you must be running Linux kernel version greater than 2.6.24. You must also setup the NM to use the LinuxContainerExecutor which in turn means that you should have the native container-executor binary as part of your install. This binary is used by the LinuxContainerExecutor to launch containers with the appropriate restrictions.
There are some configuration variables that you should be aware of when setting up CGroups. Some of these properties are for setting up the LinuxContainerExecutor while the rest are specifically directed towards setting up CGroups:
Some organizations may wish to run non-YARN processes on the same node as the NM. These may be monitoring agents, HBase servers, etc. How do you share CPU between the containers launched by the NM and these external tasks? You may not want your YARN containers using all the CPU on any given physical machine. For this purpose we’ve added a new configuration property: yarn.nodemanager.resource.percentage-physical-cpu-limit. This variable lets you set the maximum amount of CPU to be used by all the YARN containers put together. This limit is a hard limit and can be used to ensure that YARN containers can run on your nodes without consuming all CPU, and thus leaving enough capacity for non-YARN process.
By default, the CGroups-based container resource restrictions are only soft limits – if there is spare CPU available on a given node, the containers are allowed to use it (potentially far exceeding their allocation). However if an additional application container is launched that needs CPU, existing containers’ CPU usage is scaled back appropriately. This may lead to situations where applications which are CPU bound finish faster when the cluster is either idle or lightly loaded but take longer when the cluster has more load. This behaviour contrasts with memory isolation in YARN – if users don’t size their containers correctly, their containers will be killed potentially failing the applications too. On the other hand, once the containers are sized correctly, applications will run in a predictable manner.
The above behavior can be turned on/off by a config setting – administrators can set yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage to true if they wish to restrict all containers to their specific resource allocation. This sets a hard limit on the container’s CPU usage, adding some predictability in application performance. This setting is also useful when benchmarking your workload to avoid cluster load adding noise to your measurements.
If you are looking for applications that can benefit from using CGroups, Storm on YARN is a really good example of an application that can gain by enabling CGroups. In addition, the DistributedShell application and the MapReduce framework also have support for specifying vcores and both benefit by using CGroups.
Below are a couple of screenshots that show how a CGroups hierarchy looks like when used by YARN.
The following gives the list of all the individual settings in the cgroups.
When running an application, YARN creates a Cgroup for each container naming it after its ContainerID. For e.g, the following is a situation where YARN is running 6 containers of one application in parallel on a node.
In this and the previous companion posts, we’ve covered various features that help enabling diverse workloads on YARN. Specifically, we discussed about scheduling & isolating workloads that require CPU as resource, impact of adding CPU as resource on applications. We hope you found these posts useful!