This is the first post in our series on the motivations and architecture behind improvements to the restart resiliency of the Apache Hadoop YARN ResourceManager. Other posts in the series are:
The ResourceManager (RM) is the central authority of Apache Hadoop YARN for resource management and scheduling. It is responsible for allocating resources to applications such as Hadoop MapReduce jobs, Apache Tez DAGs, and other applications running atop YARN. Although applications can continue to perform their scheduled work without interruption, the RM is therefore a potential single point of failure in a YARN cluster, which is not acceptable in an enterprise production environment. To that end, the YARN community set out to plug this gap via various umbrella efforts.
The ultimate goal is to ensure that an RM restart or failover is completely transparent to end users, with zero or minimal impact on running applications. We therefore split the effort into multiple phases.
YARN-128 is the umbrella Apache Hadoop YARN JIRA ticket that tracked this entire effort.
This effort is still TBD and is tracked in its entirety under the JIRA ticket YARN-556.
A related effort that takes advantage of the above phases of RM restart and enables a YARN cluster to be highly available is ‘RM-failover’:
In the next blog post, we will start with Phase I: application-queue-preserving restart of the YARN ResourceManager. The remaining phases will be covered in subsequent posts.