Apache Hadoop YARN: Resilience of YARN Applications across Resource Manager Restart
This is the first post in our series on the motivations and architecture of the improvements that make Apache Hadoop YARN resilient across ResourceManager restart. Other posts in the series are:
- Resilience of Apache Hadoop YARN Applications across ResourceManager Restart – Phase 1
- Resilience of Apache Hadoop YARN Applications across ResourceManager Restart – Phase 2
The ResourceManager (RM) is the central authority in Apache Hadoop YARN for resource management and scheduling. It is responsible for allocating resources to applications such as Hadoop MapReduce jobs, Apache Tez DAGs, and other applications running atop YARN. Although already-scheduled work can continue on the NodeManagers for a while, an RM outage halts all new scheduling and, historically, killed running applications on restart. The RM is therefore a potential single point of failure in a YARN cluster, which is not acceptable in an enterprise production environment. To close this gap, the YARN community set out on several umbrella efforts.
The ultimate goal is to make RM restart or fail-over completely transparent to end-users, with zero or minimal impact on running applications. We split the effort into multiple phases:
- Phase I: Preserve application-queues: The focus of this phase is on enhancing the system so that the ResourceManager can persist application-queue state to a pluggable state-store and read that state back automatically upon restart. Users should not be required to re-submit their applications; existing applications in the queues are simply re-triggered after the RM restarts. This does not require YARN to save any running state in the ResourceManager: each application either runs again from scratch after the RM restarts or uses its own recovery mechanism to continue from where it left off.
YARN-128 is the umbrella Apache Hadoop YARN JIRA ticket that tracked this entire effort.
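In the Hadoop releases that shipped this work, Phase I recovery is switched on through `yarn-site.xml`. A hedged sketch follows; the exact properties and available state-store implementations depend on your Hadoop version, and the ZooKeeper quorum hosts below are placeholders:

```xml
<!-- yarn-site.xml: enable RM recovery from a pluggable state-store -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- ZKRMStateStore is one implementation; FileSystemRMStateStore is another -->
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <!-- Required when using the ZooKeeper-based store; hostnames are placeholders -->
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```

With only these settings, recovery is non-work-preserving: submitted applications survive the restart, but their running containers do not.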
- Phase II: Preserve work of running applications: Building on the groundwork of Phase I, this phase focuses on reconstructing the running state of the previous RM instance. The restarted RM combines the application-queue state recovered in Phase I with the container-statuses reported by all the NodeManagers and the outstanding allocation requests pulled from the running ApplicationMasters in the cluster. Applications do not need to be restarted; they simply re-sync with the newly started RM, so no work is lost due to an RM crash-reboot event.
This effort is still underway and is tracked in its entirety under the JIRA ticket YARN-556.
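In Hadoop releases that eventually shipped this work, work-preserving recovery surfaced as a single additional switch layered on top of the Phase I settings. A hedged sketch, to be verified against your Hadoop version:

```xml
<!-- yarn-site.xml: opt into work-preserving RM recovery
     (takes effect only when yarn.resourcemanager.recovery.enabled is also true) -->
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
```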
A related effort that takes advantage of the above phases of RM restart and enables a YARN cluster to be highly available is 'RM failover':
- RM failover: Tracked at YARN-149, this effort aims to support failing over the ResourceManager from one instance to another, potentially running on a different machine, for high availability. It involves leader election, transfer of resource-management authority to the newly elected leader, and redirection of clients to the new leader.
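In the Hadoop releases where YARN-149 landed, failover is configured by declaring multiple RM instances in `yarn-site.xml`. The sketch below is hedged: property names should be checked against your Hadoop version, and the `rm1`/`rm2` IDs and hostnames are placeholders:

```xml
<!-- yarn-site.xml: two-node RM high availability -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Logical IDs for the RM instances; values are placeholders -->
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2.example.com</value>
</property>
<property>
  <!-- ZooKeeper quorum used for leader election; hostnames are placeholders -->
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```

Pairing this with the ZooKeeper-based state-store lets the newly elected leader recover the state the previous leader persisted.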
In the next blog post, we will start with Phase I: application-queue-preserving restart of the YARN ResourceManager. The remaining phases will be covered in subsequent posts.