This is the third post in a series exploring recent innovations in the Hadoop ecosystem that are included in Hortonworks Data Platform (HDP) 2.2. In this post, we introduce the theme of supporting rolling upgrades and downgrades of a Hadoop YARN cluster.
HDP 2.2 offers substantial innovations in Apache™ Hadoop YARN, enabling Hadoop users to efficiently store and interact with their data in a single repository, simultaneously using a wide variety of engines. Thematically, YARN in HDP 2.2 encompasses several tracks that we introduced in our first post of this series. In this post, we introduce rolling upgrades and downgrades of a YARN cluster, which we’ll expand on in richer detail in subsequent posts.
For a long time, administrators of a Hadoop YARN cluster had to schedule site-wide downtime to upgrade cluster components. This meant that all running applications were lost, greatly affecting workloads and forcing users to watch for potential issues and intervene manually, if needed, to survive the upgrades.
Over time, we enhanced YARN through resilient ResourceManager (RM) restarts, node-level upgrades and more—shielding administrators and users from these burdens. Collectively, our efforts and enhancements enabled rolling upgrades and downgrades of YARN clusters.
Rolling upgrades of a Hadoop cluster have three fundamental requirements:
Exploring the related details of rolling upgrades for the entire Hadoop ecosystem is beyond the scope of this post. In this post, we will assume that side-by-side installation of different versions of YARN has already been taken care of by the installer. We will focus on certain key efforts in rolling upgrades that are directly related to Hadoop YARN—specifically, YARN-666, the umbrella Apache JIRA tracking ticket (note that the ticket is still open for further enhancements).
For the rest of this post, everything said about rolling upgrades applies equally to rolling downgrades.
The rolling upgrade of any YARN cluster is part of the bigger upgrade of a Hadoop cluster with many other services and components. Here, we give a very brief overview of how a YARN cluster is upgraded. In general, YARN rolling upgrades include the following key steps:
Let’s look at a simple example of a rolling upgrade process.
Image 1 represents a snapshot of a YARN cluster under upgrade with two ResourceManagers (RMs, with HA mode on), four NodeManagers (NMs) and one running application. At the specific point in time represented, both RMs and one NM have already finished the upgrade process and are thus running the newer version of the platform. The three remaining NMs are still running the older version.
Note that, with YARN, the platform upgrade is decoupled from the upgrade of the frameworks and applications running in a cluster. In the Image 1 example, the application had been launched with the old version. So even though one NM where this application is running has been updated, all the containers of the application still run the older version of the application/framework. While the platform’s upgrade is in progress, the application (running the older framework version) can run without any disruption.
In a follow-up blog post, we will describe how Apache Ambari helps administrators automatically take care of the rolling upgrade workflow. We’ll now go deeper into some of the development efforts of the community that created this disruption-free upgrade of a YARN cluster.
ResourceManager (RM) is the central authority in YARN for resource management and scheduling. It is responsible for allocating resources to applications like Hadoop MapReduce (MR) jobs, Apache Tez directed acyclic graphs (DAGs), and all other frameworks and applications running atop YARN.
Before HDP 2.1, the RM was a potential single point of failure in a YARN cluster. Multiple community efforts started to address this limitation so that an RM restart or failover would be transparent to end users, with zero or minimal impact to running applications. Here are a few details about these efforts.
This does not require YARN to save any running state in the RM. Rather, after the RM restarts, each application either simply runs again from scratch or uses its own recovery mechanism to continue from where it left off.
Applications no longer need to be restarted; they simply re-sync with the newly started RM. Thus, no work is lost due to an RM crash or reboot.
RM failover is a related effort that takes advantage of the two phases of RM restart, plus enables a YARN cluster to be highly available. This is tracked at YARN-149 and was delivered as part of HDP 2.1. The goal was to support failover of RM from one instance to another—potentially running on a different machine—for high availability. It involves leader election, transfer of resource management authority to a newly elected leader, and client redirection to the new leader.
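To make this concrete, both RM restart and RM failover are enabled purely through configuration; applications need no code changes. Here is a minimal yarn-site.xml sketch (the property names come from Apache Hadoop; the ZooKeeper quorum and RM identifiers shown are placeholders):

```xml
<!-- Persist RM state so a restarted or newly elected RM can recover it.
     ZooKeeper doubles as the store and the leader-election mechanism. -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
<!-- Run two RM instances in active/standby mode. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
```

With these settings in place, clients and NMs transparently redirect to whichever RM instance wins the election after a failover.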
Before HDP 2.2, a YARN cluster was fairly resilient to one or a few NM restarts. That is, YARN would identify in due time that an NM had died and inform the applications, which were then expected to deal with the node failure. Meanwhile, any running containers on that machine would be deemed dead and explicitly marked as such in the RM and the ApplicationMaster (AM).
However, nodes (and even entire racks) failing and rejoining the cluster is a fundamental design assumption of distributed applications built on top of YARN, and this scheme of forcefully terminating work on nodes won't help with large-scale restarts of NMs in a cluster—specifically, the kind that occur during software upgrades.
As mentioned, losing containers during NM restart is a major problem for upgrades. It becomes worse if the upgrade happens on a live cluster full of running applications. Losing in-progress work is somewhat bearable for short applications, but those that run for long periods of time—sometimes days—are severely impacted at these upgrade points.
Starting with HDP 2.2 and continuing through community efforts in Apache Hadoop 2.6, NMs now have a mode that constantly records important container lifecycle-related information. Because of this persistency, even if NMs restart during an upgrade process, the restarted slaves simply reload the local lifecycle state, reconnect to the still-running containers (usually the process trees) and happily proceed as if they are just waking up from a good sleep. This is a major enabler of rolling upgrades. Essentially, individual machines in a cluster can be upgraded without losing any work and without any visible impact to the running applications.
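The idea behind this work-preserving restart can be sketched in a few lines of Python. This is purely illustrative, not NM code: the real NM persists container state to a local store when `yarn.nodemanager.recovery.enabled` is set, with the store location given by `yarn.nodemanager.recovery.dir`; the class and file names below are invented for the sketch.

```python
"""Toy sketch of a work-preserving daemon restart: container lifecycle
records survive in a local store, so a restarted daemon can reattach."""
import json
import os
import tempfile

STATE_FILE = os.path.join(tempfile.gettempdir(), "nm-state-demo.json")

class MiniNodeManager:
    """Keeps container records in a local file, mimicking the real NM's
    recovery store. On startup, any existing state is reloaded."""

    def __init__(self, state_path=STATE_FILE):
        self.state_path = state_path
        self.containers = {}
        if os.path.exists(state_path):  # recovery path after a restart
            with open(state_path) as f:
                self.containers = json.load(f)

    def launch_container(self, container_id, pid):
        # Record the container before/while it runs, then persist.
        self.containers[container_id] = {"pid": pid, "state": "RUNNING"}
        self._persist()

    def _persist(self):
        with open(self.state_path, "w") as f:
            json.dump(self.containers, f)

# First "daemon" launches work, then is upgraded (the process goes away;
# the container's own process tree keeps running, untouched).
nm = MiniNodeManager()
nm.launch_container("container_01", pid=4242)
del nm

# The upgraded daemon reloads the store and reattaches to the container.
nm2 = MiniNodeManager()
print(nm2.containers["container_01"]["state"])  # RUNNING
os.remove(STATE_FILE)
```

The essential point is that the daemon's restart is decoupled from the lifetime of the work it supervises, which is exactly what lets an upgrade replace NM binaries node by node without killing containers.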
In addition to improving the resilience of platform daemons across restart, HDP 2.2 also comes with infrastructure changes that enable upgrading clusters in a rolling fashion. A few of these important changes:
We now turn our attention from the platform to the land of YARN applications.
One of the big challenges for MapReduce (MR) during rolling upgrades and downgrades is maintaining the consistency of the runtime libraries used by MR tasks launched while an upgrade is underway. Before HDP 2.2, all MR applications depended upon a local installation of the MR libraries on all cluster machines. When a rolling upgrade or downgrade starts in a cluster, the locally installed versions of those libraries will change while an MR job is in progress. Tasks launched on either side of a cluster upgrade may therefore see different versions of the libraries, resulting in inconsistent behavior and possibly runtime errors.
To solve this problem, MR in HDP 2.2 uses the well-known distributed-cache feature of YARN to deploy MR libraries. HDP 2.2 comes with a prebuilt package (a tarball) that contains all the jars required by MR applications. This tarball is copied to the Apache Hadoop Distributed File System (HDFS) during deployment. Think of this process as installing a versioned framework on YARN via HDFS instead of on cluster nodes. When launching an MR job, YARN's distributed cache ensures that the tarball is localized and expanded on each NM running the job's tasks. The MR framework client points all MR jobs to the localized libraries, and every launched task uses the localized jars. This way, all the MR tasks of the same job will see a consistent version of the MR runtime libraries, irrespective of the platform version running on any given node.
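This deployment model is driven by MR configuration. A sketch of the relevant mapred-site.xml entry follows; the property name exists in Apache Hadoop's MR, but the HDFS path and version string shown here are placeholders:

```xml
<!-- Load the MR framework from a versioned tarball in HDFS, localized via
     YARN's distributed cache. The fragment after '#' is the link name that
     tasks see in their working directory. -->
<property>
  <name>mapreduce.application.framework.path</name>
  <value>hdfs:///hdp/apps/2.2.0.0/mapreduce/mapreduce.tar.gz#mr-framework</value>
</property>
<!-- mapreduce.application.classpath must then reference jars inside the
     localized 'mr-framework' directory rather than node-local installs. -->
```

Because the path is versioned, jobs started before an upgrade keep resolving the old tarball to completion, while new jobs pick up the new one.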
As with MR, any YARN application is affected by the support for YARN rolling upgrades in HDP 2.2. Even if an application doesn't have a story for its own upgrades or rolling upgrades, it should still be able to run smoothly in the presence of rolling upgrades of the underlying platform. In this context, YARN applications may have to deal with a couple of issues:
Similar to MR, applications in general should not make the assumption that runtime libraries, such as the HDFS client libraries on each node, stay the same during an application's entire execution. A simple solution is to package all runtime libraries as part of the app itself.
Configuration files on different nodes may also become inconsistent while a rolling upgrade is in progress. Any configuration file, whether platform configs (e.g., core-site.xml) or application configs (e.g., tez-site.xml), may change during a rolling upgrade if present on the cluster nodes. To overcome this problem, rather than relying on the config files installed on the cluster directly, we suggest using the distributed cache to propagate the job configuration files from the client gateway machines.
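One way to do this with standard Hadoop tooling: any driver that goes through GenericOptionsParser (i.e., implements the Tool interface) accepts a `-files` option that ships the listed files through the distributed cache into every task's working directory. The jar name, driver class and paths below are placeholders for illustration:

```shell
# Ship the gateway machine's copy of the config along with the job, so every
# task sees this exact version regardless of what is installed on its node.
hadoop jar my-app.jar com.example.MyDriver \
  -files /etc/tez/conf/tez-site.xml \
  /input /output
```

Tasks then read the localized `tez-site.xml` from their working directory instead of the node-local installation.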
It is also worth noting that if an application needs to support rolling upgrades, it should either be installed on all cluster machines in versioned directories or use the YARN distributed cache (similar to MR) to localize its framework.
In this post, we introduced the basics of rolling upgrades and downgrades in YARN. In subsequent posts, we will dive deeper into the goals, design and architectural details of each of these efforts. Rolling upgrades and downgrades can make the task of moving to HDP 2.2 a smooth experience for administrators and users, alike.