February 25, 2015

Introducing Rolling Upgrades and Downgrades of an Apache Hadoop® YARN Cluster

This is the third post in a series exploring recent innovations in the Hadoop ecosystem that are included in Hortonworks Data Platform (HDP) 2.2. In this post, we introduce the theme of supporting rolling upgrades and downgrades of a Hadoop YARN cluster.

HDP 2.2 offers substantial innovations in Apache™ Hadoop YARN, enabling Hadoop users to efficiently store and interact with their data in a single repository, simultaneously using a wide variety of engines. Thematically, YARN in HDP 2.2 encompasses several tracks that we introduced in our first post of this series. In this post, we introduce rolling upgrades and downgrades of a YARN cluster, which we’ll expand on in richer detail in subsequent posts.

Avoid downtime during an upgrade

For a long time, administrators of a Hadoop YARN cluster would have to schedule site-wide downtime to upgrade cluster components. This meant all the running applications would be lost, greatly affecting workloads and requiring users to watch for potential issues and manually intervene, if needed, to survive the upgrades.

Over time, we enhanced YARN through resilient ResourceManager (RM) restarts, node-level upgrades and more—shielding administrators and users from these burdens. Collectively, our efforts and enhancements enabled rolling upgrades and downgrades of YARN clusters.

Rolling upgrades of a Hadoop cluster have three fundamental requirements:

  1. Side-by-side installation of different versioned software on each node
  2. Restartability and recoverability of individual Hadoop components
  3. High availability wherever applicable, so clients can failover to available components

Exploring the related details of rolling upgrades for the entire Hadoop ecosystem is beyond the scope of this post. In this post, we will assume that side-by-side installation of different versions of YARN has already been taken care of by the installer. We will focus on certain key efforts in rolling upgrades that are directly related to Hadoop YARN—specifically, YARN-666, the umbrella Apache JIRA tracking ticket (note that the ticket is still open for further enhancements).

Throughout the rest of this post, everything we say about rolling upgrades applies equally to rolling downgrades.

How a YARN cluster upgrade works

The rolling upgrade of a YARN cluster is part of the bigger upgrade of a Hadoop cluster with many other services and components. Here, we give a very brief overview of how a YARN cluster is upgraded. In general, YARN rolling upgrades include the following key steps (a small monitoring sketch follows the list):


  1. Upgrade (flip the current version) and restart MapReduce JobHistoryServer and YARN Timeline Service
  2. Assuming ZooKeeper has already been upgraded, upgrade and restart the RM, or each RM in turn in the case of a high-availability (HA) cluster
  3. Upgrade and restart NodeManagers (NMs) in a rolling fashion
  4. Finally, upgrade frameworks and clients so newly submitted applications work against newer framework versions
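
Steps 2 and 3 can be watched from outside the cluster while they are in flight. Below is a minimal Java sketch that polls the ResourceManager's REST API for node states, so an operator can confirm that each restarted NM has re-registered before moving on to the next one; the RM host and port are placeholders for your environment.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Minimal sketch: poll the ResourceManager's REST endpoint for node
 * states during a rolling NM upgrade. The RM host and port below are
 * placeholders for your cluster.
 */
public class NodeStatePoller {
  public static void main(String[] args) throws Exception {
    // /ws/v1/cluster/nodes returns a JSON list of NMs and their states
    // (RUNNING, LOST, DECOMMISSIONED, ...).
    URL nodes = new URL("http://rm-host.example.com:8088/ws/v1/cluster/nodes");
    HttpURLConnection conn = (HttpURLConnection) nodes.openConnection();
    conn.setRequestProperty("Accept", "application/json");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // inspect the "state" of each node entry
      }
    }
  }
}
```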

Let’s look at a simple example of a rolling upgrade process.

Image 1: upgrading a YARN cluster in a rolling fashion

Image 1 represents a snapshot of a YARN cluster under upgrade with two RMs (HA mode on), four NMs and one running application. At the specific point in time represented, both RMs and one NM have already finished the upgrade process, and thus are running the newer version of the platform. The three remaining NMs are still running the older version.

Note that, with YARN, the platform upgrade is decoupled from the upgrade of the frameworks and applications running in a cluster. In the Image 1 example, the application had been launched with the old version. So even though one NM where this application is running has been updated, all the containers of the application still run the older version of the application/framework. While the platform’s upgrade is in progress, the application (running the older framework version) can run without any disruption.

In a follow-up blog post, we will describe how Apache Ambari helps administrators automatically take care of the rolling upgrade workflow. We’ll now go deeper into some of the development efforts of the community that created this disruption-free upgrade of a YARN cluster.

Resilience of YARN applications across RM restart

ResourceManager (RM) is the central authority in YARN for resource management and scheduling. It is responsible for allocating resources to applications like Hadoop MapReduce (MR) jobs, Apache Tez directed acyclic graphs (DAGs), and all other frameworks and applications running atop YARN.

Before HDP 2.1, the RM was a potential single point of failure in a YARN cluster. Multiple community efforts started to address this limitation so that an RM restart or failover would be transparent to end users, with zero or minimal impact to running applications. Here are a few details about these efforts.

  • Phase I: preserve application queues
    The focus of this phase, delivered as part of HDP 2.1, was to enhance YARN so the RM could persist the application-queue state into a pluggable persistent state store and then reread that state automatically upon restart. This meant users no longer had to resubmit applications after an RM failure; existing applications in the queues are now automatically resubmitted once the RM restarts. YARN-128 is the umbrella YARN JIRA ticket that tracked this phase.

    This does not require YARN to save any running state in the RM. Rather, each application either simply runs from scratch after RM restarts or uses its own recovery mechanism to continue from where it left off.

  • Phase II: preserve work of running applications
    This phase was delivered as part of HDP 2.2, building on the groundwork of Phase I. Upon restart, the RM reconstructs its running state by taking the application queues recovered in Phase I, combining that information with the container statuses reported by all the NMs, and pulling together allocation requests from the running ApplicationMasters (AMs) in the cluster. This effort is tracked in its entirety under the JIRA ticket YARN-556.

    Applications no longer need to be restarted; they simply re-sync with the newly started RM. Thus, no work is lost due to an RM crash or reboot.

  • RM failover
    RM failover is a related effort that takes advantage of the two phases of RM restart and enables a YARN cluster to be highly available. It is tracked at YARN-149 and was delivered as part of HDP 2.1. The goal was to support failover of the RM from one instance to another (potentially running on a different machine) for high availability. It involves leader election, transfer of resource-management authority to the newly elected leader, and client redirection to the new leader. A configuration sketch covering all three of these efforts follows this list.
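
To make these three efforts concrete, here is a minimal sketch of the yarn-site.xml properties behind them, expressed through Hadoop's Configuration API. The ZooKeeper quorum and RM host names are placeholders, and on HDP these values are typically managed through Ambari rather than set by hand.

```java
import org.apache.hadoop.conf.Configuration;

/**
 * Sketch of the yarn-site.xml settings behind RM restart phases I/II
 * and RM HA. Host names and the ZooKeeper quorum are placeholders.
 */
public class RmRecoverySettings {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // Phase I: persist application/attempt state to a pluggable store.
    conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);
    conf.set("yarn.resourcemanager.store.class",
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore");
    conf.set("yarn.resourcemanager.zk-address", "zk1:2181,zk2:2181,zk3:2181");
    // Phase II: rebuild running state from NM and AM re-registration
    // instead of killing in-flight work on restart.
    conf.setBoolean("yarn.resourcemanager.work-preserving-recovery.enabled", true);
    // RM failover: two RM instances behind one logical cluster id.
    conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
    conf.set("yarn.resourcemanager.cluster-id", "yarn-cluster");
    conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
    conf.set("yarn.resourcemanager.hostname.rm1", "rm1.example.com");
    conf.set("yarn.resourcemanager.hostname.rm2", "rm2.example.com");
    return conf;
  }
}
```

With work-preserving recovery enabled, an RM restart or failover no longer kills running containers; the NMs and AMs simply re-sync with the new leader.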

Resilience of YARN applications across NM restarts

Before HDP 2.2, a YARN cluster was fairly resilient to one or a few NM restarts. That is, YARN would identify in due time that an NM had died and inform the applications, which were then expected to deal with node failures. Meanwhile, all running containers on that machine, across all applications, would be deemed dead and explicitly marked as such in the RM and the AMs.

However, nodes (and even whole racks) failing and rejoining the cluster is a fundamental design assumption of distributed applications built on top of YARN. This scheme of forcefully terminating work on nodes does not help with large-scale restarts of NMs across a cluster, specifically the kind that occurs during software upgrades.

Work-preserving NM restart in HDP 2.2

As mentioned, losing containers during NM restart is a major problem for upgrades. It becomes worse if the upgrade happens on a live cluster full of running applications. Losing in-progress work is somewhat bearable for short applications, but those that run for long periods of time—sometimes days—are severely impacted at these upgrade points.

Starting with HDP 2.2 and continuing through community efforts in Apache Hadoop 2.6, NMs now have a mode that constantly records important container lifecycle-related information. Because of this persistence, even if NMs restart during an upgrade process, the restarted daemons simply reload the local lifecycle state, reconnect to the still-alive containers (usually the process trees) and happily proceed as if they are just waking up from a good sleep. This is a major enabler of rolling upgrades. Essentially, individual machines in a cluster can be upgraded without losing any work and without any visible impact on the running applications.
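
This mode is driven by a handful of NM properties. A minimal sketch, with the recovery directory as a placeholder path:

```java
import org.apache.hadoop.conf.Configuration;

/**
 * Sketch of the NodeManager settings behind work-preserving NM restart.
 * The recovery directory is a placeholder path.
 */
public class NmRecoverySettings {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // Record container lifecycle state locally so a restarted NM can
    // reattach to still-running containers.
    conf.setBoolean("yarn.nodemanager.recovery.enabled", true);
    conf.set("yarn.nodemanager.recovery.dir", "/var/lib/hadoop-yarn/nm-recovery");
    // The NM must listen on a fixed port (not an ephemeral one) so the
    // RM and running containers can reconnect after the restart.
    conf.set("yarn.nodemanager.address", "0.0.0.0:45454");
    return conf;
  }
}
```

Pinning yarn.nodemanager.address to a fixed port matters: if the NM picked an ephemeral port on each start, nothing could find the restarted daemon at its old address.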

Changes to data formats and wire compatibility

In addition to improving the resilience of platform daemons across restarts, HDP 2.2 also comes with infrastructure changes that enable upgrading clusters in a rolling fashion. A few of the important changes (a simplified version-check sketch follows the list):

  • Versioning is implemented in the state stores of RM, as well as the NMs, to allow daemons running new software to read old state data and vice versa
  • Tokens’ data are now encoded using Protocol Buffers to support compatible evolution of security-related tokens over time, as well as to enable tokens distributed in a cluster to work across combinations of new RM with old and new NMs
  • Steps are now in place for wire protocols to evolve compatibly in support of wire compatibility between mixed versions of RM and NMs
  • New methods mandate checks to accept only specific versions and to avoid version drift in clusters
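
To illustrate the last two points, here is a deliberately simplified, hypothetical sketch of the kind of check a versioned state store performs when a daemon loads previously stored state; the class and version numbers are illustrative, not YARN's actual implementation.

```java
/**
 * Hypothetical sketch of a state-store version check: a daemon running
 * new software accepts state written by an older minor version, but
 * refuses state from a different (schema-incompatible) major version.
 */
public class StateStoreVersionCheck {
  static final int CURRENT_MAJOR = 1;
  static final int CURRENT_MINOR = 2;

  static void checkCompatibility(int storedMajor, int storedMinor) {
    if (storedMajor != CURRENT_MAJOR) {
      throw new IllegalStateException(
          "Incompatible state store version " + storedMajor + "." + storedMinor
          + "; expected major version " + CURRENT_MAJOR);
    }
    // Same major, different minor: compatible. The store is rewritten
    // at the current version so drift cannot accumulate across upgrades.
  }
}
```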

Impact of rolling upgrades on MR applications

We now turn our attention from the platform to the land of YARN applications.

One of the big challenges for MapReduce (MR) during rolling upgrades and downgrades is maintaining the consistency of the runtime libraries used by MR tasks launched while an upgrade is underway. Before HDP 2.2, all MR applications depended upon a local installation of the MR libraries on all the cluster machines. When a rolling upgrade or downgrade starts in a cluster, the locally installed versions of the MR libraries will change while an MR job is in progress. Tasks launched on either side of a node's upgrade can therefore see different versions of the libraries, resulting in inconsistent behavior and possibly runtime errors.

To solve this problem, MR in HDP 2.2 uses the well-known distributed-cache feature of YARN to deploy the MR libraries. HDP 2.2 comes with a prebuilt package (a tarball) that contains all the jars required by MR applications. This tarball is copied to the Apache Hadoop Distributed File System (HDFS) during deployment. Think of this process as installing a versioned framework on YARN via HDFS instead of on the cluster nodes. When launching an MR job, YARN's distributed cache ensures that the tarball is localized and expanded on each NM running the job's tasks. The MR framework client points all MR jobs to the localized libraries, and every launched task uses the localized jars. This way, all the MR tasks of the same job see a consistent version of the MR runtime libraries, irrespective of the platform version installed on any given node.
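
Concretely, this deployment is governed by two MR configuration properties. A minimal sketch, where the HDFS path and version string are placeholders (HDP's installer sets the real values):

```java
import org.apache.hadoop.conf.Configuration;

/**
 * Sketch of pointing MR jobs at a versioned framework tarball in HDFS
 * instead of node-local libraries. The path and version string are
 * placeholders.
 */
public class MrFrameworkOnHdfs {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // The tarball in HDFS; "#mr-framework" names the symlink that the
    // distributed cache creates in each task's working directory.
    conf.set("mapreduce.application.framework.path",
        "hdfs:///hdp/apps/2.2.0.0/mapreduce/mapreduce.tar.gz#mr-framework");
    // Tasks resolve their classpath against the localized tarball.
    conf.set("mapreduce.application.classpath",
        "$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:"
      + "$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:"
      + "$PWD/mr-framework/hadoop/share/hadoop/common/*:"
      + "$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:"
      + "$PWD/mr-framework/hadoop/share/hadoop/yarn/*:"
      + "$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:"
      + "$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:"
      + "$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*");
    return conf;
  }
}
```

The "#mr-framework" fragment is what lets the classpath entries refer to a stable $PWD/mr-framework prefix in every task, whatever the tarball's version happens to be.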

Impact of rolling upgrades on YARN applications in general

As with MR, any YARN application is affected by the support for YARN rolling upgrades in HDP 2.2. Even if an application doesn't have a story for its own upgrades or rolling upgrades, it should still be able to run smoothly in the presence of rolling upgrades of the underlying platform. In this context, YARN applications may have to deal with a couple of issues:

  • Consistency of runtime libraries

    Similar to MR, applications in general should not assume that runtime libraries, such as the HDFS client libraries on each node, stay the same throughout an application's entire execution. A simple solution is to package all runtime libraries as part of the application itself.

  • Consistency of job configuration files

    Configuration files on different nodes may also become inconsistent while a rolling upgrade is in progress. Any configuration file, whether a platform config (e.g., core-site.xml) or an application config (e.g., tez-site.xml), may change during a rolling upgrade if it is present on the cluster nodes. To overcome this problem, rather than relying on the config files installed on the cluster directly, we suggest using the distributed cache to propagate job configuration files from the client gateway machines, as sketched after this list.
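
As a rough sketch of the second point, a client can ship its own copy of a config file through the distributed cache so that every task reads the same version, regardless of what is installed on the node it lands on; the HDFS path below is a placeholder.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

/**
 * Sketch: ship a job's own copy of a config file from the client
 * gateway through the distributed cache. The HDFS path is a
 * placeholder.
 */
public class ShipJobConfig {
  public static Job buildJob(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "upgrade-safe-job");
    // "#app-site.xml" creates a stable symlink in each task's working
    // directory; tasks load the file from there, not from the node.
    job.addCacheFile(new URI("hdfs:///apps/myapp/conf/app-site.xml#app-site.xml"));
    return job;
  }
}
```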

It is also worth noting that if an application needs to support rolling upgrades, it should either be installed to all cluster machines in versioned directories or it should use the YARN distributed cache (similar to MR) to localize its framework.
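
For applications that take the distributed-cache route, the YARN-level mechanism looks roughly like the following sketch: the application client describes its versioned framework tarball as a LocalResource when building a container launch context. The tarball path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

/**
 * Sketch: a YARN application localizes its own versioned framework
 * tarball from HDFS, mirroring the MR approach. The tarball path is a
 * placeholder.
 */
public class FrameworkLocalizer {
  public static LocalResource frameworkTarball(Configuration conf) throws Exception {
    Path tarball = new Path("hdfs:///apps/myframework/1.0/myframework.tar.gz");
    FileStatus status = tarball.getFileSystem(conf).getFileStatus(tarball);
    // ARCHIVE resources are unpacked into the container's working dir;
    // size and timestamp pin the exact file version being localized.
    return LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(tarball),
        LocalResourceType.ARCHIVE,
        LocalResourceVisibility.APPLICATION,
        status.getLen(),
        status.getModificationTime());
  }
}
```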

Summary

In this post, we introduced the basics of rolling upgrades and downgrades in YARN. In subsequent posts, we will dive deeper into the goals, design and architectural details of each of these efforts. Rolling upgrades and downgrades can make the task of moving to HDP 2.2 a smooth experience for administrators and users, alike.
