Today, organizations use the Apache Hadoop™ stack as a central data lake to store their critical datasets and power their analytical processing workloads. A key requirement for the Hadoop cluster and the services running on it is to be highly available and to continue functioning flawlessly while software is being upgraded. In the past, the Hadoop community has added enterprise features to various components of the stack, such as High Availability (HA), snapshots, and improved disaster recovery. The next step in this evolution is the addition of support for rolling upgrades.
Our approach at Hortonworks is to always ensure that the Hadoop stack we ship is truly enterprise ready. While software installation and wire-compatibility were considered by many to be the sole missing pieces for rolling upgrades, it turns out there were additional issues that needed to be addressed for a true enterprise-quality rolling upgrade solution. This blog describes the wide range of issues we have addressed to reach this milestone in various Hadoop stack components, culminating in the HDP 2.2 release. A subsequent series of blogs will dive deeper into each of the key enhancements. All bug fixes and enhancements have been included in each of the respective Apache open source projects for wide community use.
Historically, administrators of a Hadoop cluster have needed to schedule site-wide downtime in order to upgrade the cluster software, resulting in interruptions to user workloads and the potential loss of all running applications. The rolling upgrades feature in HDP 2.2 addresses this limitation.
The rolling upgrades feature fundamentally involves upgrading the software of an active Hadoop cluster from one version of the Hadoop stack to another, without requiring a shutdown of the cluster and all of its applications/jobs. Users can continue to use the cluster actively while the upgrade is in progress. Typically, the cluster administrator will do a partial upgrade, restricting it to a subset of the cluster, observe the behavior of the newer stack, and finally continue with the upgrade of the remaining cluster if the partial upgrade went well. After the partial upgrade, and before committing to a cluster-wide upgrade, administrators can also choose to downgrade the cluster's subset back to the previous version as needed (e.g., if a serious upgrade-related problem is encountered, performance has degraded, or some applications are not running well).
With the Apache Hadoop release 2.2 (and HDP 2.0), Hadoop moved to wire-compatible protocols using protocol buffers for both internal and client communications. This is the key and most obvious prerequisite for supporting rolling upgrades. The next, relatively straightforward, requirement is a packaging solution that allows side-by-side installs of two versions of the software on every machine. Is that sufficient? No, not for enterprise-quality rolling upgrades! A Hadoop stack is a distributed system made up of several different services, and there are several key issues that have to be considered and addressed at all layers.
This blog and the subsequent blog series will describe a variety of issues across the stack that we have addressed over the last 6-8 months to provide a comprehensive, enterprise-quality rolling upgrade. The following is a list of key areas that required significant engineering effort to implement rolling upgrades of a Hadoop stack:

- Packaging that allows side-by-side installation of multiple versions of the stack
- Safeguarding data in HDFS during an upgrade, with support for downgrade and rollback
- Work-preserving restarts of the YARN ResourceManager and NodeManagers
- Keeping running applications tied to the library versions they were submitted with
- Rolling upgrades of the Hive daemons, ZooKeeper, Oozie, and the other stack components
- Orchestration of the end-to-end upgrade, with validation at each step
We now briefly summarize our efforts in some of these key areas of the HDP stack.
In order to support rolling upgrades, a key requirement is to have multiple versions of the Hadoop stack installed side by side on each machine as the cluster is rolled from one version to the next. We have implemented a unique solution where we leverage standard package management approaches (RPMs, Debian packages, etc.) and add the version number to the package name. This has two advantages. First, system administrators can continue to use and rely on their existing tooling and best practices – an advantage over proprietary packaging formats. Second, in addition to allowing automated install and upgrade via Apache Ambari, our solution allows customers who prefer their own installation scripts and mechanisms to perform rolling upgrades themselves.
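As an illustration, on an HDP node the versioned packages and the `hdp-select` utility (which repoints the active-version symlinks) might be used as follows; the version strings here are purely illustrative:

```shell
# Two versions installed side by side; the version is part of the package name
yum list installed | grep hadoop
#   hadoop_2_2_0_0_2041.x86_64   ...
#   hadoop_2_1_7_0_784.x86_64    ...

# Switch this node's symlinks (e.g. /usr/hdp/current/*) to the new version
hdp-select set all 2.2.0.0-2041
```

Because these are ordinary RPM/Debian packages, the same steps work with an administrator's existing configuration-management tooling.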
The first and foremost issue that needs to be addressed in HDFS is the ability to safeguard the data during a rolling upgrade. HDFS DataNodes, from the earliest days (2007), have had a way of safeguarding the data by hard-linking HDFS data blocks to minimize exposure to potential errors in the newer version of the code. This mechanism has been used religiously for over 8 years for both major and minor non-rolling upgrades. Unfortunately, the existing mechanism does not work for rolling upgrades, since its design was based on all the DataNodes being upgraded together. Safeguarding the data is a serious requirement without which an upgrade is risky. HDFS was therefore enhanced to write markers into its logs to support rollback, and DataNodes now run in a special mode during an upgrade to avoid deleting blocks.
Further, the NameNode has also moved to protobuf-based persistent state so that upgrades and downgrades are much easier. This work was completed last year; we will cover it in detail in an upcoming blog.
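In practice, the administrator drives these safeguards through the `hdfs dfsadmin -rollingUpgrade` command, which creates the rollback marker before the upgrade begins and finalizes it once the new version is accepted. A typical sequence looks like this (the DataNode host/port is a placeholder):

```shell
# Create the rollback image/marker before starting the rolling upgrade
hdfs dfsadmin -rollingUpgrade prepare

# Check whether the rollback image is ready
hdfs dfsadmin -rollingUpgrade query

# ... upgrade NameNodes, then DataNodes, in a rolling fashion ...
# A DataNode can be asked to shut down quickly for its upgrade:
hdfs dfsadmin -shutdownDatanode <DATANODE_HOST:IPC_PORT> upgrade

# Once satisfied with the new version, finalize; block deletions resume
hdfs dfsadmin -rollingUpgrade finalize
```

Until `finalize` is run, the cluster can still be downgraded or rolled back to the previous version.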
Support for end-to-end rolling upgrades/downgrades of a Hadoop YARN cluster involved progress on several different fronts as described below.
The ResourceManager (RM) is the central authority in YARN for resource management and scheduling. Before HDP 2.1, the RM was a potential single point of failure in a YARN cluster. We addressed this limitation in multiple phases.
In Phase I (preserve application queues), as part of HDP 2.1, we enhanced YARN so that the RM can preserve application-queue state in a pluggable persistent state-store and reload that state automatically upon restart. After this phase, users were no longer required to resubmit their applications after failure events. This feature was the first step towards high availability for the YARN ResourceManager.
In Phase II (preserve work of running applications), delivered now as part of HDP 2.2 and building on the groundwork of Phase I, the RM can restart without any loss of running work! Applications no longer need to be restarted; they simply re-sync with the newly started RM. This can further be combined with the ResourceManager failover functionality that we shipped as part of HDP 2.1.
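For reference, RM recovery of this kind is driven by a handful of YARN settings. A minimal `yarn-site.xml` sketch, using the ZooKeeper-backed state-store (the quorum addresses are placeholders), looks like:

```xml
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Pluggable state-store; here the ZooKeeper-based implementation -->
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```

The ZooKeeper-based store is also what enables combining restart with RM failover, since both RMs can read the same state.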
Before HDP 2.2, when YARN identified that a NodeManager (NM) had died, all running containers (including ApplicationMasters!) of all applications on that machine were deemed dead by the ResourceManager. While this scheme works, forceful termination of work on each node as it is upgraded in a rolling fashion is too disruptive and likely to break workload Service Level Agreements. Work-preserving NodeManager restart in HDP 2.2 addresses this limitation and is a major enabler of rolling upgrades – individual machines in a cluster can be upgraded without losing any work and without any visible impact on the running applications.
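Work-preserving NM restart is likewise enabled through configuration; a sketch of the relevant `yarn-site.xml` properties (the recovery directory path is illustrative):

```xml
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Local directory where the NM persists container state across restarts -->
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/log/hadoop-yarn/nodemanager/recovery-state</value>
</property>
```

Note that recovery also requires the NM to listen on a fixed port (`yarn.nodemanager.address` must not use an ephemeral port), so that restarted containers can find it again.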
The picture above shows how work is retained by the work-preserving NodeManager restart feature.
In addition to making the platform daemons resilient across restarts, YARN in HDP 2.2 also comes with other infrastructure changes to enable upgrading clusters in a rolling fashion: versioning of the ResourceManager and NodeManager state-stores, compatible evolution of security-related tokens over time, wire-compatibility between mixed versions of the RM, etc.
A challenging problem during an upgrade is that running applications MUST retain the context of the version from which they were submitted, rather than the currently running version of YARN. Before HDP 2.2, when the upgrade model was shutdown-and-upgrade, applications (like Hadoop MapReduce and Tez) depended upon a consistent local installation of libraries on all the cluster machines (client and server machines); this meant that the context of the client and the context of the YARN cluster were identical. When a rolling upgrade or downgrade starts in a cluster, the versions of the locally installed libraries will change while an application is in progress, resulting in inconsistent behavior and possibly runtime errors.
To solve this problem, in HDP 2.2 we no longer depend on local installation but instead use HDFS to store multiple versions of the libraries in use. With the magic of YARN's distributed cache, all application containers can now consistently execute using the version of the libraries that matches the context of the application at submission time, regardless of which platform version any given node is running. Once the platform upgrade succeeds and stabilizes, administrators can, at their own pace, explicitly upgrade all the frameworks in the cluster along with their clients. An added benefit of this approach is that different users can run their applications using different versions of the libraries and upgrade at will. This is depicted in the following picture, taking a MapReduce job as an example:
HDP 2.2 comes with a pre-built package (a tarball) containing all the jars required by MapReduce and Tez applications.
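A cluster is pointed at a versioned copy of that tarball in HDFS through MapReduce configuration; a sketch follows, with a path illustrative of the HDP layout:

```xml
<property>
  <!-- Versioned MR framework archive in HDFS; the #mr-framework suffix
       names the link the distributed cache creates in each container -->
  <name>mapreduce.application.framework.path</name>
  <value>/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value>
</property>
```

Since the archive is resolved at job-submission time, every container of a job runs against the framework version the job was submitted with, even while nodes are being upgraded underneath it.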
Besides the Hive CLI, whose upgrade is relatively straightforward, Hive runs three daemons: HiveServer2, the Metastore Server, and the WebHCat Server.
For both Metastore and WebHCat, fast restarts and client side retries are used to support rolling upgrades between different Hive versions.
A key problem with HiveServer2 is that it is stateful, and a simple restart to the new version loses in-progress queries. We enhanced HiveServer2 to allow multiple instances, where an older instance continues running until its current queries are drained while a newer instance accepts future queries. We also solved the problem of a JDBC client dynamically discovering and using newer HiveServer2 endpoints. Previously, while opening a new connection, a JDBC client had to point to a specific host and port of a HiveServer2 process – a design not suited to rolling upgrades (or HA). To address this, we implemented dynamic service discovery for HiveServer2 as part of HIVE-8376. We rely on ZooKeeper to maintain the service information for each HiveServer2 instance. The JDBC driver picks a HiveServer2 instance from ZooKeeper by reading its host and port information from a randomly selected znode under the HiveServer2 namespace. The client then uses this particular instance for the entire session duration. During an upgrade, the znode entries of instances running the old version are removed so that they do not receive new work. This enhancement has a side benefit beyond rolling upgrades: it allows load balancing and also enables high availability.
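With this in place, a client no longer names a specific HiveServer2 host; it points at the ZooKeeper quorum instead. For example, connecting with Beeline (the quorum hosts are placeholders):

```shell
beeline -u "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
```

The driver resolves an actual HiveServer2 host and port from a znode under `hiveserver2` and sticks with that instance for the life of the session.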
Apache ZooKeeper (ZK) is a fundamental building block of a Hadoop cluster. Several different projects rely on ZooKeeper for consistent state and coordination. It is already well geared towards rolling upgrades, since it maintains compatibility in its internal and client protocols. Rolling through the ZK instances is fairly straightforward, except that one must make sure the quorum is maintained during the upgrade. The orchestration of the upgrade happens in such a way that the next ZooKeeper server is only marked for upgrade after the previous one has successfully finished its upgrade and come back up. This guarantees that the ZooKeeper quorum stays up, running, and consistent while an upgrade rolls through.
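The orchestration logic just described can be sketched as follows. This is a simplified illustration, not the actual upgrade tooling: `restart` and `is_healthy` are stand-ins for the real operations (restarting a server with the new binaries, and probing it, e.g. with ZooKeeper's `ruok` four-letter command).

```python
def rolling_restart(servers, restart, is_healthy):
    """Upgrade an ensemble one server at a time, keeping the quorum up.

    `restart(s)` upgrades and restarts a single server; `is_healthy(s)`
    reports whether it has rejoined the ensemble. The next server is only
    touched after the previous one is confirmed healthy again.
    """
    for server in servers:
        restart(server)
        if not is_healthy(server):
            raise RuntimeError(f"{server} did not rejoin the ensemble; aborting")

# Simulated run: every restarted server comes back healthy.
upgraded = []
rolling_restart(
    ["zk1", "zk2", "zk3"],
    restart=upgraded.append,              # pretend-upgrade: just record it
    is_healthy=lambda s: s in upgraded,   # pretend health probe
)
print(upgraded)  # ['zk1', 'zk2', 'zk3']
```

Because at most one server is ever down at a time, a 3- or 5-node ensemble keeps its majority throughout.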
We enhanced Apache Oozie to tie the context of the applications submitted (MR/Tez jobs, Hive and Pig queries) to the submission of an Oozie action. On the daemon side, Oozie servers already store their persistent state in their database, so we simply restart all the Oozie servers at once, as Oozie servers are expected to restart quickly.
The Apache HBase and Accumulo projects have always been very amenable to rolling upgrades, apart from some scripting changes that were needed. We also made sure that the rest of the components – Apache Sqoop, Flume, Knox, Ranger, and Falcon – are all well integrated into the end-to-end rolling upgrades story.
The orchestration details of the end-to-end rolling upgrades are briefly outlined below.
We order an upgrade from 2.2 to the next release roughly as follows: first the ZooKeeper quorum is rolled one server at a time; next the core master daemons (such as the HDFS NameNodes and YARN ResourceManagers) are upgraded, followed by the worker daemons (DataNodes and NodeManagers) machine by machine; then the dependent services such as Hive and Oozie are upgraded; and finally the clients are upgraded.
In future versions, we will consider enabling upgrades one service layer at a time, one node at a time, etc., although our existing mechanisms should already support this.
It is worth noting that a smoke test is run after each significant step. At each of these steps, users are also advised to do their own validation before proceeding to the next phase. If at any point a failure occurs with no way to move forward, one can downgrade or roll back.
The clients are upgraded only after the server components because new server versions are designed to work with old clients. Also, we upgrade all the clients together because of interdependencies of the client libraries; however, in the future, we may revisit this all-at-once-client upgrading scheme.
The rolling upgrades functionality is a key requirement for enterprise Hadoop. Different areas of the Hadoop stack have been enhanced over the last year to support enterprise-quality rolling upgrades. In the coming weeks, we will follow up with a series of posts that dive deeper into each of these areas.