With YARN and HDFS at the architectural center, Hadoop has emerged as a key component of any modern data architecture. Today, enterprises use Hadoop to store critical datasets and power many of their most important workloads. With this in mind, the services and data within a Hadoop cluster need to remain highly available in the face of failures and to continue functioning while the cluster is upgraded to the latest software version.
With the Hortonworks Data Platform (HDP) 2.2, we have enhanced the core platform packaging to put in place support for rolling upgrades of the HDP stack while the cluster is actively servicing users. For more details, please see here.
To support rolling upgrades, it must be possible to have multiple versions of the Hadoop stack installed side by side as the cluster is rolled from one version to the next. We have taken a different approach here: rather than invent proprietary packaging, we have leveraged standard package management, using RPM and Debian packages.
This standard package management approach enables system administrators to continue to use and rely on their existing tooling and best practices. Customers have a choice: either use Apache Ambari for automated rolling upgrades, or roll their own solution (using Puppet, Chef, etc.) to manage rolling upgrades.
HDP uses a structured rolling upgrade approach to provide a reliable and efficient upgrade with minimal service disruption. This rolling upgrade approach drives the need for side-by-side deployments.
In the Prepare phase, all steps are taken to prepare the cluster for upgrade, before HDP components are upgraded. The side-by-side install allows a new version of HDP bits to be installed in place before an upgrade is started, reducing the time and potential for failures during the upgrade process. This feature also benefits full-shutdown-and-upgrade approaches since the new bits can be installed prior to the shutdown.
Once the bits are laid down successfully, HDP components can be upgraded in a rolling fashion across the nodes in the cluster. This rolling upgrade phase requires side-by-side installs to function:
Starting with HDP 2.2, a new version of HDP will be deployed alongside the existing online version of HDP in preparation for an upgrade.
To enable this, HDP 2.2 RPM and Debian packages include the HDP version number in the name of the package. This is needed to make each version of an RPM package appear as an altogether different package.
As a result, all HDP artifacts for a given release are deployed under a versioned directory:
All versions of HDP are deployed under the fixed directory path /usr/hdp. For example, the following directory structure shows two HDP versions deployed side by side:
│ ├── /usr/hdp/2.2.0.0-2041/hadoop/bin
│ ├── /usr/hdp/2.2.0.0-2041/hadoop/conf -> /etc/hadoop/conf
│ ├── /usr/hdp/2.2.0.0-2041/hadoop/lib
│ │ ├── /usr/hdp/2.2.0.0-2041/hadoop/lib/native
│ ├── /usr/hdp/2.2.0.0-2041/hadoop/libexec
│ ├── /usr/hdp/2.2.0.0-2041/hadoop/man
│ └── /usr/hdp/2.2.0.0-2041/hadoop/sbin
│ ├── /usr/hdp/2.2.0.0-2041/hadoop-hdfs/bin
│ ├── /usr/hdp/2.2.0.0-2041/hadoop-hdfs/lib
│ ├── /usr/hdp/2.2.0.0-2041/hadoop-hdfs/sbin
│ └── /usr/hdp/2.2.0.0-2041/hadoop-hdfs/webapps
│ ├── /usr/hdp/2.2.0.0-2041/hbase/bin
│ ├── /usr/hdp/2.2.0.0-2041/hbase/conf -> /etc/hbase/conf
│ ├── /usr/hdp/2.2.0.0-2041/hbase/doc
│ ├── /usr/hdp/2.2.0.0-2041/hbase/include
│ ├── /usr/hdp/2.2.0.0-2041/hbase/lib
├── /usr/hdp/2.2.0.0-2041/zookeeper/conf -> /etc/zookeeper/conf
│ ├── /usr/hdp/2.2.1.0-2611/hadoop/bin
│ ├── /usr/hdp/2.2.1.0-2611/hadoop/conf -> /etc/hadoop/conf
│ ├── /usr/hdp/2.2.1.0-2611/hadoop/lib
│ │ ├── /usr/hdp/2.2.1.0-2611/hadoop/lib/native
│ ├── /usr/hdp/2.2.1.0-2611/hadoop/libexec
│ ├── /usr/hdp/2.2.1.0-2611/hadoop/man
│ └── /usr/hdp/2.2.1.0-2611/hadoop/sbin
│ ├── /usr/hdp/2.2.1.0-2611/hadoop-hdfs/bin
│ ├── /usr/hdp/2.2.1.0-2611/hadoop-hdfs/lib
│ ├── /usr/hdp/2.2.1.0-2611/hadoop-hdfs/sbin
│ └── /usr/hdp/2.2.1.0-2611/hadoop-hdfs/webapps
│ ├── /usr/hdp/2.2.1.0-2611/hbase/bin
│ ├── /usr/hdp/2.2.1.0-2611/hbase/conf -> /etc/hbase/conf
│ ├── /usr/hdp/2.2.1.0-2611/hbase/doc
│ ├── /usr/hdp/2.2.1.0-2611/hbase/include
│ ├── /usr/hdp/2.2.1.0-2611/hbase/lib
├── /usr/hdp/2.2.1.0-2611/zookeeper/conf -> /etc/zookeeper/conf
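As a minimal sketch, the side-by-side layout can be reproduced in a scratch directory. The version strings used here are illustrative stand-ins, not a statement about which builds a real cluster would carry:

```shell
#!/bin/sh
# Sketch: recreate a miniature /usr/hdp side-by-side layout in a sandbox.
# Versions 2.2.0.0-2041 and 2.2.1.0-2611 are illustrative stand-ins.
set -e
ROOT="$(mktemp -d)"
for v in 2.2.0.0-2041 2.2.1.0-2611; do
  mkdir -p "$ROOT/usr/hdp/$v/hadoop/bin" \
           "$ROOT/usr/hdp/$v/hadoop-hdfs/bin" \
           "$ROOT/usr/hdp/$v/hbase/bin"
done
# Both versions now coexist under the same fixed root:
ls "$ROOT/usr/hdp"
```

Because each release lives under its own versioned directory, installing a new version never overwrites the files the running services are using.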
With this layout, the HDFS DataNode can be upgraded before the HBase RegionServer. The HBase RegionServer can then continue to run against the older versions of Hadoop and the other component libraries it depends on.
While multiple HDP versions are deployed on the cluster, each HDP component service on a specific node can have its own active version at any given point in time. For example, on a given node, the HDFS DataNode component service and the NameNode component service can each have a different active version.
To manage this capability, HDP uses symlinks to point to the active version for each HDP component service.
For example, the Hadoop DataNode service and the Hadoop NameNode service each have a symlink that points to their current version. During an upgrade, this allows the Hadoop NameNode to be on the newer version while the Hadoop DataNode remains on the older version.
Clients and component services each have their own symlink, enabling active jobs that were scheduled with the old client to continue running against the old client version even while the component services are being upgraded to the new version.
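The per-component symlinks described above can be sketched in a sandbox. This mimics, under assumed paths and illustrative version strings, the kind of links that hdp-select maintains on a real node:

```shell
#!/bin/sh
# Sketch: per-component "current" symlinks during a partial upgrade.
# Paths are sandboxed and versions illustrative; on a real node these
# links live under /usr/hdp/current and are managed by hdp-select.
set -e
ROOT="$(mktemp -d)"
OLD=2.2.0.0-2041
NEW=2.2.1.0-2611
mkdir -p "$ROOT/usr/hdp/$OLD" "$ROOT/usr/hdp/$NEW" "$ROOT/usr/hdp/current"
# The DataNode has already moved to the new version...
ln -sfn "$ROOT/usr/hdp/$NEW" "$ROOT/usr/hdp/current/hadoop-hdfs-datanode"
# ...while clients (and jobs scheduled with them) stay on the old one.
ln -sfn "$ROOT/usr/hdp/$OLD" "$ROOT/usr/hdp/current/hadoop-client"
readlink "$ROOT/usr/hdp/current/hadoop-hdfs-datanode"
readlink "$ROOT/usr/hdp/current/hadoop-client"
```

Flipping one service's symlink (and restarting that service) leaves every other component's active version untouched, which is what makes the node-by-node roll possible.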
For example, to upgrade the DataNode on a single machine to the latest version:
> Stop DataNode
> # Set the active version to the newer version
> hdp-select set hadoop-hdfs-datanode 2.2.1.0-2600
> Start DataNode
With HDP, we are committed to supporting the same repository management, install tooling, and execution scripts that system administrators already use to operate and manage Hadoop.
Since the packages are standard RPM and Debian packages, `yum` and `apt-get` can be used to deploy each HDP component package.
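As a small sketch of the versioned-name idea, one can derive the package name for a given build from its version string. The exact mapping shown here (dots and dashes becoming underscores) is an assumption for illustration, not the authoritative naming convention:

```shell
#!/bin/sh
# Sketch: HDP 2.2 embeds the version in the package name so that two
# builds can be installed at once. The underscore mapping below is an
# assumed convention for illustration only.
set -e
version="2.2.0.0-2041"                     # illustrative build string
suffix="$(printf '%s' "$version" | tr '.-' '__')"
pkg="hadoop_${suffix}"
echo "$pkg"                                # a per-version package name
# One could then install it with the usual tooling, e.g.:
#   yum install "$pkg"        (RPM-based systems)
#   apt-get install "$pkg"    (Debian-based systems)
```

Because each build resolves to a distinct package name, the package manager treats the old and new versions as unrelated packages and happily keeps both installed.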
HDP maintains the existing binary locations that execution scripts depend on. For example, /usr/bin/hadoop is maintained as a symlink and points to the active version’s Hadoop binary.
/usr/bin/hadoop -> <active version>
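The stable `/usr/bin/hadoop` entry point can be sketched as a two-level symlink chain in a sandbox. The intermediate `hadoop-client` link and the tiny stub script are illustrative; on a real node the links are laid down by the HDP packages and hdp-select:

```shell
#!/bin/sh
# Sketch: /usr/bin/hadoop stays at a fixed path but resolves through a
# "current" link to the active version's binary. Sandbox paths; the
# stub script below stands in for the real hadoop launcher.
set -e
ROOT="$(mktemp -d)"
V=2.2.0.0-2041   # illustrative active version
mkdir -p "$ROOT/usr/bin" "$ROOT/usr/hdp/$V/hadoop/bin" "$ROOT/usr/hdp/current"
# Stand-in for the versioned hadoop binary:
printf '#!/bin/sh\necho hadoop %s\n' "$V" > "$ROOT/usr/hdp/$V/hadoop/bin/hadoop"
chmod +x "$ROOT/usr/hdp/$V/hadoop/bin/hadoop"
# current/hadoop-client -> the active version's install directory
ln -sfn "$ROOT/usr/hdp/$V/hadoop" "$ROOT/usr/hdp/current/hadoop-client"
# /usr/bin/hadoop -> the binary via the "current" indirection
ln -sfn "$ROOT/usr/hdp/current/hadoop-client/bin/hadoop" "$ROOT/usr/bin/hadoop"
"$ROOT/usr/bin/hadoop"   # resolves to the active version's binary
```

Repointing only the intermediate `current` link switches every caller of `/usr/bin/hadoop` to the new version without touching any scripts.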
Let’s look at Apache Hadoop as an example.
Hadoop component libraries are no longer found in “/usr/lib/hadoop/”. Instead, each Hadoop component’s libraries are referenced through the corresponding directory under /usr/hdp/current/, such as /usr/hdp/current/hadoop-client/lib, and so on for each Hadoop component.
For example, you will find the MapReduce examples jar under the corresponding /usr/hdp/current/hadoop-mapreduce-client directory.
Configuration files can be placed in /etc/hadoop/conf as before.
/usr/bin/hadoop -> /usr/hdp/current/hadoop-client/bin/hadoop
Are you thinking of upgrading your HDP cluster? Try rolling upgrades. The enhanced packaging sets the stage for rolling upgrades of the entire HDP stack while maintaining support for the package management tooling that enterprise system administrators rely on.