High Availability and Hadoop 1.0 – Perfect Together
In Shaun Connolly’s post about balancing community innovation and enterprise stability, he discussed the importance of an open source project forging ahead with big improvements that are expected to be buggy and functionally incomplete at first but then stabilize over time. In the case of Apache Hadoop 2.0, currently in community Alpha release, the big improvements have been underway for the past three years and include such things as:
- Next-gen MapReduce (aka YARN) that opens up Hadoop’s job processing architecture to other application workloads beyond MapReduce,
- New HDFS pipeline to support append and flush,
- HDFS federation and performance improvements that enable Hadoop to scale to thousands more nodes in a cluster, and
- High availability improvements that address some of the single point of failure issues that are often used as examples of how Hadoop may not be as enterprise-ready as it could be.
In the case of high availability (HA), it can take many months or years to get these types of solutions rock solid. While Hadoop 2.0 contains important HA-related features such as HDFS hot standby, we want to make sure we give it time to complete its community release process and allow extra time after that for bugs to be found and fixed to harden it for broad enterprise production use.
Moreover, implementing HA for Hadoop’s NameNode service, for example, can’t be thought of in a vacuum. It’s important to take a holistic, full stack view of HA: from the underlying server, through the operating system layer, on up through the actual services that require HA, as well as the impact those services may have on any other clients or services that depend on them.
HA is inherently an enterprise “ility” that is focused on minimizing unplanned downtime and IT service disruption. It is therefore critical that full stack high availability be founded on a rock-solid and proven foundation. We at Hortonworks are confident that in the Hadoop world, that stable foundation is Hadoop 1.0.
When discussing HA, we often get the following questions:
- Isn’t all of the HA-related work for Hadoop being done by Hortonworks, Cloudera, and the community members on Hadoop 2.0?
- Do we have to wait for Hadoop 2.0 to become as rock-solid and enterprise stable as Hadoop 1.0 before end users can have a full stack HA solution?
- Isn’t it too hard to do HA for Hadoop 1.0, as a recent article on The Register seems to imply?
We are excited to say that we’ve been hard at work with virtualization and operating system vendors on a solution architecture for full stack high availability that
- builds on the rock-solid Hadoop 1.0 foundation,
- leverages proven HA capabilities within the virtualization and operating system layers, and
- is complementary with the Hadoop 2.0 HA efforts.
As a matter of fact, I discussed this Hadoop 1.0 HA solution architecture in my keynote at Hadoop Summit last week, and below is an illustration of the architecture that was demoed by the Hortonworks product team on the show floor:
The above diagram focuses mostly on HA as it relates to the NameNode and JobTracker services.
As we see it, key requirements for full stack high availability include:
- Flexible solution that not only works for NameNode services but also other Hadoop-related master services such as JobTracker, Hive, Oozie, etc.
- Open solution that integrates with proven, industry-standard HA technologies for a robust approach to service failure detection, IP failover, and fencing that deals with split-brain scenarios; rock-solid fencing is critical to ensure shared data is not corrupted.
- Ability for services (such as JobTracker) to automatically detect the failure/failover of services they are dependent on (such as NameNode) and to have the ability to pause, retry and recover.
- Ability to configure clients to automatically pause and retry during failover of services they are dependent on; this should be flexible enough to support batch-oriented clients that can afford to block and wait and interactive clients (e.g., HBase clients) that may want to handle failure conditions immediately.
- Failover times that range from 15 seconds for smaller clusters of 50-60 nodes on up to a couple of minutes for clusters with hundreds of nodes and beyond; in each of these scenarios, client-side access should be insulated from service failures and be able to deal with failover as transparently as possible.
- Easy manual failover for planned downtime such as system or software upgrades where failover needs to be coordinated as part of a larger process.
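To make the client-side pause-and-retry requirement concrete, here is a minimal sketch in Python. It is an illustration only, not Hadoop or HDP code: the names `ServiceUnavailable` and `call_with_failover_retry`, and the `max_wait_seconds` and `pause_seconds` parameters, are hypothetical. The idea is that a single setting lets a batch client block through a failover while an interactive client fails fast.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a master service cannot be reached."""


def call_with_failover_retry(operation, max_wait_seconds, pause_seconds=1.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Run operation(), pausing and retrying while the service fails over.

    max_wait_seconds=0 gives fail-fast behaviour for interactive clients;
    a larger value lets batch clients block until failover completes.
    The sleep and clock arguments exist only so the sketch can be tested
    without real waiting.
    """
    deadline = clock() + max_wait_seconds
    while True:
        try:
            return operation()
        except ServiceUnavailable:
            # Give up if another pause would take us past the deadline.
            if clock() + pause_seconds > deadline:
                raise
            sleep(pause_seconds)
```

Under these assumptions, a batch job might call `call_with_failover_retry(op, max_wait_seconds=180, pause_seconds=5)` to ride out a failover of a couple of minutes, while an interactive client might pass `max_wait_seconds=0` so the error surfaces immediately and it can handle the failure itself.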
At Hadoop Summit, we announced the jointly developed Hortonworks Data Platform High Availability (HA) Kit for VMware vSphere customers that enables full stack high availability for Hadoop 1.0 by eliminating the NameNode and JobTracker single points of failure. It is a flexible virtual machine-based high availability solution that integrates with the VMware vSphere™ platform’s HA functionality to monitor and automate failover for NameNode and JobTracker master services running within the Hortonworks Data Platform (HDP).
VMware customers can use their existing vSphere installations to deploy HA NameNode and JobTracker nodes as virtual machines in their HDP production cluster. Doing so provides the added benefits of automated restart of virtual machines in the event of server or OS failures and smart resource management that confirms sufficient resources are available to restart virtual machines on different servers in the event of server failure. For more information on the Hortonworks Data Platform High Availability (HA) Kit for VMware vSphere customers, register your interest in the HDP HA Kit for vSphere and work with our product team on trying out the new HA capabilities. Your input on this key feature area is important, so please sign up!
You should view our full stack high availability efforts with VMware as just a start. We are continuing our efforts not only to round out the VMware solution but also to introduce robust full stack HA solution architectures with other partners (stay tuned!). We are also firmly committed to continuing the Hadoop 2.0 HA work that we started and will roll that out widely when it stabilizes and is ready for broader enterprise use.
If you want to learn more about Hortonworks Data Platform, join us on June 26 (10am Pacific/1pm Eastern) for the live webinar “Apache Hadoop Just Got Simpler,” as we outline and demo the key features of the Hortonworks Data Platform (HDP).