A Highly Available NameNode for HDFS has been in development since last year. That effort focused singularly on the automatic failover of the NameNode for Hadoop 2.0. During that time we have realized two things.
First, we realized we should use an outside-in approach to the HA problem: start by designing the availability of the Hadoop system as a whole and then focus on the high-availability of individual components; that work lead to the Full Stack HA Architecture.
Second, we realized that we can build an HA NameNode for Hadoop 1.0 using industry proven solutions such as Linux HA and vSphere; this is important because HDFS in Hadoop 1 is been proven to be stable and reliable, while HDFS in Hadoop 2 is just beginning beta testing. This blog describes some technical details of HDFS NameNode HA in Hadoop 1. A future blog will give some more details on Full Stack HA.
The first and foremost question in people’s mind is: What is the difference between HA Hadoop 1 and Hadoop 2? My colleague Suresh and I wrote the original design for Hadoop 2 (see HDFS-1623) and have worked closely with the community on the implementation. HA in Hadoop 1 is the direct result of our experiences during that work.
Hadoop 2 HA focused on three areas:
- Hot failover: We have found that the difference in failover times are small between cold and hot failover for small to medium clusters; Hadoop 1 uses cold failover.
- Automatic failover: For Hadoop 1 we have used industry proven HA framework rather than use the Hadoop 2 Failover-Controller. The figure above illustrates NameNode HA in Hadoop 1 using Linux HA.
- Remove dependency on shared storage: this is still work in progress for Journal Daemons in trunk. Both Hadoop 2 and Hadoop 1 use shared storage.
To summarize, the difference in Hadoop 1 HA is cold failover and the use of industry standard HA frameworks. Lets look at details and implications of these differences below.
Failover Times and Cold versus Hot Failover
The failover time of a high available system with active-passive failover is the sum of (1) time to detect that the active service has failed, (2) time to elect a leader and/or for the leader to make a failover decision and communicate to the other party, and (3) the time to transition the standby service to active.
The first and second items are the same for cold or hot failover: they both rely on heartbeat timeouts, monitoring probe timeouts, etc. We have observed that total combined time for failure detection and leader election to range from 30 seconds to 2.5 minutes depending on the kind of failure; the lowest times are typical when the active server’s host or host operating system fails; hung processes take longer due to the grace period needed to be confident that the process is not blocked during Garbage Collection.
For the third item, the time to transition the standby service to active, Hadoop 1 requires starting a second NameNode and for the NameNode to get out of safe mode. In our experiments we have observed the following times:
- A 60 node cluster with 6 million blocks using 300TB raw storage, and 100K files: 30 seconds. Hence total failover time ranges from 1-3 minutes.
- A 200 node cluster with 20 million blocks occupying 1PB raw storage and 1 million files: 110 seconds. Hence total failover time ranges from 2.5 to 4.5 minutes.
Industry Standard HA Frameworks
As we stated in the Hadoop HDFS NameNode HA design document, HDFS-1623, HDFS NameNode HA design document [HDFS-1623], the notion of having a failover controller outside the NameNode was influenced by frameworks like Linux HA, Red Hat HA and Veritas Cluster[4,5]. Part of the decision for the Hadoop community to build our own failover controller was made because we felt it was useful for Hadoop to provide an out-of-the-box solution. Linux HA, being GPL, did not allow that.
For Hadoop 1 we decided not to back-port the failover controller from trunk, but instead use an industry standard HA frameworks. Why?
We wanted to add HA to the stable Hadoop line in as risk-free a way possible which lead us to use proven and robust HA frameworks. Many customers already have experience using these HA frameworks. These frameworks deal with monitoring timeout, service startup timeout, shutdown timeout, and have a way to flag and deal with repeated failures. Further, the industry proven frameworks offer several alternative fencing solutions including power-based fencing.
The same HA framework can be used to failover other Hadoop services such as the Job-Tracker; we have have already started the work on an HA Job-Tracker. These HA frameworks also provide the ability to share a common pool of server machines to host highly avaiolable NameNode, JobTracker and other Master daemons; the shared pools allow N-N, N-on-N and N+K failover. Finally they offer a way to perform manual switchover, coordinated shutdown of both and being able to run with one of the NameNodes down.
Using Failover solutions which keep the Namenode IP Address constant ensure that the web interfaces such as WebHDFS or the Hadoop consoles also failover. With IP failover, URLs will follow the service, regardless of where it is running. The use of mature IP failover-based solutions reduced the complexity, making it possible to implement HA on the stable Hadoop 1.0 line with a few, low risk changes to the Hadoop core.
Recently, Symantec independently described how to make the NameNode highly available using Veritas Cluster . The Figure above illustrates the NameNode HA in Hadoop 1 using Linux HA that will be available along with a similar solution using vSphere HA.
You are using cold failover and hence it is not practical.
The HDFS 2.0 HA design was driven by the needs of the very large clusters at Yahoo and Facebook. For small to medium clusters cold failover is only 30 to 120 seconds slower, as described above.
If this was so easy to do with Linux HA or other tools why didn’t the HDFS community do this earlier?
This is partly because the original HDFS team focused on very large clusters where cold failover was not practical. We assumed that Hadoop needed to provide its own built-in solution As we’ve developed this technology, we’ve heard directly from our customers that HA solutions are complex and that they prefer using their existing, well understood, solutions.
You have taken a different path from HA in Hadoop 2.
Not true – both are based on the same design principals. The difference is the focus on cold failover instead of hot failover. Further, the work is complimentary. All the work in the Hadoop 1.0 line is also being added to the Hadoop 2.0 line.
I need to use Linux HA, or another framework – isn’t that a hindrance?
Many of Hadoop users already use Linux HA, Red Hat HA, Veritas Cluster, or vSphere HA in their data centers. Linux HA is freely available and the cost of Red Hat’s HA is fairly low.
So is the Full Stack HA only a part of Hadoop 1?
No, Full Stack HA is orthogonal to the failover of a specific component such as the NameNode or the Job Tracker (see this post). Making the rest of the stack robust against transient failures of the layers underneath improves the entire stack. We will cover more detail of Full Stack HA in a up coming blog.
Does vSphere HA deal with service failures in addition to VM failures?
vSphere allows application level health checks and we have added an application-level monitor for the NameNode. A similar JobTracker-specific monitor will be available shortly as well.
When using vSphere HA solution, what is the advantage of hosting the NameNode and JobTracker each in their own vSphere VirtualMachine?
Hosting the master services in isolated vSphere VMs is an effective design. As vSphere monitors and manages each VM independently, vSphere servers can host independent VMs containing the NameNode, JobTracker, and other master services. If one service fails, that VM is killed and restarted – while the other VMs continue uninterrupted. You also gain the ease of maintenance and rollback that VMs offer.
All patches to core hadoop have been committed to Apache Hadoop trunk and branch 1.1; these are also incorporated in Hortonworks HDP 1 and HDP 1.1 releases. We plan to commit these to Hadoop 2-alpha.
Our monitoring code is targeted for inclusion into the still-in-incubation Ambari project; it has already been submitted as a patch .
Work is in progress to stabilize HDFS 2 which is currently entering beta testing and also Hadoop 2‘s HA (Hot Failover, failover controller, etc). We are also in the progress of providing failover for other Hadoop components, such as the JobTracker, in Hadoop 1.
- Linux HA: http://www.linux-ha.org/wiki/Main_Page
- Red Hat High Availability Add-On: http://www.redhat.com/products/enterprise-linux-add-ons/high-availability/
- vSphere HA: http://www.vmware.com/products/high-availability/overview.html
- Symantics’ Veritas Custer Framework: https://www.symantec.com/cluster-server
- NN HA using Symantics’ Veritas Custer Framework: http://www.symantec.com/connect/blogs/symantec-high-availability-solution-hadoop-namenode
- Hadoop Journal Daemon: HDFS-3092 and HDFS-3077
- HDFS NameNode HA design (for Hadoop 2): HDFS-1623
- Monitoring Library: Ambari-504
- Full Stack HA: http://hortonworks.com/blog/high-availability-and-hadoop-1-0-perfect-together/