Category Archives: High Availability


Proper Care and Feeding of Drives in a Hadoop Cluster: A Conversation with StackIQ’s Dr. Bruno

In a recent blog post, Hortonworks’ Steve Loughran discussed Apache Hadoop’s preference for JBOD-configured storage vs. the allure of RAID-0. As more enterprises are beginning to move beyond the science experiment stage and begin deploying Hadoop into their production environments, they are learning that Hadoop is quite different than other services in their data centers, such as web, mail, and database servers.They are learning that to achieve optimal performance, you need to pay particular attention to configuring the underlying hardware.

To find out more, we had a chat with Dr. Greg Bruno, VP of Engineering, and co-founder of StackIQ, a Hortonworks partner, about the real life implications of managing hard drives (HDDs) in a modern Hadoop cluster.

Q. Why isn’t it considered good practice to configure drives in Hadoop clusters as RAID-0 disk arrays?

A. Hadoop prefers a set of separate disks to the same set managed as a RAID-0 disk array. Read speeds are particularly important to the performance of a Hadoop cluster, and in his post, Steve makes the point that since drive speeds vary, and RAID-0 reads occur at the speed of the slowest disk in the array, a RAID-0 configuration may well be slower than a non-RAID configuration. The bigger issue, in my opinion, is reliability. If a set of disks is configured as a RAID-0 array, then one disk failure in that array will take that entire volume down, and if all the disks in a node are configured as a single RAID-0 array, then a single disk failure will take all the node’s data down. By configuring multiple disks in a RAID-0 array, you magnify the probability of that volume going offline due to a single disk failure and you maximize the amount of data that goes offline when that single failure occurs.

Q: Modern servers have a lot of disks. What’s the impact of losing a single disk when you have 12 3TB drive in each node?

A:  When a single drive fails when Hadoop is configured in its default state, the ENTIRE NODE gets taken offline. Back when servers typically had 6 x 1.5TB drives in them, losing a single disk would cause the loss of 0.02% of total storage in a typical 10PB, three-replica setup. With today’s hardware — typically 12 x 3TB drives per node, losing a single disk results in the loss of five times as much data.

Q: Aren’t today’s HDDs much more reliable than they used to be? Is it worth the extra work to handle the rare cases when a drive fails?

A: While drives are much more reliable than they used to be, they are still the cause of the lion’s share of support tickets in a Hadoop cluster. In fact, according to Bharath Mundlapudi, a Core Hadoop Engineer while working at Yahoo, disk drive failures account for fully 50% of siteops trouble tickets. That’s more than three times the next highest source of tickets.

Q: What does that represent in real terms?

A: It represents a lot of work for systems administrators. How much depends on the size and age of the cluster in question. For example, Facebook, which has some very large clusters, reports that their failure detection and automated repair system is doing the work of approximately 200 full time system administrators.

Q: OK, but not many organizations have clusters that large. What about a typical enterprise setup?

A: Our experience indicates that a 1,000 node cluster containing 12,000 drives for a total raw storage capacity of 48 peta-bytes can expect about 3 drive failures a day in its third year of operation. Drive failure rates rise as the devices age. For a 500 node cluster, you’re looking at a drive failure every 17 hours or so.

Q: Doesn’t this make it hard for the cluster operator to manage? How do they keep up?

A: Without the right tools and methodology, it is very difficult for cluster operators to manage clusters at scale. They typically have to write scripts to scan the cluster, detect disk failures, and report them. Then, once the offending drive has been replaced, commands must be run for the controller to recognize the new drive, OS commands need to be executed to format the drive, and then some Hadoop commands are required to add the disk back to the configuration.

Q: Presumably it’s not quite as challenging for StackIQ customers?

A: StackIQ’s mission is to make cluster operation as painless as possible, which is why we have developed tools to manage the entire lifecycle of the disk. While we haven’t figured out how to get our software to physically pull a bad drive and replace it with a new one, we automate the rest of it — from the initial deployment of the drive, detecting and reporting the error, and re-integrating the replacement drive into the configuration.

One of the features we’ve developed in StackIQ’s management software automatically configures chassis with LSI MegaRaid controllers into “JBODs”, that is, every disk in the chassis will be configured as an individual device.

In addition, a user can specify which disk they want in the chassis to be the boot disk via an attribute (e.g., “bootdisk0″) and if an optional secondary boot disk attribute is specified (“bootdisk1″), then our code will configure both those disks as a “mirror” (RAID1) while still making all the other non-boot disks available to Hadoop as individual disks.  A recent StackIQ customer made their purchasing decision on this feature alone, as they recently went through the painful exercise of changing a mid-size cluster’s RAID configuration by booting each server, one-by-one, catching a key press at the controller prompt, and fixing the configuration by-hand.  Not a fun exercise when you are under the gun by management to get production cluster online.

Q: With that many drive failures, clusters will be chewing through disks at a brisk rate. That could get expensive. That works out to something like 1000 drives/year X $100/drive = $100k per year just for replacement drives.

A: True, which speaks to the need for software which will make the most efficient use of your resources –  intelligent, automated cluster management software can find faulty drives automatically, and bring up a replacement drive quickly.

Q: Doesn’t automation take control out of the hands of the skilled cluster operators?

A: We believe it should be up to the cluster operator to set policies on how much automation to incorporate into their workflows. Our software reflects that philosophy, letting operators choose from a range of policies that go all the way from having the operator run all the commands manually, all the way to a fully automated repair where all the operator needs to do is push in the new drive and let StackIQ’s software do the rest.

Q: Can’t this be done with a simple command script that runs on all nodes?

A: That might be workable in a homogeneous environment, where all the nodes are the same. But in the real world, different nodes require different configurations. Even the disks are likely configured differently in nodes within the clusters. Handling those variables in a static script would be very difficult to accomplish. For example, if your cluster expands over time, you may be adding chassis with different drive configurations. Static scripts wouldn’t be able to deal with this situation. The StackIQ management software has intimate knowledge of the hardware and software in the cluster, so it knows exactly how to handle each drive in each node across the entire cluster, even in a heterogeneous environment.

Conclusion

So there you have it. The folks behind StackIQ cluster management software agree with Steve Loughran’s recommendation to forego RAID-0 for Hadoop clusters. In fact, they provide the management tools to make it easier to do. So take the advice of our experts, and configure your cluster servers as “Just a Bunch of Disks.”

For more information on StackIQ, please visit their website or follow their Twitter handle (@StackIQ). You can also follow Dr. Greg Bruno directly on his Twitter handle (@itsDrBruno).

~ Lisa Sensmeier

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover

In this blog, I’ll cover how we tested Full Stack HA with NameNode HA in Hadooop 1 with Hadoop and HBase as components of the stack.

Yes, NameNode HA is finally available in the Hadoop 1 line. The test was done with Hadoop branch-1 and HBase-0.92.x on a cluster of roughly ten nodes. The aim was to try to keep a really busy HBase cluster up in the face of the cluster’s NameNode repeatedly going up and down. Note that, HBase would be functional during the time NameNode would be down. It’d only affect those operations that requires a trip to the NameNode (for example, rolling of the WAL, or compaction, or flush), and those would affect only the relevant end users (a user using the HBase get API may not be affected if that get didn’t require a new file open, for example).

HBase was kept busy by running a load test – LoadTestTool (available in 0.92 branch), with a set of arguments (number of reader/writer threads, sizes of rows, etc.) that were selected induced significant pressure on the HBase cluster. In turn, the configuration of HBase was artificially modified so that HBase would make lots of trips to the NameNode for file operations (low flush thresholds, very low major compactions frequency). For the test, the NameNode was repeatedly brought up and down (specifically, a loop of “bring down the namenode, let it remain down for a small period of time, bring up the namenode, let it remain up for another period of time”). This stop-start-pattern had some randomness built into it.

The cluster kept up reasonably well with the above load and the failure mode. But we also saw that we were losing HBase RegionServers somewhat randomly. Upon a close analysis of the logs on the NameNode & RegionServers, what seemed to be the case was that file lengths were not recorded correctly in the edit-logs. This issue turned out to be a known issue, that was addressed in HDFS-1108. The fix was backported to Hadoop-1.0.x line. It should be noted that HA team at Hortonworks had fixed other issues and as is the usual practice for us, these fixes were applied to Apache Hadoop trunk and back ported to Hadoop 1.x line and will also be back-pported to the 2-alpha.

With the above fix in HDFS, the tests were rerun. The cluster remained up without any RegionServer losses for more than 48 hours. No glitches!

Well to be precise, the cluster started behaving weirdly since the datanodes started running out of space since the HBase load generation has successfully filled up the HDFS capacity in spite of repeated NameNode restarts. (I should file some jiras to handle that more gracefully!). While my tests were not using the automated failover of the NameNode node one can now configure the NameNode in Hadoop 1 to automatically failover using industry proven solutions as described in Sanjay’s post; the HBase community can start deploying NameNode HA along with resilience as the Namenode fails over.

Sanjay’s blog gives more details on how to deploy NameNode HA. Please get in touch with me (ddas@hortonworks.com) or Sanjay (sanjay@hortonworks.com) if you need more details on NameNode HA, Full Stack HA with respect to HBase or any part of the above tests.

HA Namenode for HDFS with Hadoop 1.0 – Part 1

Introduction

A Highly Available NameNode for HDFS has been in development since last year. That effort focused singularly on the automatic failover of the NameNode for Hadoop 2.0. During that time we have realized two things.

First, we realized we should use an outside-in approach to the HA problem: start by designing the availability of the Hadoop system as a whole and then focus on the high-availability of individual components; that work lead to the Full Stack HA Architecture.

Second, we realized that we can build an HA NameNode for Hadoop 1.0 using industry proven solutions such as Linux HA and vSphere; this is important because HDFS in Hadoop 1 is been proven to be stable and reliable, while HDFS in Hadoop 2 is just beginning beta testing. This blog describes some technical details of HDFS NameNode HA in Hadoop 1. A future blog will give some more details on Full Stack HA.

The first and foremost question in people’s mind is: What is the difference between HA Hadoop 1 and Hadoop 2? My colleague Suresh and I wrote the original design for Hadoop 2 (see HDFS-1623) and have worked closely with the community on the implementation. HA in Hadoop 1 is the direct result of our experiences during that work.

Hadoop 2 HA focused on three areas:

  • Hot failover: We have found that the difference in failover times are small between cold and hot failover for small to medium clusters; Hadoop 1 uses cold failover.
  • Automatic failover: For Hadoop 1 we have used industry proven HA framework rather than use the Hadoop 2 Failover-Controller. The figure above illustrates NameNode HA in Hadoop 1 using Linux HA.
  • Remove dependency on shared storage: this is still work in progress for Journal Daemons in trunk[6]. Both Hadoop 2 and Hadoop 1 use shared storage.

To summarize, the difference in Hadoop 1 HA is cold failover and the use of industry standard HA frameworks. Lets look at details and implications of these differences below.

Failover Times and Cold versus Hot Failover

The failover time of a high available system with active-passive failover is the sum of (1) time to detect that the active service has failed, (2) time to elect a leader and/or for the leader to make a failover decision and communicate to the other party, and (3) the time to transition the standby service to active.

The first and second items are the same for cold or hot failover: they both rely on heartbeat timeouts, monitoring probe timeouts, etc. We have observed that total combined time for failure detection and leader election to range from 30 seconds to 2.5 minutes depending on the kind of failure; the lowest times are typical when the active server’s host or host operating system fails; hung processes take longer due to the grace period needed to be confident that the process is not blocked during Garbage Collection.

For the third item, the time to transition the standby service to active, Hadoop 1 requires starting a second NameNode and for the NameNode to get out of safe mode. In our experiments we have observed the following times:

  • A 60 node cluster with 6 million blocks using 300TB raw storage, and 100K files: 30 seconds. Hence total failover time ranges from 1-3 minutes.
  • A 200 node cluster with 20 million blocks occupying 1PB raw storage and 1 million files: 110 seconds. Hence total failover time ranges from 2.5 to 4.5 minutes.

Industry Standard HA Frameworks

As we stated in the Hadoop HDFS NameNode HA design document, HDFS-1623, HDFS NameNode HA design document [HDFS-1623], the notion of having a failover controller outside the NameNode was influenced by frameworks like Linux HA[1], Red Hat HA[2] and Veritas Cluster[4,5]. Part of the decision for the Hadoop community to build our own failover controller was made because we felt it was useful for Hadoop to provide an out-of-the-box solution. Linux HA, being GPL, did not allow that.

For Hadoop 1 we decided not to back-port the failover controller from trunk, but instead use an industry standard HA frameworks. Why?

We wanted to add HA to the stable Hadoop line in as risk-free a way possible which lead us to use proven and robust HA frameworks. Many customers already have experience using these HA frameworks. These frameworks deal with monitoring timeout, service startup timeout, shutdown timeout, and have a way to flag and deal with repeated failures. Further, the industry proven frameworks offer several alternative fencing solutions including power-based fencing.

The same HA framework can be used to failover other Hadoop services such as the Job-Tracker; we have have already started the work on an HA Job-Tracker. These HA frameworks also provide the ability to share a common pool of server machines to host highly avaiolable NameNode, JobTracker and other Master daemons; the shared pools allow N-N, N-on-N and N+K failover. Finally they offer a way to perform manual switchover, coordinated shutdown of both and being able to run with one of the NameNodes down.

Using Failover solutions which keep the Namenode IP Address constant ensure that the web interfaces such as WebHDFS or the Hadoop consoles also failover. With IP failover, URLs will follow the service, regardless of where it is running. The use of mature IP failover-based solutions reduced the complexity, making it possible to implement HA on the stable Hadoop 1.0 line with a few, low risk changes to the Hadoop core.

Recently, Symantec independently described how to make the NameNode highly available using Veritas Cluster [5]. The Figure above illustrates the NameNode HA in Hadoop 1 using Linux HA that will be available along with a similar solution using vSphere HA.

FAQ

You are using cold failover and hence it is not practical.
The HDFS 2.0 HA design was driven by the needs of the very large clusters at Yahoo and Facebook. For small to medium clusters cold failover is only 30 to 120 seconds slower, as described above.

If this was so easy to do with Linux HA or other tools why didn’t the HDFS community do this earlier?
This is partly because the original HDFS team focused on very large clusters where cold failover was not practical. We assumed that Hadoop needed to provide its own built-in solution As we’ve developed this technology, we’ve heard directly from our customers that HA solutions are complex and that they prefer using their existing, well understood, solutions.

You have taken a different path from HA in Hadoop 2.
Not true – both are based on the same design principals. The difference is the focus on cold failover instead of hot failover. Further, the work is complimentary. All the work in the Hadoop 1.0 line is also being added to the Hadoop 2.0 line.

I need to use Linux HA, or another framework – isn’t that a hindrance?
Many of Hadoop users already use Linux HA, Red Hat HA, Veritas Cluster, or vSphere HA in their data centers. Linux HA is freely available and the cost of Red Hat’s HA is fairly low.

So is the Full Stack HA only a part of Hadoop 1?
No, Full Stack HA is orthogonal to the failover of a specific component such as the NameNode or the Job Tracker (see this post). Making the rest of the stack robust against transient failures of the layers underneath improves the entire stack. We will cover more detail of Full Stack HA in a up coming blog.

Does vSphere HA deal with service failures in addition to VM failures?
vSphere allows application level health checks and we have added an application-level monitor for the NameNode. A similar JobTracker-specific monitor will be available shortly as well.

When using vSphere HA solution, what is the advantage of hosting the NameNode and JobTracker each in their own vSphere VirtualMachine?
Hosting the master services in isolated vSphere VMs is an effective design. As vSphere monitors and manages each VM independently, vSphere servers can host independent VMs containing the NameNode, JobTracker, and other master services. If one service fails, that VM is killed and restarted – while the other VMs continue uninterrupted. You also gain the ease of maintenance and rollback that VMs offer.

Status

All patches to core hadoop have been committed to Apache Hadoop trunk and branch 1.1; these are also incorporated in Hortonworks HDP 1 and HDP 1.1 releases. We plan to commit these to Hadoop 2-alpha.

Our monitoring code is targeted for inclusion into the still-in-incubation Ambari project; it has already been submitted as a patch [8].

Future Outlook

Work is in progress to stabilize HDFS 2 which is currently entering beta testing and also Hadoop 2‘s HA (Hot Failover, failover controller, etc). We are also in the progress of providing failover for other Hadoop components, such as the JobTracker, in Hadoop 1.

References

  1. Linux HA: http://www.linux-ha.org/wiki/Main_Page
  2. Red Hat High Availability Add-On: http://www.redhat.com/products/enterprise-linux-add-ons/high-availability/
  3. vSphere HA: http://www.vmware.com/products/high-availability/overview.html
  4. Symantics’ Veritas Custer Framework: https://www.symantec.com/cluster-server
  5. NN HA using Symantics’ Veritas Custer Framework: http://www.symantec.com/connect/blogs/symantec-high-availability-solution-hadoop-namenode
  6. Hadoop Journal Daemon: HDFS-3092 and HDFS-3077
  7. HDFS NameNode HA design (for Hadoop 2): HDFS-1623
  8. Monitoring Library: Ambari-504
  9. Full Stack HA: http://hortonworks.com/blog/high-availability-and-hadoop-1-0-perfect-together/

High Availability and Hadoop 1.0 – Perfect Together

In Shaun Connolly’s post about balancing community innovation and enterprise stability, he discussed the importance of an open source project forging ahead with big improvements that are expected to be initially buggy and incomplete functionally but then stabilize over time. In the case of Apache Hadoop 2.0, currently in community Alpha release, the big improvements have been underway for the past 3 years and include such things as:

  1. Next-gen MapReduce (aka YARN) that opens up Hadoop’s job processing architecture to other application workloads beyond MapReduce,
  2. New HDFS pipe-line to support append and flush,
  3. HDFS federation and performance improvements that enable Hadoop to scale to 1000’s more nodes in a cluster, and
  4. High availability improvements that address some of the single point of failure issues that are often used as examples of how Hadoop may not be as enterprise-ready as it could be.

In the case of high availability (HA), it can take many months or years to get these types of solutions rock solid. While Hadoop 2.0 contains important HA-related features such as HDFS hot standby, we want to make sure we give it time to complete its community release process and allow extra time after that for bugs to be found and fixed to harden it for broad enterprise production use.

Read More