Proper Care and Feeding of Drives in a Hadoop Cluster: A Conversation with StackIQ’s Dr. Bruno
In a recent blog post, Hortonworks’ Steve Loughran discussed Apache Hadoop’s preference for JBOD-configured storage vs. the allure of RAID-0. As more enterprises are beginning to move beyond the science experiment stage and begin deploying Hadoop into their production environments, they are learning that Hadoop is quite different than other services in their data centers, such as web, mail, and database servers.They are learning that to achieve optimal performance, you need to pay particular attention to configuring the underlying hardware.
To find out more, we had a chat with Dr. Greg Bruno, VP of Engineering, and co-founder of StackIQ, a Hortonworks partner, about the real life implications of managing hard drives (HDDs) in a modern Hadoop cluster.
Q. Why isn’t it considered good practice to configure drives in Hadoop clusters as RAID-0 disk arrays?
A. Hadoop prefers a set of separate disks to the same set managed as a RAID-0 disk array. Read speeds are particularly important to the performance of a Hadoop cluster, and in his post, Steve makes the point that since drive speeds vary, and RAID-0 reads occur at the speed of the slowest disk in the array, a RAID-0 configuration may well be slower than a non-RAID configuration. The bigger issue, in my opinion, is reliability. If a set of disks is configured as a RAID-0 array, then one disk failure in that array will take that entire volume down, and if all the disks in a node are configured as a single RAID-0 array, then a single disk failure will take all the node’s data down. By configuring multiple disks in a RAID-0 array, you magnify the probability of that volume going offline due to a single disk failure and you maximize the amount of data that goes offline when that single failure occurs.
Q: Modern servers have a lot of disks. What’s the impact of losing a single disk when you have 12 3TB drive in each node?
A: When a single drive fails when Hadoop is configured in its default state, the ENTIRE NODE gets taken offline. Back when servers typically had 6 x 1.5TB drives in them, losing a single disk would cause the loss of 0.02% of total storage in a typical 10PB, three-replica setup. With today’s hardware — typically 12 x 3TB drives per node, losing a single disk results in the loss of five times as much data.
Q: Aren’t today’s HDDs much more reliable than they used to be? Is it worth the extra work to handle the rare cases when a drive fails?
A: While drives are much more reliable than they used to be, they are still the cause of the lion’s share of support tickets in a Hadoop cluster. In fact, according to Bharath Mundlapudi, a Core Hadoop Engineer while working at Yahoo, disk drive failures account for fully 50% of siteops trouble tickets. That’s more than three times the next highest source of tickets.
Q: What does that represent in real terms?
A: It represents a lot of work for systems administrators. How much depends on the size and age of the cluster in question. For example, Facebook, which has some very large clusters, reports that their failure detection and automated repair system is doing the work of approximately 200 full time system administrators.
Q: OK, but not many organizations have clusters that large. What about a typical enterprise setup?
A: Our experience indicates that a 1,000 node cluster containing 12,000 drives for a total raw storage capacity of 48 peta-bytes can expect about 3 drive failures a day in its third year of operation. Drive failure rates rise as the devices age. For a 500 node cluster, you’re looking at a drive failure every 17 hours or so.
Q: Doesn’t this make it hard for the cluster operator to manage? How do they keep up?
A: Without the right tools and methodology, it is very difficult for cluster operators to manage clusters at scale. They typically have to write scripts to scan the cluster, detect disk failures, and report them. Then, once the offending drive has been replaced, commands must be run for the controller to recognize the new drive, OS commands need to be executed to format the drive, and then some Hadoop commands are required to add the disk back to the configuration.
Q: Presumably it’s not quite as challenging for StackIQ customers?
A: StackIQ’s mission is to make cluster operation as painless as possible, which is why we have developed tools to manage the entire lifecycle of the disk. While we haven’t figured out how to get our software to physically pull a bad drive and replace it with a new one, we automate the rest of it — from the initial deployment of the drive, detecting and reporting the error, and re-integrating the replacement drive into the configuration.
One of the features we’ve developed in StackIQ’s management software automatically configures chassis with LSI MegaRaid controllers into “JBODs”, that is, every disk in the chassis will be configured as an individual device.
In addition, a user can specify which disk they want in the chassis to be the boot disk via an attribute (e.g., “bootdisk0″) and if an optional secondary boot disk attribute is specified (“bootdisk1″), then our code will configure both those disks as a “mirror” (RAID1) while still making all the other non-boot disks available to Hadoop as individual disks. A recent StackIQ customer made their purchasing decision on this feature alone, as they recently went through the painful exercise of changing a mid-size cluster’s RAID configuration by booting each server, one-by-one, catching a key press at the controller prompt, and fixing the configuration by-hand. Not a fun exercise when you are under the gun by management to get production cluster online.
Q: With that many drive failures, clusters will be chewing through disks at a brisk rate. That could get expensive. That works out to something like 1000 drives/year X $100/drive = $100k per year just for replacement drives.
A: True, which speaks to the need for software which will make the most efficient use of your resources — intelligent, automated cluster management software can find faulty drives automatically, and bring up a replacement drive quickly.
Q: Doesn’t automation take control out of the hands of the skilled cluster operators?
A: We believe it should be up to the cluster operator to set policies on how much automation to incorporate into their workflows. Our software reflects that philosophy, letting operators choose from a range of policies that go all the way from having the operator run all the commands manually, all the way to a fully automated repair where all the operator needs to do is push in the new drive and let StackIQ’s software do the rest.
Q: Can’t this be done with a simple command script that runs on all nodes?
A: That might be workable in a homogeneous environment, where all the nodes are the same. But in the real world, different nodes require different configurations. Even the disks are likely configured differently in nodes within the clusters. Handling those variables in a static script would be very difficult to accomplish. For example, if your cluster expands over time, you may be adding chassis with different drive configurations. Static scripts wouldn’t be able to deal with this situation. The StackIQ management software has intimate knowledge of the hardware and software in the cluster, so it knows exactly how to handle each drive in each node across the entire cluster, even in a heterogeneous environment.
So there you have it. The folks behind StackIQ cluster management software agree with Steve Loughran’s recommendation to forego RAID-0 for Hadoop clusters. In fact, they provide the management tools to make it easier to do. So take the advice of our experts, and configure your cluster servers as “Just a Bunch of Disks.”
~ Lisa Sensmeier