A recurrent question on the various Hadoop mailing lists is “why does Hadoop prefer a set of separate disks to the same set managed as a RAID-0 disks array?”
It’s about time and snowflakes.
In Hadoop clusters, we recommend treating each disk separately, in a configuration that is known, somewhat disparagingly as “JBOD”: Just a Box of Disks.
In comparison RAID-0, which is a bit of misnomer, there being no redundancy, stripes data across all the disks in the array. This promises some advantages:
In RAID=0, data is striped across disks. When data needs to be written, it is divided up into small blocks (64KB or more). One of these blocks is written to each disk simultaneously. When the data is read back, all the blocks can again be read from all disks simultaneously. The result of this is that your disk bandwidth increases with the size of the array. If you had eight disks mounted as RAID-0, then the theoretical maximum write and read bandwidth is eight times faster than a single disk.
With the disk controllers built into modern servers, RAID-0 is an option that can be turned on: so why not?
Reliability is one issue.
Disks can get slower as they age, as they start to get read errors and have to retry reading bits of the disk platter. A slow disk is a warning sign that maybe you should think about replacing that disk. With Hadoop in a JBOD setting, you can unmount the disk; the Datanode will notice it is missing and report to the Namenode that all the data on it needs re-replication. If you have a RAID-0 disk, everything across all disks is missing – you need to add a new disk to bring the array back up to size, reformat all the disks, and bring up the Datanode without any storage. Over time it will pick up more data, from rebalancing and jobs run on it.
You have to do that whenever any of the disks fails – the more disks you have, the more common it is.
Before panicking – disk failures are rare. Google’s 2007 paper, Failure Trends in a Large Disk Drive Population, reported that in their datacenters, 1.7% of disks failed in the first year of their life, while three-year-old disks were failing at a rate of 8.6%. About 9% isn’t a good number. Returning to the hypothetical eight-disk server, the probability of each disk lasting the year would be:
Hadoop copes with reliability by duplicating data across servers: if one copy of an HDFS data block (64MB of greater) is lost or corrupt, there are usually two copies elsewhere to recover.
Only now, with all the data in a server lost, the amount of data to replicate on a disk failure increases linearly with the number of disks in each server – while the probability of the server failing also increases. Whereas before, each those failing year-three disks would have a probability of failing of 0.914 %, with the amount of data being the size of the disk: 1-3 TB of data.
That eight disk cluster would have to transmit 8-24 TB of data, and do it eight times as frequently. That’s going to be slightly more noticeable.
If you do want to use RAID-0 storage, configuring an eight-disk server as four pairs of RAID-0 storage is much less risky. The IO performance could be double that of a single drive, but so the risk of either failing would be less, and the cost of recovering the data also very much reduced.
Disk failures, then, are the first reason you don’t want to use RAID-0 storage – now to the second.
Hadoop job performance depends on disk bandwidth, especially the read bandwidth.
On RAID-0 Storage the disk accesses go at the rate of the slowest disk. It’s always been believed that the disks would all start out taking the same speed, and only degrade over time. Recent research shows things are worse than this: that the current generations of hard disks vary in performance from day one.
The 2011 paper, Disks Are Like Snowflakes: No Two Are Alike, measured the performance of modern disk drives, and discovered that they can vary in data IO rates by 20%, even when they are all writing to same part of the hard disk. This is because the latest manufacturing processes produce disks with different surface characteristics – altering the density at which the disks can store data. Rather than set the disk electronics up to only support the lowest measured performance, or to discard the slowest disks, manufacturers now use a technique called Adaptive Zoning. The newly manufactured disks are calibrated to their performance, and the the controllers configured to drive each zone in the disk at the highest rate that zone supports.
This is profound – and it’s not something that the disk manufacturers have been publicising. Modern CPU’s are “binned” into parts that support different rates – but those rates are published and you get to pay more for the faster parts. Here the speed of the disk varies, and you just have to hope your parts are the fast ones. In the experiments the authors of the paper conducted, some disks could deliver 105 Megabytes of data a second, with the actual range being 90-111 MB/s.
If you have eight disks, some will be faster than the others, right from day one. And your RAID-0 storage will deliver the performance of the slowest disk right from the day you unpack it from its box and switch it on.
That is the other why we don’t recommend configuring your servers’ storage as one large RAID-0 array.
To summarize: RAID-0 storage appears to increase disk I/O times, but it will deliver data at the rate of the slowest disk in the array – an array whose disk speeds can vary by up to 20%. When a single disk eventually fails, all the data on the server is lost, forcing you to reformat all the disks and waiting for HDFS to repopulate the server with new data.
Some people asked us what about RAID-0 and single drives – so we’d like to clarify this: Hadoop works perfectly well if you configure each drive as a single RAID-0 volume.
This configuration comes about with disk controllers that expect everything to be RAIDed, or if the controller can support JBOD or RAID modes -and you declare one pair of disks as RAID-1. Why would anyone want do declare two disks as RAID-1 – mirrored disks? If you put the OS on that RAID volume then the failure of either of those disks will not stop the server. In the very large 500+ node clusters people don’t do this -they have enough spare servers around, and would rather have the extra storage and bandwidth. On smaller clusters, having the OS on a mirrored pair of disks downgrades a server failure from a serious problem (re-replication of all the server’s data, new disk needed in a few hours) to a task (get a replacement disk and swap it into the server when you get a chance).
The problems with RAID-0 -amplified data replication on a disk failure, performance of the slowest disk- increase with the number of disks. With a single disk, you get exactly the same numbers as you would if the disk controller considered it a JBOD drive.