November 09, 2012

Why not RAID-0? It’s about Time and Snowflakes

A recurrent question on the various Hadoop mailing lists is "why does Hadoop prefer a set of separate disks to the same set managed as a RAID-0 disk array?"

It’s about time and snowflakes.

JBOD and the Allure of RAID-0

In Hadoop clusters, we recommend treating each disk separately, in a configuration known, somewhat disparagingly, as "JBOD": Just a Bunch of Disks.

In comparison, RAID-0 (a bit of a misnomer, as there is no redundancy) stripes data across all the disks in the array. This promises some advantages:

  • Higher IO rates on small accesses
  • Higher bandwidth on larger accesses, especially write operations
  • Elimination of the hot spot of a single overloaded disk whose data is in high demand

In RAID-0, data is striped across disks. When data needs to be written, it is divided into small chunks (64 KB or more), and one chunk is written to each disk simultaneously. When the data is read back, the chunks can again be read from all disks simultaneously. The result is that your disk bandwidth increases with the size of the array: if you had eight disks mounted as RAID-0, the theoretical maximum write and read bandwidth would be eight times that of a single disk.
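
As a sketch of the mechanism (the 64 KB chunk size and eight-disk array are just the figures from the paragraph above, not any particular controller's parameters):

    CHUNK = 64 * 1024   # stripe chunk size: 64 KB
    DISKS = 8           # disks in the hypothetical array

    def stripe(data):
        """Assign successive chunks of `data` round-robin across the disks."""
        queues = [[] for _ in range(DISKS)]
        for i in range(0, len(data), CHUNK):
            queues[(i // CHUNK) % DISKS].append(data[i:i + CHUNK])
        return queues

    # Each disk writes (or reads) its queue in parallel, so the theoretical
    # bandwidth is DISKS times that of one drive, e.g. 8 x 100 MB/s = 800 MB/s.
    # A 1 MB write lands as two 64 KB chunks on each of the eight disks:
    demo = stripe(b"\x00" * (1024 * 1024))
    print([len(q) for q in demo])   # -> [2, 2, 2, 2, 2, 2, 2, 2]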

With the disk controllers built into modern servers, RAID-0 is an option that can be turned on: so why not?

Reliability

Reliability is one issue.

Disks can get slower as they age, as they start to hit read errors and have to retry reading parts of the disk platter. A slow disk is a warning sign that you should think about replacing it. With Hadoop in a JBOD setting, you can unmount the disk; the Datanode will notice it is missing and report to the Namenode that all the data on it needs re-replication. With a RAID-0 array, the data on every disk is gone: you need to add a new disk to bring the array back up to size, reformat all the disks, and bring up the Datanode with empty storage. Over time it will pick up more data, from rebalancing and from jobs run on it.

You have to do that whenever any of the disks fails; the more disks you have, the more often that happens.

Before panicking, remember that disk failures are rare. Google's 2007 paper, Failure Trends in a Large Disk Drive Population, reported that in their datacenters, 1.7% of disks failed in the first year of their life, while three-year-old disks were failing at a rate of 8.6%. About 9% isn't a good number. Returning to the hypothetical eight-disk server, the probability of each disk lasting the year would be:

1 - 0.086 = 0.914
The probability that all disks make it to their next birthday becomes:

0.914^8 = 0.487
If those Google numbers matched those of the disks in your servers, and weren't due to a really bad batch of disks, then during that third year about half the datanodes would lose all their data and need to be rebuilt. If you have one of the latest twelve-disk servers, things get even worse: 0.914^12 = 0.340.
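
The same arithmetic in a few lines of Python, for anyone who wants to rerun it with their own failure rates (the 8.6% figure is the Google year-three number; everything else follows from it):

    p_disk_survives = 1 - 0.086   # = 0.914, year-three survival probability

    for disks in (1, 8, 12):
        p_no_disk_fails = p_disk_survives ** disks
        print("%2d disks: P(no disk fails this year) = %.3f"
              % (disks, p_no_disk_fails))

    #  1 disk:  0.914
    #  8 disks: 0.487 -- about half the eight-disk servers lose a disk
    # 12 disks: 0.340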

Hadoop handles reliability by duplicating data across servers: if one copy of an HDFS data block (64 MB or greater) is lost or corrupt, there are usually two copies elsewhere to recover from.

Only now, with all the data in the server lost, the amount of data to re-replicate after a disk failure increases linearly with the number of disks in each server, while the probability of the server losing its storage also increases. With JBOD, each of those failing year-three disks would have a probability of failing of 8.6%, and the amount of data lost would be the size of one disk: 1-3 TB.

That eight-disk server would have to transmit 8-24 TB of data, and do it roughly eight times as often. That's going to be slightly more noticeable.

If you do want to use RAID-0 storage, configuring an eight-disk server as four RAID-0 pairs is much less risky. The IO performance of each pair could be double that of a single drive, but the risk of any one pair failing is lower, and the cost of recovering the data when one does is very much reduced.
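
A back-of-the-envelope comparison of the three layouts, assuming (purely for illustration) the same 8.6% annual failure rate and 2 TB disks:

    AFR = 0.086    # annual disk failure rate (Google, year three)
    DISK_TB = 2.0  # assumed disk capacity in TB

    def expected_loss_tb(groups, disks_per_group):
        """Expected TB to re-replicate per server per year."""
        p_group_fails = 1 - (1 - AFR) ** disks_per_group
        return groups * p_group_fails * disks_per_group * DISK_TB

    for groups, size, label in ((1, 8, "one 8-disk RAID-0"),
                                (4, 2, "four RAID-0 pairs"),
                                (8, 1, "JBOD")):
        print("%-18s expected re-replication: %4.1f TB/year"
              % (label, expected_loss_tb(groups, size)))

    # one 8-disk RAID-0:  8.2 TB/year
    # four RAID-0 pairs:  2.6 TB/year
    # JBOD:               1.4 TB/year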

Disk failures, then, are the first reason you don’t want to use RAID-0 storage – now to the second.

Every Disk is a Unique Snowflake

Hadoop job performance depends on disk bandwidth, especially the read bandwidth.

On RAID-0 storage, disk accesses go at the rate of the slowest disk. It has always been believed that the disks would all start out at the same speed, and only degrade over time. Recent research shows things are worse than this: current generations of hard disks vary in performance from day one.

The 2011 paper, Disks Are Like Snowflakes: No Two Are Alike, measured the performance of modern disk drives, and discovered that they can vary in data IO rates by 20%, even when they are all writing to the same part of the hard disk. This is because the latest manufacturing processes produce disks with different surface characteristics, altering the density at which the disks can store data. Rather than set the disk electronics up to support only the lowest measured performance, or discard the slowest disks, manufacturers now use a technique called Adaptive Zoning: newly manufactured disks are calibrated to their performance, and the controllers configured to drive each zone in the disk at the highest rate that zone supports.

This is profound, and it's not something that the disk manufacturers have been publicising. Modern CPUs are "binned" into parts that support different rates, but those rates are published and you pay more for the faster parts. Here the speed of the disk varies, and you just have to hope your parts are the fast ones. In the experiments the paper's authors conducted, disks delivered about 105 megabytes of data a second on average, with the actual range being 90-111 MB/s.

If you have eight disks, some will be faster than the others, right from day one. And your RAID-0 storage will deliver the performance of the slowest disk right from the day you unpack it from its box and switch it on.
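
A small simulation shows what that costs, assuming (purely for illustration) that per-disk speeds are drawn uniformly from the 90-111 MB/s range the paper measured:

    import random

    random.seed(1)
    speeds = [random.uniform(90, 111) for _ in range(8)]  # MB/s for each disk

    jbod_bw = sum(speeds)        # JBOD: every disk streams at its own speed
    raid0_bw = 8 * min(speeds)   # RAID-0: the array is paced by the slowest disk

    print("JBOD aggregate:   %.0f MB/s" % jbod_bw)
    print("RAID-0 aggregate: %.0f MB/s" % raid0_bw)

On a typical draw, the RAID-0 array gives up the gap between the slowest disk and the average of all eight, multiplied across the whole array.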

That is the other reason we don't recommend configuring your servers' storage as one large RAID-0 array.

Summary

To summarize: RAID-0 storage appears to promise higher disk I/O rates, but it will deliver data at the rate of the slowest disk in the array, an array whose disk speeds can vary by up to 20%. When a single disk eventually fails, all the data on the server is lost, forcing you to reformat all the disks and wait for HDFS to repopulate the server with new data.

  1. We use JBOD storage for all our worker nodes, and recommend our customers do the same.
  2. If anyone does insist on RAID-0 storage, restrict it to pairs of disks, keeping the risk and cost of failures down.

Update: Single Drive RAID-0

Some people asked us about RAID-0 and single drives, so we'd like to clarify this: Hadoop works perfectly well if you configure each drive as a single RAID-0 volume.

This configuration comes about with disk controllers that expect everything to be RAIDed, or when a controller supports both JBOD and RAID modes and you declare one pair of disks as RAID-1. Why would anyone want to declare two disks as RAID-1, that is, mirrored disks? If you put the OS on that RAID-1 volume, then the failure of either of those disks will not stop the server. In the very large 500+ node clusters people don't do this: they have enough spare servers around, and would rather have the extra storage and bandwidth. On smaller clusters, having the OS on a mirrored pair of disks downgrades the failure of an OS disk from a serious problem (re-replication of all the server's data, new disk needed in a few hours) to a routine task (get a replacement disk and swap it into the server when you get a chance).

The problems with RAID-0 (amplified data re-replication on a disk failure, the performance of the slowest disk) increase with the number of disks. With a single disk, you get exactly the same numbers as you would if the disk controller considered it a JBOD drive.


Comments

  • Will the placement algorithm for HDFS try to balance at the level of the individual hard drive, and will an HDFS rebalance work at that level too? If not, how does one deal with having really hot volumes and really cold volumes when running with JBOD?

  • @Warren: no, HDFS doesn’t rebalance on individual drives (yet); there’s a JIRA on it, because as the size of HDDs increases, the imbalance of a replaced drive becomes bigger. We generally rely on time to sort things out, and on the fact that once the other disks become full, the new disk gets more use.

    See: https://issues.apache.org/jira/browse/HDFS-1312

    The initial solution may just be a .py script to move data around on the server, with no need for cross-server rebalancing; a rough sketch of such a script is below.

    Hot vs Cold: again, this relies on a mix of workloads. It’s considered better to run mappers on servers that have the blocks and space for tasks, irrespective of HDD load. To do anything else would need a lot more tracking of block location within a server, and of the activity on drives at any given moment.

    YARN-3 [ https://issues.apache.org/jira/browse/YARN-3 ] has added Linux cgroup limits on CPU & RAM for containers; this could be extended to limiting IO bandwidth for tasks, so you could give lower-priority workloads less bandwidth than higher-priority ones.
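
    A minimal sketch of what such a script might do, assuming the Datanode is stopped first (the data directory paths and the 5% threshold are made up for illustration; a real script would read dfs.datanode.data.dir and handle the block/meta file pairs properly):

        import os
        import shutil

        # Hypothetical data directories, one per physical disk.
        DATA_DIRS = ["/data/1/hdfs", "/data/2/hdfs", "/data/3/hdfs"]

        def usage(path):
            """Fraction of the filesystem under `path` that is in use."""
            s = os.statvfs(path)
            return 1.0 - float(s.f_bavail) / s.f_blocks

        # Move files from the fullest directory to the emptiest until their
        # usage is within 5% of each other. Only safe with the Datanode down.
        dirs = sorted(DATA_DIRS, key=usage)
        emptiest, fullest = dirs[0], dirs[-1]
        for name in os.listdir(fullest):
            if usage(fullest) - usage(emptiest) < 0.05:
                break
            shutil.move(os.path.join(fullest, name),
                        os.path.join(emptiest, name))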

  • There is a third reason why RAID, including RAID-0, hurts performance, and it’s strictly mechanical. When large HDFS blocks are striped across multiple disks, more mechanical head seeks in total are required to read or write each block. Under heavy I/O loads, this results in disk thrashing.

    When reading a large HDFS block of 64MB or more from a RAID set with a much smaller chunk size, every disk must move its heads to read its chunks of the block. The number of disk seeks equals the number of disks in the RAID set. If there is only one HDFS read happening at the time, that’s not much of a problem. But under heavy loads, there are lots of HDFS reads and writes happening concurrently, with every disk involved in each one of them. That’s a big problem.

    The ideal is to have each HDFS block read involve one disk seek and one long, continuous read. The closer you get to that ideal, the better the disk performance. A rough count of the seeks involved follows below.
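
    To put rough numbers on that (the disk count is illustrative):

        DISKS = 8   # disks in the RAID-0 set

        # JBOD: a 64 MB block is contiguous on one disk -- one seek, then one
        # long sequential read. RAID-0: every disk holds a 64 KB-chunked share
        # of the block, so a single block read costs DISKS seeks.
        for reads in (1, 4, 16):             # concurrent block reads
            raid0_seeks_per_disk = reads     # each disk serves every read
            jbod_seeks_per_disk = reads / float(DISKS)
            print("R=%2d: RAID-0 %2d seeks/disk, JBOD %4.1f seeks/disk"
                  % (reads, raid0_seeks_per_disk, jbod_seeks_per_disk))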

    • > Terrible article, full of inaccuracies, avoid!
      Really? Why so?
      Looking at the paper, I’d be curious whether someone has done similar experiments with SSDs: you don’t get the mechanical variance between disks, but you do get the bandwidth of multiple devices.

  • RAID0 REALLY! This is an IT operations manager’s nightmare. In times where there is a lot of pressure to do more with less due to budget cuts, who wants to make time for this? I’m sorry, but disk failures are not that uncommon. They are still a high-maintenance item. When a single disk fails in that RAID set, your RAID0 is destroyed. There’s no automatic rebuild of a RAID0, as there is with RAID1 or RAID5 once a disk is replaced. So now a human being has to reconfigure that NOW MISSING RAID0 partition. Again, who wants to add this to the already daunting tasks in an IT operations center? I don’t.

    • I’m disagreeing with you there on single-server systems, a category which now extends to desktop boxes, where the lack of ECC is becoming a critical issue.

      If you are using HDFS-style replication, then you can rebuild data from elsewhere in the cluster. That helps you survive server failures as well as RAID problems.

      I’ve an old talk, Did you really want that data, on the topic, which even looks at problems in RAID arrays.

      • What do you plan on doing with all these failed RAID0 partitions? When a single disk fails in that RAID set, your RAID0 is destroyed. Someone has to take time to recreate it manually. Nobody has time for that stuff. It’s a step backwards while we’re trying to do more with fewer people. IT operations managers are going to ignore the entire situation, laugh at the server owners, and tell them to do it themselves. RAID0 RIGHT!!

        • I’m not actually advocating RAID0, have you noticed? I’m discussing some of the performance let-downs of RAID0, where you can get slowed down to the slowest of the disks in the array, so it’s not as good as people might think. Whereas with JBOD, at least you get what it says on the tin. And in HDFS, when you get an HDD failure, the cluster rebuilds from the other disks, without human intervention. Even so, you do need to keep an eye on failure rates, as they can be a sign of problems, and unless you swap in new disks your cluster capacity will decrease.

          RAID5 has always been the reference point for keeping data safe, but as you note, rebuild time is the enemy; the issue being that HDD capacity has grown faster than HDD bandwidth, so time to rebuild has increased, and it is during that rebuild that you are most vulnerable. RAID6 and erasure coding are different: I don’t know the details well enough to have any valid opinion on the matter at all.

          There’s a recent (Jan 2017) blog post from Martin Kleppmann on disk failure and rebuild times. I think he’s missed that time-to-rebuild decreases on large clusters, but I haven’t done the maths to really defend that belief. And again, the growing imbalance between bandwidth and capacity on HDDs changes things, while SSDs are going to fail differently, in ways which aren’t being written up in Google/Facebook/Microsoft papers yet. As for erasure-code rebuild times: I don’t know the details there at all.

          (I should add that however you store your data, you need an off-site disaster recovery story. For Hadoop, distcp to Azure wasb and Amazon S3 does work as a backup mechanism, though it’s not so good for incremental backups, as we can’t compare HDFS checksums and S3 etags to see if things have changed. Someone (you? me?) should build a better backup there by adding the ability to compute the etag checksum on files in HDFS, so the two can be compared properly.)
