A recurring question on the various Hadoop mailing lists is “why does Hadoop prefer a set of separate disks to the same set managed as a RAID-0 disk array?”
It’s about time and snowflakes.
JBOD and the Allure of RAID-0
In Hadoop clusters, we recommend treating each disk separately, in a configuration that is known, somewhat disparagingly, as “JBOD”: Just a Bunch of Disks.
By comparison, RAID-0 (something of a misnomer, since it provides no redundancy) stripes data across all the disks in the array. This promises some advantages:
- Higher IO rates on small accesses
- Higher bandwidth on larger accesses, especially write operations
- No hot spot where a single disk is overloaded because its data is in greater demand
In RAID-0, data is striped across disks. When data needs to be written, it is divided up into small blocks (64KB or more).…
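To make the striping idea concrete, here is a minimal sketch of how a logical write could be divided into blocks and spread round-robin across the disks. It assumes a fixed 64KB stripe unit and a simple round-robin layout; the names `STRIPE_UNIT`, `locate` and `split_write` are hypothetical, invented for illustration, and do not correspond to any particular RAID controller or driver.

```python
STRIPE_UNIT = 64 * 1024  # assumed stripe unit of 64KB; real arrays may use larger values


def locate(offset, num_disks, stripe_unit=STRIPE_UNIT):
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    # Which stripe unit the offset falls in, and how far into that unit it is.
    unit_number, within_unit = divmod(offset, stripe_unit)
    # Units are dealt out to the disks in round-robin order.
    disk = unit_number % num_disks
    # Each full pass over the disks advances one "row" on every disk.
    row = unit_number // num_disks
    return disk, row * stripe_unit + within_unit


def split_write(offset, length, num_disks, stripe_unit=STRIPE_UNIT):
    """Split a logical write into (disk, disk_offset, chunk_length) pieces."""
    chunks = []
    end = offset + length
    while offset < end:
        disk, disk_offset = locate(offset, num_disks, stripe_unit)
        # A chunk stops at the end of the current stripe unit or the end of the write.
        chunk_length = min(stripe_unit - offset % stripe_unit, end - offset)
        chunks.append((disk, disk_offset, chunk_length))
        offset += chunk_length
    return chunks


if __name__ == "__main__":
    # A 256KB write to a four-disk array touches every disk once.
    for chunk in split_write(0, 256 * 1024, num_disks=4):
        print(chunk)
```

Under these assumptions, a single large write lands on every disk in the array, which is where the promised bandwidth comes from, and also why every disk is involved in a large access.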