The legacy Hortonworks Forum is now closed. A read-only version of the former site remains available. The site will be taken offline on January 31, 2016.

HDFS Forum

JBOD and 3 Replicas vs RAID 5 and 2 replicas

  • #16993
    Oner Aktas


    We are planning hardware for an HBase on HDFS implementation, and disk sizing affects the number of servers to buy. We could not make a decision about the disk configuration, and I am trying to get a comparison of JBOD and RAID 5.

    For the data nodes: would using JBOD and 3 replicas provide similar assurance of data availability as using RAID 5 and 2 replicas? (For 12 x 3 TB SATA 7200 RPM disks.)

    What would the effect on performance be: would RAID 5 be better or worse?

    Thank you,


  • #16994

    Hi Oner,

    Here is some information on configuring the hard drives in a Hadoop cluster:

    Space needed: take the amount of data, multiply by 3 (replication), and factor in the percentage reduction from compression. Then add about 30% for Hadoop operating space and overhead.
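As a sketch of that sizing rule (the 3x replication and roughly 30% overhead come from the guidance above; the compression ratio is an assumed parameter you would measure for your own data):

```python
def raw_disk_needed_tb(data_tb, replication=3, compression_ratio=1.0, overhead=0.30):
    """Estimate raw disk space (TB) for a Hadoop cluster.

    compression_ratio is stored size / original size,
    e.g. 0.5 if the data compresses to half its size.
    """
    return data_tb * replication * compression_ratio * (1 + overhead)

# 100 TB of data, 3 replicas, no compression, ~30% operating overhead:
print(raw_disk_needed_tb(100))  # roughly 390 TB
```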

    Hadoop utilizes replication to protect against data loss and unavailability during disk or node outages. Thus, data node disks should be configured as JBOD. There is no need for RAID configurations for Hadoop cluster data nodes as redundancy is achieved through block replication.

    For the NameNode, the recommendation is to build redundancy into the hardware configuration to protect against the server going down. Thus, configuring the NameNode disks as RAID is a good idea.

    When scaling out hard drive space, you will have the option to add more disks to each node or to add more nodes. With Hadoop, it is recommended to scale out by adding nodes rather than to scale up by making each machine more powerful. Adding more nodes means that replicas are spread further out, which increases read/write performance, as the average number of network hops to reach a block in HDFS decreases.

    SSD – HDFS has not been designed to take advantage of SSD I/O speeds. In most cases, the performance bottleneck for a Hadoop cluster is not disk latency but network I/O or the amount of RAM. Larger SSD drives are still expensive, and that expense is not worth the benefit for a Hadoop data node.

    When configuring the disks, set noatime to improve read/write performance, and do not use LVM to make the disks appear as one.
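As an illustration, a data node disk mounted with noatime might look like this in /etc/fstab (the device name and mount point are placeholders, not a recommendation):

```
# Illustrative /etc/fstab entry for one data node disk (JBOD, one filesystem per disk).
# noatime disables access-time updates on reads, avoiding an extra write per read.
/dev/sdb1  /grid/0  ext4  defaults,noatime  0 0
```

Each physical disk gets its own entry and mount point; the data node is then configured to use all of them rather than a single LVM volume.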



    Oner Aktas

    Hi Ted,

    Thank you for your reply. Although you said we do not need RAID, by using RAID 5 and two replicas we could significantly reduce the storage required.
    For an example case of 100 TB of data: x 3 replicas and + 30% overhead equates to 390 TB of disk space required. Using data nodes with JBOD (12 x 3 TB disks, 36 TB each), that is 390/36 ≈ 11 data nodes.

    If RAID 5 had no penalty (or an acceptable level of penalty): 100 TB of data x 2 replicas + 30% overhead = 260 TB of disk space required. Using RAID 5 (11+1 of 3 TB disks, 33 TB usable per node), that would mean 260/33 ≈ 8 data nodes.
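The arithmetic behind the two scenarios can be checked as follows (node counts rounded up; usable capacities follow the disk layouts described above):

```python
import math

def cluster_size(data_tb, replicas, overhead, usable_tb_per_node):
    """Return (raw TB required, data nodes needed) for a given layout."""
    required = data_tb * replicas * (1 + overhead)
    return required, math.ceil(required / usable_tb_per_node)

# JBOD: 3 replicas, all 12 x 3 TB disks usable -> 36 TB per node.
jbod_tb, jbod_nodes = cluster_size(100, replicas=3, overhead=0.30,
                                   usable_tb_per_node=12 * 3)

# RAID 5 (11+1): 2 replicas, one disk's capacity lost to parity -> 33 TB per node.
raid_tb, raid_nodes = cluster_size(100, replicas=2, overhead=0.30,
                                   usable_tb_per_node=11 * 3)

print(jbod_nodes)  # ~390 TB raw -> 11 data nodes
print(raid_nodes)  # ~260 TB raw -> 8 data nodes
```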

    You mentioned that increasing the number of data nodes would increase performance; noted. I am concerned that RAID 5 itself might cause performance issues, but I could not find any articles comparing the performance of JBOD vs. RAID 5.

    Let me ask another question using the case above: what would the performance impact of RAID 5 be if I compared 11 data nodes with JBOD (12 x 3 TB disks) vs. 11 data nodes with RAID 5 (11+1 of 3 TB disks)?

    Thank you,


    Larry Liu

    Hi Oner,

    For data nodes, JBOD should perform better, since there is extra overhead with RAID 5.

    For the NameNode, RAID can be used. But it is also a good idea to set up multiple directories in hdfs-site.xml so the NameNode writes its metadata to more than one drive.
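    For example, the `dfs.namenode.name.dir` property accepts a comma-separated list of directories, and the NameNode mirrors its metadata to each one (the paths below are illustrative; an NFS mount is a common choice for one of them):

    ```xml
    <!-- hdfs-site.xml: NameNode metadata is written to every listed directory. -->
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/grid/0/hadoop/hdfs/namenode,/grid/1/hadoop/hdfs/namenode</value>
    </property>
    ```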

    Hope this helps



The topic ‘JBOD and 3 Replicas vs RAID 5 and 2 replicas’ is closed to new replies.
