JBOD and 3 Replicas vs RAID 5 and 2 replicas

This topic contains 3 replies, has 3 voices, and was last updated by  Larry Liu 1 year, 5 months ago.

  • Creator
    Topic
  • #16993

    Oner Aktas
    Member

    Hi,

    We are planning hardware for an HBase on HDFS implementation, and disk sizing affects the number of servers to buy. We have not been able to decide on a disk configuration, and I am trying to get a comparison of JBOD and RAID 5.

    For the data nodes: would using JBOD and 3 replicas provide similar assurance of data availability to using RAID 5 and 2 replicas (for 12 x 3 TB SATA 7200 RPM disks)?

    What would the effect on performance be? Would RAID 5 be better or worse?

    Thank you,

    Oner

Viewing 3 replies - 1 through 3 (of 3 total)

The topic ‘JBOD and 3 Replicas vs RAID 5 and 2 replicas’ is closed to new replies.

  • Author
    Replies
  • #17076

    Larry Liu
    Moderator

    Hi, Oner

    For data nodes, JBOD should perform better, since RAID 5 adds extra overhead.

    For the NameNode, RAID can be used. It is also a good idea to set up multiple drives in dfs.name.dir in hdfs-site.xml.
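
A minimal sketch of such an entry follows. The property name is the one mentioned above; the three paths are hypothetical placeholders, so substitute your own mount points. dfs.name.dir takes a comma-separated list of directories, and the NameNode keeps a copy of its metadata in each one.

```xml
<!-- hdfs-site.xml (fragment). All paths below are hypothetical examples. -->
<property>
  <name>dfs.name.dir</name>
  <!-- Two local drives plus a remote mount; each holds a full metadata copy -->
  <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>
```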

    Hope this helps

    Thanks

    Larry

    #17073

    Oner Aktas
    Member

    Hi Ted,

    Thank you for your reply. Although you said we do not need RAID, using RAID 5 with two replicas would significantly reduce the storage required.
    For example, 100 TB of data x 3 replicas + 30% overhead = 390 TB of disk space required. With JBOD data nodes of 12 x 3 TB disks (36 TB each), that is 390 / 36 ≈ 10.8, so 11 data nodes.

    If RAID 5 had no penalty (or an acceptable level of penalty), 100 TB of data x 2 replicas + 30% overhead = 260 TB of disk space would be required. With RAID 5 of (11+1), giving 33 TB usable per node, that would mean 260 / 33 ≈ 7.9, so 8 data nodes.
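
The two node counts above can be checked with a short Python sketch. The function name and its parameters are made up for illustration; the input figures are the ones in this post, and the only assumption is that RAID 5 (11+1) leaves 11 of the 12 disks usable for data.

```python
import math

def nodes_needed(data_tb, replicas, overhead, usable_tb_per_node):
    """Return (raw TB required, data node count rounded up)."""
    raw_tb = data_tb * replicas * (1 + overhead)
    return raw_tb, math.ceil(raw_tb / usable_tb_per_node)

# JBOD: all 12 x 3 TB disks hold data, 3 replicas
jbod_raw, jbod_nodes = nodes_needed(100, 3, 0.30, 12 * 3)

# RAID 5 (11+1): one disk's worth of capacity goes to parity, 2 replicas
raid_raw, raid_nodes = nodes_needed(100, 2, 0.30, 11 * 3)

print(f"JBOD:   {jbod_raw:.0f} TB raw, {jbod_nodes} data nodes")
print(f"RAID 5: {raid_raw:.0f} TB raw, {raid_nodes} data nodes")
```

This only compares capacity. It says nothing about the performance cost of RAID 5 or about how each layout behaves when a disk or node fails.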

    You mentioned that increasing the number of nodes would increase performance; I note that. I am concerned that RAID 5 itself might cause performance issues, but I could not find any articles comparing the performance of JBOD and RAID 5.

    Let me ask another question using the case above: what would the performance impact of RAID 5 be if I compared 11 data nodes with JBOD (12 x 3 TB disks) against 11 data nodes with RAID 5 (11+1 of 3 TB disks)?

    Thank you,

    Oner

    #16994

    tedr
    Member

    Hi Oner,

    Here is some information on configuring the hard drives in a Hadoop cluster:

    Space needed: take the amount of data, multiply by 3 (replication), and factor in the percentage reduction from compression. Then add about 30% for Hadoop operating space and overhead.
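
As a rough illustration, the rule of thumb above can be written as a small Python function. The function and parameter names are hypothetical, and `compression_reduction` is assumed to mean the fraction of the data size saved by compression (0.0 for uncompressed data). The 50% figure in the second call is only an example value, not a measured ratio.

```python
def raw_space_tb(data_tb, replication=3, compression_reduction=0.0, overhead=0.30):
    """Estimate raw disk space (TB) from the sizing rule of thumb."""
    replicated = data_tb * replication
    # Apply the assumed fractional size reduction from compression
    compressed = replicated * (1.0 - compression_reduction)
    # Add operating space and overhead on top
    return compressed * (1.0 + overhead)

# 100 TB of uncompressed data
print(f"{raw_space_tb(100):.0f} TB")

# Same data, assuming compression halves its size (hypothetical ratio)
print(f"{raw_space_tb(100, compression_reduction=0.5):.0f} TB")
```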

    Hadoop utilizes replication to protect against data loss and unavailability during disk or node outages. Thus, data node disks should be configured as JBOD. There is no need for RAID configurations for Hadoop cluster data nodes as redundancy is achieved through block replication.

    For the Name Node, the recommendation is to design redundancy into the hardware configuration to protect from the server going down. Thus configuring NameNode disks as RAID is a good idea.

    When adding hard drive space, you have the option to add more disks to each node or to add more nodes. With Hadoop, it is recommended to scale out by adding nodes rather than to scale up by making each machine more powerful. Adding more nodes means that replicas will be spread further out, which will increase read/write performance, as the average number of network hops to reach a block in HDFS will decrease.

    SSD – HDFS has not been designed to take advantage of SSD I/O speeds. In most cases, the performance bottleneck for Hadoop clusters is not disk latency but network I/O or the amount of RAM. Larger SSDs are still expensive, and that expense is not worth the benefit to a Hadoop data node.

    When configuring the disks – set noatime to improve read/write performance, and do not use LVM to make the disks appear as one.

    Thanks,

    Ted
