Here is some information on configuring the hard drives on a Hadoop cluster:
Space needed: Take the amount of data, multiply by 3 (the default HDFS replication factor), and factor in the percentage reduction from compression. Then add about 30% for Hadoop operating space and overhead.
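The sizing rule above can be sketched as a small calculation. This is only an illustration: the function name and parameters are made up for this example, and it assumes "add about 30%" means multiplying the replicated size by 1.3.

```python
def required_raw_capacity_tb(data_tb, replication=3,
                             compression_reduction=0.0, overhead=0.30):
    """Estimate the raw disk capacity (in TB) needed for a data set.

    data_tb               -- uncompressed size of the data, in TB
    replication           -- HDFS replication factor (3 by default)
    compression_reduction -- fraction of space saved by compression
                             (0.5 means the data shrinks to half)
    overhead              -- extra working space, as a fraction (assumed 0.30)
    """
    if not 0.0 <= compression_reduction < 1.0:
        raise ValueError("compression_reduction must be in [0, 1)")
    if data_tb < 0 or replication < 1 or overhead < 0:
        raise ValueError("sizes and factors must be non-negative")

    stored_tb = data_tb * (1.0 - compression_reduction)  # after compression
    replicated_tb = stored_tb * replication               # all replicas
    return replicated_tb * (1.0 + overhead)               # plus working space


# 100 TB of data that compresses by half: 100 * 0.5 * 3 * 1.3
print(f"{required_raw_capacity_tb(100, compression_reduction=0.5):.1f} TB")
```

The compression ratio is usually the least certain input, so it is worth running the estimate with a pessimistic value as well.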
Hadoop uses replication to protect against data loss and unavailability during disk or node outages. Thus, data node disks should be configured as JBOD (just a bunch of disks), with each disk presented to the operating system separately. There is no need for RAID configurations on Hadoop cluster data nodes, as redundancy is achieved through block replication.
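With JBOD, each disk is typically mounted on its own and listed as a separate directory for the data node. The sketch below builds such a property for hdfs-site.xml. The mount points and the "dfs/dn" subdirectory are hypothetical, and the property name dfs.datanode.data.dir is the one used by Hadoop 2 and later (older releases used dfs.data.dir), so check it against your version.

```python
from xml.sax.saxutils import escape


def datanode_data_dir_property(mount_points, subdir="dfs/dn"):
    """Return an hdfs-site.xml <property> listing one directory per disk.

    mount_points -- where each JBOD disk is mounted (hypothetical paths)
    subdir       -- directory under each mount used for HDFS blocks (assumed)
    """
    if not mount_points:
        raise ValueError("at least one mount point is required")
    # One directory per physical disk, joined into a comma-separated list.
    dirs = [f"{mp.rstrip('/')}/{subdir}" for mp in mount_points]
    value = ",".join(dirs)
    return (
        "<property>\n"
        "  <name>dfs.datanode.data.dir</name>\n"
        f"  <value>{escape(value)}</value>\n"
        "</property>"
    )


print(datanode_data_dir_property(["/data/1", "/data/2", "/data/3"]))
```

Listing the disks individually lets the data node spread blocks across them and keep running on the remaining disks if one fails, depending on how failure tolerance is configured.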
For the NameNode, the recommendation is to design redundancy into the hardware configuration to protect against the server going down. Thus, configuring the NameNode disks as RAID is a good idea.
When scaling out hard drive space, you will have the option to add more disks to each node or to add more nodes. With Hadoop, it is recommended to scale out by adding nodes rather than by making each machine more powerful. Adding more nodes means that replicas will be spread out further, which will increase read/write performance, as the average number of network hops needed to reach a block in HDFS will decrease.
SSD: HDFS has not been designed to take advantage of SSD I/O speeds. In most cases, the performance bottleneck for a Hadoop cluster is not disk latency but network I/O or the amount of RAM. Larger SSDs are still expensive, and that expense is not worth the benefit to a Hadoop data node.
When configuring the disks, mount them with the noatime option to improve read/write performance, and do not use LVM to make the disks appear as one.
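As an illustration of mounting each disk separately with noatime, the sketch below generates /etc/fstab entries, one per disk. The device names, the /data mount root, and the ext4 filesystem are assumptions for this example, not requirements.

```python
def fstab_lines(devices, mount_root="/data", fstype="ext4"):
    """Build one /etc/fstab entry per disk, each mounted with noatime.

    devices    -- block devices, one per physical disk (hypothetical names)
    mount_root -- parent directory for the per-disk mount points (assumed)
    fstype     -- filesystem on each disk (assumed ext4)
    """
    lines = []
    for index, device in enumerate(devices, start=1):
        mount_point = f"{mount_root}/{index}"
        # Fields: device, mount point, type, options, dump, fsck pass
        lines.append(f"{device} {mount_point} {fstype} defaults,noatime 0 0")
    return lines


for line in fstab_lines(["/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]):
    print(line)
```

Each disk gets its own mount point instead of being merged into a single LVM volume, which matches the advice above.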