HBase Forum: Disk space and memory recommendations

This topic contains 1 reply, has 2 voices, and was last updated by Sameer Farooqui 2 years, 5 months ago.

  • Topic #4025

    Li Tao Zhen (Member)

    Hi,

    We plan on processing about 1 TB of log data daily and then storing it in HBase for querying.

    How much disk space and RAM should we allocate on each data node to ensure that the DataNode, MapReduce, and HBase daemons all get enough resources?

    How much total disk space and RAM should we allocate to each?

    Thanks,

    Li


  • Reply #4056

    Sameer Farooqui

    Li – we typically don’t recommend running traditional MapReduce jobs on the same cluster hosting HBase. HBase applications demand low latency in the tens of milliseconds. If the region server nodes are simultaneously using their memory/disk/CPU resources for MR, then HBase latency could go up significantly.

    That being said, am I correct in assuming that you will be generating 1 TB of fresh log data daily? If so, in a year you'll have around 350 TB of data in HBase, and I can help you brainstorm how to spec out such a cluster. With a replication factor of 3, you'll actually need a cluster capable of storing about a PB of data.
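
    To make that arithmetic concrete, here is a minimal sizing sketch in Python; the daily volume, retention window, and replication factor are just the assumptions above, so plug in your own figures:

        # Back-of-the-envelope HBase storage sizing (assumed inputs from above).
        daily_ingest_tb = 1.0      # fresh log data per day
        days_retained = 350        # roughly one year of retention
        replication_factor = 3     # HDFS default

        raw_tb = daily_ingest_tb * days_retained     # ~350 TB of real data
        on_disk_tb = raw_tb * replication_factor     # ~1,050 TB, i.e. about a PB on disk

        print(f"raw data: {raw_tb:.0f} TB, on disk with replication: {on_disk_tb:.0f} TB")

    Note that this ignores compression, HBase/HDFS overhead, and headroom for compactions, so treat it as a floor rather than a final number.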

    Just an FYI, a typical HBase slave node at Yahoo! looks like this: 8-core CPU, 24 GB RAM, and 12 TB of disk.

    Rules of thumb to start with:

    Master Node: 24 GB RAM total; give the HBase Master daemon 4 GB, the NameNode (NN) 8 GB, and the JobTracker (JT) 2 GB. Use an 8-core CPU and RAID 0+1 or 1+0 for the local disks.

    Secondary NameNode (SNN) daemon: 8 GB RAM

    Slave Servers: 24+ GB RAM total; give the HBase RegionServer 12 GB, the DataNode (DN) 1 GB, the TaskTracker (TT) 1 GB, and ZooKeeper (ZK) 1 GB.
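
    A quick sketch of how that slave-node memory budget adds up (Python; the 24 GB total and per-daemon heaps are the rule-of-thumb figures above, not measured values):

        # Rule-of-thumb RAM budget for one HBase slave node (all figures in GB, assumed).
        total_ram_gb = 24
        heaps_gb = {
            "HBase RegionServer": 12,
            "DataNode": 1,
            "TaskTracker": 1,
            "ZooKeeper": 1,
        }

        allocated = sum(heaps_gb.values())
        headroom = total_ram_gb - allocated  # left for the OS, page cache, and MR child task JVMs

        for daemon, gb in heaps_gb.items():
            print(f"{daemon:>20}: {gb} GB")
        print(f"{'unallocated':>20}: {headroom} GB (OS, file-system cache, task JVMs)")

    The unallocated portion is not wasted; the OS page cache and the MapReduce child task JVMs will use it.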

    A typical Region Server will have 6 – 12 disks, each 1 – 2 TB in size. Use 3.5″ SATA disks if you can for highest reliability (some new servers may require 2.5″ disks though). Use ext3 or ext4 file systems on the disks.

    So, if you use 50 slave servers, each one will host about 7 TB of real data, or about 21 TB after accounting for replication factor 3.
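
    The per-node share works out like this (Python, assuming the 50-slave count and the yearly volume above):

        # Per-slave storage share for a 50-node cluster (assumed figures from above).
        raw_tb = 350
        replication_factor = 3
        num_slaves = 50

        per_node_raw = raw_tb / num_slaves                    # ~7 TB of real data per node
        per_node_on_disk = per_node_raw * replication_factor  # ~21 TB on disk per node

        print(f"per node: {per_node_raw:.0f} TB raw, {per_node_on_disk:.0f} TB on disk")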

    As far as actual numbers go, it depends on your workload. Here's a calculation you may find handy: four 1 TB disks in one node give you roughly 400 IOPS or 400 MB/second of transfer throughput for cold data access (assuming about 100 IOPS and 100 MB/second per spindle). If you instead use eight 500 GB disks, you get roughly 800 IOPS or 800 MB/second.
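
    That spindle math, spelled out (Python; the ~100 IOPS and ~100 MB/second per disk are the implicit per-spindle assumptions behind those numbers):

        # Rough per-node I/O estimate from spindle count (assumed per-disk figures).
        iops_per_disk = 100        # typical 7,200 RPM SATA spindle
        mb_per_s_per_disk = 100    # sequential transfer for cold data

        for disks, size_tb in [(4, 1.0), (8, 0.5)]:
            print(f"{disks} x {size_tb} TB disks: "
                  f"~{disks * iops_per_disk} IOPS, ~{disks * mb_per_s_per_disk} MB/s, "
                  f"{disks * size_tb:.0f} TB raw capacity")

    In other words, for the same raw capacity, more and smaller spindles buy you more aggregate IOPS and transfer throughput.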

    Don’t go over 16 GB for the region server heap size, as this has been known to cause issues with garbage collection.
