Li – we typically don’t recommend running traditional MapReduce jobs on the same cluster that hosts HBase. HBase applications demand low latency, in the tens of milliseconds, and if the RegionServer nodes are simultaneously spending their memory, disk, and CPU on MR tasks, HBase latency can go up significantly.
That being said, am I correct in assuming that you will be generating 1 TB of fresh log data daily? If so, in a year you’ll have around 365 TB of data in HBase, so I can help you brainstorm how to spec out such a cluster. And with a replication factor of 3, you’ll actually need a cluster capable of storing over a petabyte.
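To make that arithmetic concrete, here’s a quick back-of-the-envelope sketch (the 1 TB/day figure is your estimate; replication factor 3 is the HDFS default):

```python
# Storage projection for one year of log ingest.
daily_ingest_tb = 1.0   # fresh log data per day (your estimate)
replication = 3         # HDFS default replication factor

raw_tb = daily_ingest_tb * 365
total_tb = raw_tb * replication
print(f"Raw data after one year: {raw_tb:.0f} TB")        # 365 TB
print(f"With {replication}x replication: {total_tb:.0f} TB "
      f"(~{total_tb / 1000:.1f} PB)")                     # ~1.1 PB
```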
Just an FYI, a typical HBase slave node at Yahoo! looks like this: 8-core CPU, 24 GB RAM, 12 TB disk
Rules of thumb to start with (there’s a quick sketch tallying these allocations right after the list):
Master Node: 24 GB RAM total; give the HBase Master daemon 4 GB, the NameNode (NN) 8 GB, and the JobTracker (JT) 2 GB; 8-core CPU; RAID 0+1 or 1+0 local disks
Secondary NameNode (SNN) daemon: 8 GB RAM
Slave Servers: 24+ GB RAM total; give the HBase RegionServer 12 GB, the DataNode (DN) 1 GB, the TaskTracker (TT) 1 GB, and ZooKeeper (ZK) 1 GB
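Here’s that sketch: a minimal tally of the per-node memory budget using the rule-of-thumb numbers above. Whatever isn’t handed to a daemon is left over for the OS and file-system cache:

```python
# Tally the rule-of-thumb heap allocations per node, in GB. Memory not
# assigned to a daemon stays available for the OS and page cache.
MASTER_HEAPS_GB = {"HBase Master": 4, "NameNode": 8, "JobTracker": 2}
SLAVE_HEAPS_GB = {"RegionServer": 12, "DataNode": 1,
                  "TaskTracker": 1, "ZooKeeper": 1}
TOTAL_RAM_GB = 24  # per the rule of thumb above

for role, heaps in [("Master node", MASTER_HEAPS_GB),
                    ("Slave server", SLAVE_HEAPS_GB)]:
    used = sum(heaps.values())
    print(f"{role}: {used} GB for daemons, "
          f"{TOTAL_RAM_GB - used} GB left for OS/cache")
```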
A typical RegionServer will have 6 to 12 disks, each 1 to 2 TB in size. Use 3.5″ SATA disks if you can, for the best reliability (though some newer servers may require 2.5″ disks). Use ext3 or ext4 file systems on the disks.
So, if you use 50 slave servers, each one will host about 7 TB of real data, or about 21 TB once you factor in r = 3.
As far as actual numbers go, it depends on your workload, but here’s a calculation you may find handy. Assuming roughly 100 IOPS and 100 MB/second of sequential throughput per SATA disk, four 1 TB disks in one node give you about 400 IOPS or 400 MB/second of transfer throughput for cold data access. If you instead use eight 500 GB disks, you get about 800 IOPS or 800 MB/second for the same total capacity.
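A small sketch of that disk math, folding in the per-node storage share from above (the ~100 IOPS and ~100 MB/s per-disk figures are my assumption about what those numbers are based on, typical for 7200 RPM SATA):

```python
# Per-node storage share plus the IOPS/throughput trade-off of
# many-small vs. few-large disks. Per-disk figures are assumptions
# (typical 7200 RPM SATA), not measurements.
RAW_TB_PER_YEAR = 365
REPLICATION = 3
SLAVES = 50
IOPS_PER_DISK = 100   # assumed
MBPS_PER_DISK = 100   # assumed sequential transfer rate

per_node = RAW_TB_PER_YEAR / SLAVES
print(f"Per node: {per_node:.1f} TB raw, "
      f"{per_node * REPLICATION:.1f} TB with r={REPLICATION}")

for n_disks, disk_tb in [(4, 1.0), (8, 0.5)]:
    print(f"{n_disks} x {disk_tb:.1f} TB disks -> "
          f"{n_disks * IOPS_PER_DISK} IOPS, "
          f"{n_disks * MBPS_PER_DISK} MB/s, "
          f"{n_disks * disk_tb:.0f} TB per node")
```

The point of the comparison is the spindle count: more, smaller disks buy you more random IOPS and aggregate throughput for the same total capacity.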
Don’t go over 16 GB for the RegionServer heap size, as heaps that large have been known to cause garbage-collection problems (long pauses).