HBase Forum

Disk space and memory recommendations

  • #4025
    Li Tao Zhen


    We plan on processing about 1 TB of log data daily and then storing it in HBase for querying.

    How much disk space and RAM should we allocate on each data node to ensure that the DataNode, MapReduce, and HBase all get enough resources?

    How much total disk space and RAM should we allocate to each?



  • #4056
    Sameer Farooqui

    Li – we typically don’t recommend running traditional MapReduce jobs on the same cluster hosting HBase. HBase applications demand low latency in the tens of milliseconds. If the region server nodes are simultaneously using their memory/disk/CPU resources for MR, then HBase latency could go up significantly.

    That being said, am I correct in assuming that you will be generating 1 TB of fresh log data daily? If so, you’ll have around 350 TB of data in HBase after a year, and I can help you brainstorm how to spec out such a cluster. Note that with replication factor = 3, you’ll actually need a cluster capable of storing about a PB of raw data.
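    The storage arithmetic above can be sketched as follows. This is only a back-of-the-envelope estimate: the function and constant names are made up for illustration, and it ignores compression, HBase overhead (WALs, compactions), and free-space headroom, none of which are quantified in this thread.

```python
from typing import Tuple

# Figures taken from the thread; everything else is an illustrative assumption.
DAILY_INGEST_TB = 1.0     # fresh log data per day (from the question)
DAYS_PER_YEAR = 365       # the reply rounds the yearly total to ~350 TB
REPLICATION_FACTOR = 3    # HDFS replication factor used in the reply


def yearly_storage_tb(daily_tb: float, days: int, replication: int) -> Tuple[float, float]:
    """Return (logical TB, raw TB after replication) for one year of ingest."""
    logical = daily_tb * days
    raw = logical * replication
    return logical, raw


logical_tb, raw_tb = yearly_storage_tb(DAILY_INGEST_TB, DAYS_PER_YEAR, REPLICATION_FACTOR)
print(f"logical: {logical_tb:.0f} TB, raw with r=3: {raw_tb:.0f} TB (~{raw_tb / 1000:.1f} PB)")
# prints: logical: 365 TB, raw with r=3: 1095 TB (~1.1 PB)
```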

    Just an FYI, a typical HBase slave node at Yahoo! looks like this: 8-core CPU, 24 GB RAM, 12 TB disk

    Rules of thumb to start with:

    Master Node: 24 GB RAM total, 8-core CPU, and RAID 0+1 or 1+0 local disks. Give the HBase Master daemon 4 GB, the NameNode (NN) 8 GB, and the JobTracker (JT) 2 GB.

    Secondary NameNode (SNN) daemon: 8 GB RAM

    Slave Servers: 24+ GB RAM total. Give the HBase region server 12 GB, the DataNode (DN) 1 GB, the TaskTracker (TT) 1 GB, and ZooKeeper (ZK) 1 GB.
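    As a quick sanity check on these rules of thumb, the sketch below sums the per-daemon allocations and reports what is left over on a 24 GB box for the OS, page cache, and any task JVMs. The helper name and dictionary layout are assumptions for illustration; only the GB figures come from the list above.

```python
# Per-daemon heap suggestions from the rules of thumb (values in GB).
MASTER_TOTAL_GB = 24
MASTER_DAEMONS_GB = {
    "HBase Master": 4,
    "NameNode (NN)": 8,
    "JobTracker (JT)": 2,
}

SLAVE_TOTAL_GB = 24
SLAVE_DAEMONS_GB = {
    "HBase region server": 12,
    "DataNode (DN)": 1,
    "TaskTracker (TT)": 1,
    "ZooKeeper (ZK)": 1,
}


def remaining_gb(total_gb, daemons_gb):
    """Return the RAM left after all daemon heaps are allocated."""
    used = sum(daemons_gb.values())
    if used > total_gb:
        raise ValueError(f"daemons need {used} GB but the node only has {total_gb} GB")
    return total_gb - used


print("master headroom:", remaining_gb(MASTER_TOTAL_GB, MASTER_DAEMONS_GB), "GB")
print("slave headroom:", remaining_gb(SLAVE_TOTAL_GB, SLAVE_DAEMONS_GB), "GB")
# prints: master headroom: 10 GB
# prints: slave headroom: 9 GB
```

    The leftover memory is not free to hand out entirely: the OS, the filesystem cache, and any MapReduce child tasks all draw from it.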

    A typical Region Server will have 6 – 12 disks, each 1 – 2 TB in size. Use 3.5″ SATA disks if you can for highest reliability (some new servers may require 2.5″ disks though). Use ext3 or ext4 file systems on the disks.

    So, if you use 50 slave servers, each one will host about 7 TB of real data, or 21 TB once you account for r=3.
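    The per-node figures can be reproduced with a few lines. The sketch also turns the question around and asks how many nodes a given amount of raw disk per node would require. The 24 TB per node used in that second step is a hypothetical value (twelve 2 TB disks, the top of the range mentioned above), and the calculation leaves no headroom for compactions or growth.

```python
import math

YEARLY_LOGICAL_TB = 350   # rounded yearly total used in the reply
REPLICATION_FACTOR = 3    # r=3 from the reply
SLAVE_NODES = 50          # node count from the reply


def per_node_tb(logical_tb, nodes, replication):
    """Return (logical TB, raw TB after replication) stored per slave node."""
    logical_per_node = logical_tb / nodes
    return logical_per_node, logical_per_node * replication


def min_nodes(logical_tb, replication, raw_tb_per_node):
    """Smallest node count whose combined raw disk holds the replicated data."""
    return math.ceil(logical_tb * replication / raw_tb_per_node)


logical, raw = per_node_tb(YEARLY_LOGICAL_TB, SLAVE_NODES, REPLICATION_FACTOR)
print(f"{SLAVE_NODES} nodes: {logical:.0f} TB logical, {raw:.0f} TB raw per node")
# prints: 50 nodes: 7 TB logical, 21 TB raw per node

# Hypothetical: nodes needed if each one carries 24 TB of raw disk.
print(min_nodes(YEARLY_LOGICAL_TB, REPLICATION_FACTOR, 24))
# prints: 44
```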

    As far as actual numbers go, it depends on your workload. Here’s a calculation you may find handy. Four 1 TB disks in one node give you about 400 IOPS or 400 MB/second of transfer throughput for cold data access (roughly 100 IOPS and 100 MB/second per disk). If you instead use eight 500 GB disks, you get the same capacity with 800 IOPS or 800 MB/second of transfer.
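    The disk comparison boils down to multiplying a per-spindle figure by the number of spindles. The sketch below assumes roughly 100 IOPS and 100 MB/second per disk, which is what the "four disks, 400 IOPS" figure implies; real drives and controllers will vary, and the function name is made up for illustration.

```python
# Per-disk figures implied by the reply; treat them as rough assumptions.
PER_DISK_IOPS = 100
PER_DISK_MB_PER_S = 100


def node_disk_profile(disk_count, disk_size_gb):
    """Return (capacity GB, aggregate IOPS, aggregate MB/s) for one node."""
    capacity_gb = disk_count * disk_size_gb
    iops = disk_count * PER_DISK_IOPS
    throughput = disk_count * PER_DISK_MB_PER_S
    return capacity_gb, iops, throughput


for count, size_gb in [(4, 1000), (8, 500)]:
    capacity, iops, mb_s = node_disk_profile(count, size_gb)
    print(f"{count} x {size_gb} GB: {capacity} GB, {iops} IOPS, {mb_s} MB/s")
# prints: 4 x 1000 GB: 4000 GB, 400 IOPS, 400 MB/s
# prints: 8 x 500 GB: 4000 GB, 800 IOPS, 800 MB/s
```

    Both layouts store the same 4 TB, but the one with more spindles doubles the aggregate IOPS and throughput under this simple linear model.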

    Don’t go over 16 GB for the region server heap size, as this has been known to cause issues with garbage collection.
