Disk space and memory recommendations


This topic contains 1 reply, has 2 voices, and was last updated by Sameer Farooqui 3 years, 3 months ago.

  • Creator
  • #4025

    Li Tao Zhen


    We plan on processing about 1 TB of log data daily and then storing it in HBase for querying.

    How much disk space / RAM should we allocate on each data node to ensure that the DataNode, MapReduce, and HBase daemons all get enough resources?

    How much total disk space and RAM should we allocate to each?





  • Author
  • #4056

    Li – we typically don’t recommend running traditional MapReduce jobs on the same cluster hosting HBase. HBase applications demand low latency in the tens of milliseconds. If the region server nodes are simultaneously using their memory/disk/CPU resources for MR, then HBase latency could go up significantly.

    That being said, am I correct in assuming that you will be generating 1 TB of fresh log data daily? If so, in a year you’ll have around 365 TB of data in HBase, so I can help you brainstorm how to spec out such a cluster. Actually, with replication factor = 3, you’ll need a cluster capable of storing about a petabyte of data.
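The storage math above can be sketched quickly. A back-of-the-envelope estimate, assuming 1 TB/day of ingest, one year of retention, and HDFS replication factor 3 (the figures from this thread; adjust for your own retention policy):

```python
# Back-of-the-envelope HBase cluster storage sizing.
# Assumptions: 1 TB/day ingest, 365 days retained, HDFS replication = 3.
daily_ingest_tb = 1
retention_days = 365
replication = 3

raw_tb = daily_ingest_tb * retention_days   # real (pre-replication) data
stored_tb = raw_tb * replication            # actual bytes on disk

print(raw_tb)      # 365 TB of real data per year
print(stored_tb)   # 1095 TB, i.e. about a petabyte on disk
```

Note that enabling compression on the HBase column families typically shrinks the on-disk footprint considerably, so treat this as a ceiling rather than a target.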

    Just an FYI, a typical HBase slave node at Yahoo! looks like this: 8-core CPU, 24 GB RAM, 12 TB disk

    Rules of thumb to start with:

    Master Node: 24 GB RAM total; give the HBase Master daemon 4 GB, the NameNode (NN) 8 GB, and the JobTracker (JT) 2 GB. Use an 8-core CPU and RAID 1+0 (or 0+1) local disks.

    Secondary NameNode (SNN) daemon: 8 GB RAM

    Slave Servers: 24+ GB RAM total; give the HBase region server 12 GB, the DataNode (DN) 1 GB, the TaskTracker (TT) 1 GB, and ZooKeeper (ZK) 1 GB.
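As a sketch of where those heap sizes actually get set, the stock Hadoop 1.x / HBase config files expose them as environment variables (`HBASE_HEAPSIZE` in `hbase-env.sh`, the per-daemon `*_OPTS` variables in `hadoop-env.sh`). The values below are just the rules of thumb above and should be tuned to your hardware:

```shell
# hbase-env.sh on the slave (region server) nodes.
# HBASE_HEAPSIZE is specified in MB; 12288 MB = 12 GB per the guideline above.
export HBASE_HEAPSIZE=12288

# hadoop-env.sh on the master node: 8 GB NameNode heap, 2 GB JobTracker heap.
export HADOOP_NAMENODE_OPTS="-Xmx8g $HADOOP_NAMENODE_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Xmx2g $HADOOP_JOBTRACKER_OPTS"
```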

    A typical Region Server will have 6 – 12 disks, each 1 – 2 TB in size. Use 3.5″ SATA disks if you can for highest reliability (some new servers may require 2.5″ disks though). Use ext3 or ext4 file systems on the disks.

    So, if you use 50 slave servers, each one will host about 7 TB of real data, but 21 TB after considering r=3.

    As far as actual numbers go, it depends on your workload. Here’s a calculation you may find handy. Four 1 TB disks in one node give you roughly 400 IOPS or 400 MB/second transfer throughput for cold data access. If you instead use eight 500 GB disks, you get roughly 800 IOPS or 800 MB/second.
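That per-node throughput arithmetic is easy to parameterize. A small sketch, assuming the rough rule implied above of ~100 IOPS and ~100 MB/s of sequential transfer per SATA spindle (real disks vary):

```python
# Rough cold-read capacity of a node as a function of spindle count,
# assuming ~100 IOPS and ~100 MB/s sequential transfer per SATA disk.
def node_throughput(num_disks, iops_per_disk=100, mbps_per_disk=100):
    """Return (total IOPS, total MB/s) for a node with num_disks spindles."""
    return num_disks * iops_per_disk, num_disks * mbps_per_disk

print(node_throughput(4))   # four 1 TB disks
print(node_throughput(8))   # eight 500 GB disks
```

The point of the comparison: more, smaller spindles buy you random-read throughput at the same total capacity, which is why the eight-disk layout doubles the IOPS.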

    Don’t go over 16 GB for the region server heap size, as this has been known to cause issues with garbage collection.
