
HDP on Linux – Installation Forum

Hadoop cluster planning

  • #28837


    I am planning to create a Hadoop pre-production cluster of 12 nodes (10 slaves and 2 masters) and I have a number of open questions:

    1) Should slave nodes use local storage, or can they use network storage too? What performance impact should be expected?
    2) On which server (from a hardware point of view) should the Secondary NameNode be deployed? I am going to configure the NameNodes for HA.
    3) Is it possible to install a NameNode HA cluster using Ambari?
    4) How can I decide whether HBase is necessary? I need to decide about HBase installation in order to build the hardware requirements.
    5) On which nodes (master or slave) should Hive, ZooKeeper, Nagios, Ganglia, Pig, Oozie and Sqoop be installed? Or should they go on an additional machine outside the Hadoop cluster? If so, what are its hardware requirements?
    6) Should the Hive metastore be local or remote? How can I calculate the future capacity of the Hive metastore?

    Thank you in advance.

  • #28865

    Hi David,

    I’ll answer your questions as best I can:
    1) Though slave nodes can be configured to use network storage, it is better to use local storage. If they are configured to use network storage, there will be a performance hit during processing, since the data has to be transferred to the node on which it is processed. With local storage (actually a piece of HDFS), data transfer is minimized because Hadoop tries to process the data on the node where it is stored.
    2) The Secondary NameNode daemon should run on a separate machine from either of the HA NameNode boxes.
    3) Though it is possible to install Ambari on an HA cluster, it is not currently supported.
    4) If you need real-time updates to the data, or have need of a columnar database, then you should install HBase.
    5) The server daemons for each of the services you list should be installed on one of the slave nodes, with the exception of ZooKeeper, which should be installed on several (an odd number) of the slave nodes. The clients for these services should also be installed on slave nodes. The basic picture is that the master nodes should only run the NameNode and JobTracker daemons; everything else should ideally be on the slaves, taking care not to run too many services on a single node and overload it.
    6) The metastore for Hive is ideally installed locally to the Hive server. Its future capacity depends on which database you decide to use.
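    Point 6 can be sketched as a hive-site.xml fragment. This is a minimal sketch assuming a MySQL-backed metastore running on the same host as the Hive server; the database name, user and password are placeholders, not recommended values:

```xml
<!-- hive-site.xml: Hive metastore backed by a local MySQL database.
     Host, database name, user and password below are placeholders. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```

    For capacity planning, note that the metastore only holds table, column and partition metadata, so it grows with the number of tables and partitions rather than with the data volume, and typically stays small.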



    Thanks a lot Ted.
    As I understand it, I can deploy the Secondary NameNode on one of the slave boxes?


    Hi David,

    Yes, you can run the Secondary NameNode on one of the slaves. In fact, that is the way a small cluster is configured by default when installed with Ambari.
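    For a manually configured Hadoop 1.x cluster (NameNode/JobTracker, as discussed above), the slave chosen to run the Secondary NameNode is the host listed in conf/masters, and its checkpointing is tuned in core-site.xml. The values below are just the stock defaults, shown for illustration; the checkpoint directory is a placeholder path:

```xml
<!-- core-site.xml: checkpoint settings used by the Secondary NameNode.
     fs.checkpoint.period is the Hadoop 1.x default (one hour);
     fs.checkpoint.dir is a placeholder local path on the chosen slave. -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- seconds between checkpoints of the namespace image -->
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/hadoop/hdfs/namesecondary</value> <!-- where checkpoints are stored -->
</property>
```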


