Hadoop cluster planning

This topic contains 3 replies, has 2 voices, and was last updated by tedr 1 year, 4 months ago.

#28837

    dgreenshtein
    Member

    Hi,

I am planning to create a pre-production Hadoop cluster of 12 nodes (10 slaves and 2 masters), and I have a number of open questions:

1) Should slave nodes use local storage, or can they use network storage too? What performance impact should be expected?
2) On which server (from a hardware point of view) should the Secondary NameNode be deployed? I am going to configure the NameNodes for HA.
3) Is it possible to install a NameNode HA cluster using Ambari?
4) How can I decide whether HBase is necessary? I need to decide about HBase installation in order to build the hardware requirements.
5) On which nodes (master or slave) should Hive, ZooKeeper, Nagios, Ganglia, Pig, Oozie, and Sqoop be installed? Or should they go on an additional machine outside the Hadoop cluster? If so, what are the hardware requirements?
6) Should the Hive metastore be local or remote? How can I estimate the future capacity of the Hive metastore?

    Thank you in advance.


#29006

    tedr
    Moderator

    Hi David,

Yes, you can run the Secondary NameNode on one of the slaves. In fact, that is how a small cluster is configured by default when installed with Ambari.

    Thanks,
    Ted.

#28995

    dgreenshtein
    Member

Thanks a lot Ted,
As I understand it, I can deploy the Secondary NameNode on one of the slave boxes?

#28865

    tedr
    Moderator

    Hi David,

I’ll answer your questions the best that I can:
1) Though slave nodes can be configured to use network storage, it is better to use local storage. If they are configured to use network storage there will be a performance hit during processing, because the data would need to be transferred over the network to the node that is processing it. With local storage (actually a piece of HDFS), data transfer is minimized because Hadoop tries to schedule processing on the node where the data is stored.
2) The Secondary NameNode daemon should run on a separate machine from either of the HA NameNode boxes.
3) Though it is possible to install Ambari on an HA cluster, it is not currently supported.
4) If you need real-time updates to the data, or have need of a columnar database, then you should install HBase.
5) The server daemons for each of the services you list should be installed on one of the slave nodes, with the exception of ZooKeeper, which should be installed on several (an odd number) of the slave nodes. The clients for these services should also be installed on one of the slave nodes. The basic picture is that the master nodes should run only the NameNode and JobTracker daemons; everything else should ideally go on the slaves, taking care not to run too many services on a single node and overload it.
6) The metastore for Hive is ideally installed locally to the Hive server. Estimating its future capacity depends on which database you decide to use for it.
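On the "odd number" point in (5): a ZooKeeper ensemble stays available only while a strict majority of its servers are up, so a fourth server adds no fault tolerance over three. A small sketch of the arithmetic (plain Python, just for illustration):

```python
# Quorum arithmetic for a ZooKeeper ensemble of n servers.
# The ensemble is available while a strict majority is still running.

def tolerated_failures(n):
    """How many servers can fail while a majority survives."""
    majority = n // 2 + 1
    return n - majority

for n in (3, 4, 5):
    print(n, "servers -> tolerates", tolerated_failures(n), "failure(s)")
# 3 servers -> tolerates 1 failure(s)
# 4 servers -> tolerates 1 failure(s)
# 5 servers -> tolerates 2 failure(s)
```

This is why ensembles of 3 or 5 are the usual choice: an even count costs an extra machine without improving availability.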

    Thanks,
    Ted.
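One note on (6): if you later decide to move the metastore database off the Hive server, the standard hive-site.xml connection properties look roughly like this. This is only a sketch; the host name, database name, and user below are placeholders, not recommendations:

```xml
<!-- hive-site.xml: Hive metastore backed by a MySQL database (sketch only).
     Host, database name, and user are hypothetical placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-db.example.com/hivemeta</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
```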
