
HDFS Forum

How easily understand hadoop and their nodes?

  • #20800
    eli vani

I am doing my project in Hadoop, but I still can't properly understand what Hadoop is and what its nodes are (NameNode, TaskTracker, JobTracker). Can you please give me an easy explanation of this? It will be helpful for my career. Thank you.

  • Author
  • #20829

    Hi Eli,

    Thanks for your question.

    The "easy explanation" follows:

    These are all daemon processes.

NameNode – Keeps track of which DataNodes each piece of data is stored on. The data is broken into blocks, each block is replicated a number of times, and the replicas are stored on various DataNodes throughout the cluster. There is usually only one NameNode in a cluster.

DataNodes – the nodes on which the data is actually stored. As mentioned above, the data is split into blocks, each block is replicated a number of times (usually 3), and the replicas are placed on different DataNodes to provide fault tolerance. There can be any number of DataNodes in a cluster, from one to thousands. The number is limited mainly by the RAM available on the NameNode, since the NameNode keeps its working data in RAM. The DataNodes store the actual data on their hard drives.
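To make the block-and-replication arithmetic concrete, here is a small, hypothetical sketch in plain Java (the 200 MB file size is made up for illustration; 64 MB and 3 were the usual Hadoop 1.x defaults for block size and replication):

```java
public class BlockMath {

    // Ceiling division: the last block of a file may be only partially full.
    static long numBlocks(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
    }

    public static void main(String[] args) {
        long fileSizeMb = 200;   // hypothetical file
        long blockSizeMb = 64;   // Hadoop 1.x default block size
        int replication = 3;     // default replication factor

        long blocks = numBlocks(fileSizeMb, blockSizeMb);   // 4 blocks (3 full + 1 partial)
        long rawStorageMb = fileSizeMb * replication;       // every block stored 3 times

        System.out.println(blocks + " blocks, " + rawStorageMb + " MB raw storage");
        // prints: 4 blocks, 600 MB raw storage
    }
}
```

So a 200 MB file occupies four blocks, but the cluster spends 600 MB of raw disk on it, because every block exists on three different DataNodes.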

JobTracker – When a job for analyzing this data is submitted, this daemon talks to the NameNode to find out which nodes the data is stored on, so that it can split the job into tasks and send each task to the TaskTracker closest to the data being processed. There is also usually only one JobTracker per cluster.

TaskTracker – these daemons are sent individual tasks to perform by the JobTracker. They perform each task and send the results back to the JobTracker. There can be, and should be, as many TaskTrackers in a cluster as there are DataNodes; a typical Hadoop cluster runs a TaskTracker on every DataNode. Under ideal circumstances, each piece of data is then processed on the node where it resides, reducing the need to transfer data around the cluster for processing.
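As a rough sketch of that data-locality idea (plain Java, not the real JobTracker code — the node names and the `pickTracker` helper are made up for illustration), the placement decision amounts to: prefer a TaskTracker sitting on a node that holds a replica, and fall back to any free tracker only when no local one is available:

```java
import java.util.*;

// Hypothetical illustration of data-local task assignment.
public class LocalityPick {

    static String pickTracker(Set<String> replicaNodes, List<String> freeTrackers) {
        for (String tracker : freeTrackers) {
            if (replicaNodes.contains(tracker)) {
                return tracker;        // data-local: the task runs where a replica lives
            }
        }
        return freeTrackers.get(0);    // remote fallback: the block must travel over the network
    }

    public static void main(String[] args) {
        // A block replicated on three nodes, and the trackers with free task slots:
        Set<String> replicas = new HashSet<>(Arrays.asList("node2", "node5", "node7"));
        List<String> free = Arrays.asList("node1", "node5", "node9");

        System.out.println(pickTracker(replicas, free));  // prints: node5
    }
}
```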

    I hope this helps your understanding of Hadoop.

    Good Boy

    Hi Ted,

This is really a nice and easy explanation. Thanks a lot.
I have a few doubts; could you please clear them up?
You mentioned that "each block of data is replicated a number of times (usually 3) and these replicas are placed on different DataNodes" and that "the JobTracker will find the TaskTrackers closest to the data blocks the job needs to process." Will the JobTracker send the task to all the TaskTrackers closest to all the replicas of a data block (usually 3, as you mentioned earlier), or to just one of them?

Could you also explain how MapReduce processing works? That would be a great help.

    Thanks in advance


Hi Member,
The JobTracker will assign each task to only one TaskTracker, the one on the same node as the block it needs to work on. As for the whole MapReduce process, Hortonworks provides a developer training course that covers it. If you are interested in taking the course, here is a link:

    Kind Regards,

The topic ‘How easily understand hadoop and their nodes?’ is closed to new replies.
