How to recover a failed DataNode gracefully

This topic contains 9 replies, has 6 voices, and was last updated by Seth Lyubich 1 year, 1 month ago.

  • Creator
    Topic
  • #11635

    Dear all:
    I have encountered a very basic and very common problem: I cannot recover a failed Hadoop DataNode gracefully without stopping all services on all nodes and then starting all services on all nodes.

    Originally, I had a Hadoop cluster with 5 nodes (1 NameNode: host001, plus 4 DataNodes: host002, host003, host004, host005). When I shut off a DataNode (host005), HMC detected that the host005 DataNode was down and showed a blinking warning on the HMC monitor. However, when I powered the host005 DataNode back on, it could not recover its Hadoop services, such as the DataNode and TaskTracker, so I still had only 4 workable Hadoop nodes.
    My method to recover host005's DataNode service is to "stop all services, then start all services" in HMC, but that is clumsy and not practical in the real world, so I would like to know if anybody can suggest a better way to follow.
    What should I do if I want to recover a failed DataNode without stopping all services and then starting all services?
    What is the correct procedure?
    Regards,
    Jeff



  • Author
    Replies
  • #17619

    Seth Lyubich
    Keymaster

    Hi Vipul,

    Can you please clarify whether your issue is related to the original issue in this post? If your issue concerns the 2.0 alpha, can you please post your question here:

    http://hortonworks.com/community/forums/forum/hdp-2-0-alpha-feedback-2/

    Thanks for using HDP.

    Thanks,
    Seth

    #17578

    vipul chavda
    Member

    Hi Ted,
    I am using Hortonworks 2.0 alpha. When I started the deployment, I got the message that the cluster install was complete, but the HDFS start failed. I got the message below.

    Deploy Logs

    {
    "2": {
    "nodeReport": {
    "PUPPET_KICK_FAILED": [],
    "PUPPET_OPERATION_FAILED": [],
    "PUPPET_OPERATION_TIMEDOUT": [],
    "PUPPET_OPERATION_SUCCEEDED": [
    "hp1.limco.com",
    "hp2.limco.com"
    ]
    },

    "\"Thu Mar 14 17:08:26 -0700 2013 Scope(Hdp2::Configfile[/etc/hadoop/conf/hdfs-site.xml]) (warning): Could not look up qualified variable '::hdp-hadoop::params::dfs_datanode_failed_volume_tolerated'; class ::hdp-hadoop::params has not been evaluated\"",

    "\"Thu Mar 14 17:13:45 -0700 2013 /Stage[2]/Hdp2-hadoop::Namenode/Hdp2-hadoop::Service[namenode]/Hdp2::Exec[su - hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode']/Exec[su - hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode']/returns (err): change from notrun to 0 failed: su - hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode' returned 1 instead of one of [0] at /etc/puppet/agent/modules/hdp2/manifests/init.pp:255\"",
    "\"Thu Mar 14 17:13:45 -0700 2013 /Stage[2]/Hdp2-hadoop::Namenode/Hdp2-hadoop::Service[namenode]/Hdp2::Exec[su - hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode']/Anchor[hdp2::exec::su - hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode'::end] (warning): Skipping because of failed dependencies\"",
    "\"Thu Mar 14 17:13:45 -0700 2013 /Stage[2]/Hdp2-hadoop::Namenode/Hdp2-hadoop::Service[namenode]/Hdp2::Exec[sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` >/dev/null 2>&1]/Anchor[hdp2::exec::sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` >/dev/null 2>&1::begin] (warning): Skipping because of failed dependencies\"",
    "\"Thu Mar 14 17:13:45 -0700 2013 /Stage[2]/Hdp2-hadoop::Namenode/Hdp2-hadoop::Service[namenode]/Hdp2::Exec[sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` >/dev/null 2>&1]/Exec[sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` >/dev/null 2>&1] (warning): Skipping because of failed dependencies\"",
    "\"Thu Mar 14 17:13:45 -0700 2013 /Stage[2]/Hdp2-hadoop::Namenode/Hdp2-hadoop::Service[namenode]/Hdp2::Exec[sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` >/dev/null 2>&1]/Anchor[hdp2::exec::sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/ha

    #17571

    tedr
    Moderator

    Hi Vipul,

    Thanks for trying the Hortonworks Data Platform.

    A bit more information about what you are attempting here would be helpful. Was this an attempt to add nodes? A new install? What version of HDP is this happening on? You could also help us by following the instructions here: http://hortonworks.com/community/forums/topic/hmc-installation-support-help-us-help-you

    Thanks,
    Ted.

    #17525

    vipul chavda
    Member

    Hi,
    Deploy Logs

    {
    "2": {
    "nodeReport": {
    "PUPPET_KICK_FAILED": [],
    "PUPPET_OPERATION_FAILED": [],
    "PUPPET_OPERATION_TIMEDOUT": [],
    "PUPPET_OPERATION_SUCCEEDED": [
    "hp1.limco.com",
    "hp2.limco.com"
    ]
    },

    "\"Thu Mar 14 17:08:26 -0700 2013 Scope(Hdp2::Configfile[/etc/hadoop/conf/hdfs-site.xml]) (warning): Could not look up qualified variable '::hdp-hadoop::params::dfs_datanode_failed_volume_tolerated'; class ::hdp-hadoop::params has not been evaluated\"",

    "\"Thu Mar 14 17:13:45 -0700 2013 /Stage[2]/Hdp2-hadoop::Namenode/Hdp2-hadoop::Service[namenode]/Hdp2::Exec[su - hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode']/Exec[su - hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode']/returns (err): change from notrun to 0 failed: su - hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode' returned 1 instead of one of [0] at /etc/puppet/agent/modules/hdp2/manifests/init.pp:255\"",
    "\"Thu Mar 14 17:13:45 -0700 2013 /Stage[2]/Hdp2-hadoop::Namenode/Hdp2-hadoop::Service[namenode]/Hdp2::Exec[su - hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode']/Anchor[hdp2::exec::su - hdfs -c '/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode'::end] (warning): Skipping because of failed dependencies\"",
    "\"Thu Mar 14 17:13:45 -0700 2013 /Stage[2]/Hdp2-hadoop::Namenode/Hdp2-hadoop::Service[namenode]/Hdp2::Exec[sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` >/dev/null 2>&1]/Anchor[hdp2::exec::sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` >/dev/null 2>&1::begin] (warning): Skipping because of failed dependencies\"",
    "\"Thu Mar 14 17:13:45 -0700 2013 /Stage[2]/Hdp2-hadoop::Namenode/Hdp2-hadoop::Service[namenode]/Hdp2::Exec[sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` >/dev/null 2>&1]/Exec[sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` >/dev/null 2>&1] (warning): Skipping because of failed dependencies\"",
    "\"Thu Mar 14 17:13:45 -0700 2013 /Stage[2]/Hdp2-hadoop::Namenode/Hdp2-hadoop::Service[namenode]/Hdp2::Exec[sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` >/dev/null 2>&1]/Anchor[hdp2::exec::sleep 5; ls /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` >/dev/null 2>&1::end] (warning): Skipping because of failed dependencies\"",
    "\"Thu Mar 14 17:13:45 -0700 2013 /Stage

    #11724

    Sasha J
    Moderator

    Jeff,
    it is not clear what you mean by this:
    "(2) host005 has replicate problem, so we can't add node"

    As for your questions:
    HDP supports adding new nodes and decommissioning nodes.
    HMC will support node decommissioning in a future release.
    It looks like you are mixing up HDP and HMC:
    HDP is the Hortonworks Data Platform, the actual Hadoop distribution and its ecosystem.
    HMC is the Hortonworks Management Center, the management and monitoring component, which is in its early implementation stages.

    In general, if the cluster is working correctly and one of the nodes loses power and then comes back up, you do not need to do anything; it will rejoin the cluster automatically.
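
    For example, a quick way to confirm the rejoin from the NameNode (a sketch using the stock dfsadmin command; hostnames are the ones from this thread):

    # On the NameNode (host001), as the hdfs user:
    hadoop dfsadmin -report
    # Once host005 has rejoined, it is listed as a live datanode again
    # and the dead-node count drops back to 0.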

    However, it seems like you did something incorrectly, given that you have been asking installation questions in this forum for many weeks in a row.

    Would you like to take this offline and have a WebEx session to perform a clean install and then simulate node failure and recovery?

    Let us know.

    Thank you!
    Sasha

    #11714

    Dear Ted:

    I used "Add Nodes", and it still failed.
    Below are the steps:
    (1) Add node host005.
    (2) host005 had a replication problem, so the node could not be added.
    (3) I changed the hostname host005 to a new name, host105, to avoid the replication problem.
    (4) Add node host105.
    (5) HMC failed to add the node.
    (6) HMC said to please uninstall the cluster.
    …too bad…

    It seems like Hortonworks HDP 1.1 is not suitable for production.
    If HDP 1.1 does not support decommissioning,
    how can we regularly maintain the Hadoop cluster nodes under HDP 1.1?
    How can we remove a bad DataNode and then replace it with a new one?
    What if a DataNode suddenly loses power and gets it back later?
    How do we recover that DataNode?
    I don't know if Hortonworks HDP 1.1 can handle this basic problem,
    but I do hope HDP 1.1 can be more competitive in the Hadoop world.

    Thanks for your great help!

    Regards,
    Jeff

    #11703

    tedr
    Member

    Jeff,

    Once the node has been properly (or accidentally) removed from the cluster, the way to get it, or any new node, back into the cluster is to use the "add node" facility of HMC. I had assumed that you wanted to bring the node on which you simulated the failure back into the cluster without treating it as a new node.

    Ted.

    #11700

    Dear Ted :

    I still failed to recover my DataNode host005.

    Originally, I had 5 nodes in the Hadoop cluster (1 NameNode: host001, plus 4 DataNodes: host002, host003, host004, host005). Later, I tried to simulate a failure of a Hadoop DataNode by powering off the host005 node. About 5 minutes after powering off host005, I:
    (1) powered up host005
    (2) service hmc-agent start ===> start puppet on host005
    (3) $ hadoop datanode ===> start the DataNode on host005
    (4) $ hadoop tasktracker ===> start the TaskTracker on host005

    But from then on, no matter what I did, the host005 DataNode never returned to a normal condition.

    Phenomenon:
    1. The HMC dashboard shows:
    Service State Critical Warning
    PUPPET Down 1 0
    HBASE Running 0 1
    HDFS Running 0 1
    HIVE-METASTORE Running 0 0
    MAPREDUCE Running 0 1
    OOZIE Running 0 0
    TEMPLETON Running 0 0
    ZOOKEEPER Running 0 0

    2. HDFS:
    DataNodes (live/dead/decom) 3 / 1 / 0

    3. MapReduce:
    Trackers (live/total) 3 / 4

    Tryout 2:
    Decommission host005, then add node host005 again. ===> Failed too!

    I have also tried the decommission procedure, but in the end it had no effect.
    1.0 hdfs-site.xml
    dfs.exclude
    /etc/excluded.list
    1.1 excluded.list
    host006
    1.2 $ hadoop dfsadmin -refreshNodes (on NameNode host001)

    1.3 $ hadoop dfsadmin -report ===> I saw host005 decommission OK in this report

    2.0 mapred-site.xml
    mapred.exclude
    /etc/mapred_excluded.list
    2.1 mapred_excluded.list
    host006
    2.2 $ hadoop mradmin -refreshNodes (on NameNode host001)
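
    For reference, the standard Hadoop 1.x exclude configuration uses the property names dfs.hosts.exclude and mapred.hosts.exclude; a sketch using the file paths from the steps above:

    <!-- hdfs-site.xml on the NameNode -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/excluded.list</value>
    </property>

    <!-- mapred-site.xml on the JobTracker -->
    <property>
      <name>mapred.hosts.exclude</name>
      <value>/etc/mapred_excluded.list</value>
    </property>

    The exclude files list one hostname per line, exactly as the NameNode and JobTracker know the nodes, and hadoop dfsadmin -refreshNodes / hadoop mradmin -refreshNodes must be rerun after each change.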

    After all this,
    my questions are:
    1. How can I successfully remove a malfunctioning DataNode, and then add a substitute DataNode for the bad node, via HMC or another service?
    2. If HMC does not currently support decommissioning, is there any other method that meets my requirement? (When a DataNode fails, we need to remove it with a proper procedure and then add a new node as a substitute.)

    Any response would be much appreciated.

    Regards,
    Jeff

    #11639

    tedr
    Member

    Jeff,

    The process should be:
    * power on the box
    * manually restart the DataNode and TaskTracker services (these will not start automatically when the machine boots)
    * once these services are started, they should join the cluster automatically, provided that the NameNode is already running.

    This assumes that you have fixed whatever caused the DataNode to die in the first place. Also note that the proper procedure for shutting down a node, if you cannot stop the whole cluster, is to stop the Hadoop services on the box first, and only shut the computer down after they have fully stopped. A concrete sketch of the restart commands follows.
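
    For example (a sketch assuming the HDP 1.x layout, where hadoop-daemon.sh lives under /usr/lib/hadoop/bin and the daemons run as the hdfs and mapred users; adjust paths and users to your install):

    # On the recovered DataNode (host005), after the box is back up:
    su - hdfs -c "/usr/lib/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start datanode"
    su - mapred -c "/usr/lib/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start tasktracker"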

    Ted.
