How to fix HMC node only without reinstalling a new cluster?

This topic contains 7 replies, has 3 voices, and was last updated by tedr 1 year, 7 months ago.

  • #13179

    Hi all,

    My HMC node is failing with “JSON Parse failed”. The failure was caused by a sudden loss of AC 110V power last night.
    I rebooted my Hadoop cluster (5 nodes) this morning, hoping it would come back to life.
    Unfortunately, the HMC web page still shows “JSON Parse failed”.
    I had no idea how to fix “JSON Parse failed”, so I chose to remove HMC and install it again.

    After running
    # yum remove hmc
    # yum install hmc
    # service hmc start
    the HMC web page went to the Hortonworks welcome page,
    “Welcome to Hortonworks Management Center”, with a button labeled “get started”,
    which is a sign that a new cluster is about to be installed.
    I did not mean to reinstall the whole cluster. I just want to fix HMC, which got stuck with “JSON Parse failed” after last night's sudden power loss.

    Question 1:
    Does anybody know how to safely recover my HMC without having to reinstall a new cluster?
    Is there a way to fix HMC without reinstalling the Hadoop cluster if HMC fails again for some reason in the future?

    Regards,

    Jeff

Viewing 7 replies - 1 through 7 (of 7 total)

  • #13250

    tedr
    Member

    Hi Jeff,

    On host006, can you send us the DataNode logs? This will let us see whether the DataNode has a problem starting up on that server. The log is usually located in the /var/log/hadoop/hdfs directory and will have ‘datanode’ in the filename.

    Thanks,
    Ted.
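
For readers following along, here is a minimal Python sketch of how the logs Ted asks for could be located. It only assumes the directory and the ‘datanode’ filename fragment mentioned above; the function names `find_datanode_logs` and `tail` are hypothetical helpers, not part of HMC or Hadoop.

```python
from pathlib import Path


def find_datanode_logs(log_dir="/var/log/hadoop/hdfs"):
    """Return files in log_dir whose names contain 'datanode', newest first."""
    directory = Path(log_dir)
    if not directory.is_dir():
        # The directory is only the usual location; it may differ per install.
        return []
    logs = [
        p for p in directory.iterdir()
        if p.is_file() and "datanode" in p.name.lower()
    ]
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)


def tail(path, lines=50):
    """Return the last `lines` lines of a text file as a list of strings."""
    with open(path, errors="replace") as handle:
        return handle.readlines()[-lines:]


# Print the end of each DataNode log found on this host.
for log in find_datanode_logs():
    print(log)
    print("".join(tail(log)))
```

On a host without that directory the loop simply prints nothing.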

    #13237

    Hi tedr:

    Thanks for your help.
    After reinstalling the cluster, I ran into a new problem.
    Here is what I did:
    (1) Reinstalled CentOS 6.3 on 10 nodes
    (2) Prepared the BIT
    (3) service hmc start
    (4) Deployed the cluster
    3 master nodes:
    host011: NameNode ok
    host012: SNN ok + JobTracker ok
    host013: HBase ok
    7 slave nodes:
    host001: HMC ok + TaskTracker ok + DataNode ok + RegionServer ok
    host002: TaskTracker ok + DataNode ok + RegionServer ok
    host003: TaskTracker ok + DataNode ok + RegionServer ok
    host004: TaskTracker ok + DataNode ok + RegionServer ok
    host006: TaskTracker ok + RegionServer ok
    host011: TaskTracker ok + DataNode + RegionServer ok
    host012: TaskTracker ok + DataNode ok + RegionServer ok

    If you run jps on host006, you see only TaskTracker and
    HRegionServer, no DataNode:
    [root@host006 ~]# jps
    32329 HRegionServer
    6938 Jps
    31474 TaskTracker

    So the 7 slaves have 6 DataNodes + 7 TaskTrackers + 7 RegionServers.
    There should be 7 DataNodes among the 10 nodes.
    Why are there only 6 DataNodes, even after I uninstalled and reinstalled 3 more times?
    Is this just a network issue, or is it an HMC bug?

    P.S. Nagios alerts:
    [01-08-2013 00:03:08] Caught SIGTERM, shutting down…
    Service Warning[01-08-2013 00:02:53] SERVICE ALERT: host011.dmo.com;HDFS::Percent DataNodes down;WARNING;HARD;3;WARNING: total:, affected:
    Service Critical[01-08-2013 00:02:43] SERVICE ALERT: host006.dmo.com;DATANODE::Process down;CRITICAL;SOFT;2;Connection refused
    Service Warning[01-08-2013 00:02:33] SERVICE ALERT: host011.dmo.com;HDFS::Percent DataNodes down;WARNING;SOFT;2;WARNING: total:, affected:
    Service Warning[01-08-2013 00:02:23] SERVICE ALERT: host011.dmo.com;HDFS::Percent DataNodes down;WARNING;SOFT;1;WARNING: total:, affected:
    Service Critical[01-08-2013 00:02:13] SERVICE ALERT: host006.dmo.com;DATANODE::Process down;CRITICAL;SOFT;1;Connection refused
    Program Start[01-08-2013 00:01:03] Nagios 3.2.3 starting… (PID=12062)
    Program End[01-08-2013 00:01:02] Caught SIGTERM, shutting down…
    Program Start[01-08-2013 00:01:01] Nagios 3.2.3 starting… (PID=11985)

    Regards,
    Jeff
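
The per-host check Jeff does by eye can be scripted. The Python sketch below parses `jps` output of the form shown above (`<pid> <name>` per line) and reports which expected daemons are absent. The expected set and the helper name `missing_daemons` are assumptions chosen for a slave node in this thread, not anything HMC provides.

```python
# Daemons expected on a slave node in this thread (an assumption for the demo).
EXPECTED = {"DataNode", "TaskTracker", "HRegionServer"}


def missing_daemons(jps_output, expected=EXPECTED):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Keep only well-formed "<pid> <name>" lines.
        if len(parts) == 2 and parts[0].isdigit():
            running.add(parts[1])
    return sorted(set(expected) - running)


# The jps output quoted from host006.
sample = """32329 HRegionServer
6938 Jps
31474 TaskTracker"""

print(missing_daemons(sample))
```

For the host006 sample, the only missing name is DataNode, which matches the Nagios “DATANODE::Process down” alert.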

    #13227

    tedr
    Member

    Hi Jeff,

    How did the re-install of the cluster go?

    Thanks,
    Ted.

    #13209

    Dear Larry:

    Thanks! I’ll install a new cluster.

    Jeff

    #13195

    Larry Liu
    Moderator

    Hi, Jeff

    I think it is a good idea to reinstall the HMC to have a clean installation.

    Larry

    #13191

    Hi Larry:

    Thanks for your help.
    I think your info will be very helpful if I meet the same problem in the future,
    but unfortunately I ran into the following problem while trying to fix my HMC.

    [root@host011 /]# sqlite3 /var/db/hmc/data/data.db
    SQLite version 3.6.20
    Enter ".help" for instructions
    Enter SQL statements terminated with a ";"
    sqlite> update hostroles set state=’STOPPED’ where state=’STOPPING’;
    Error: no such column: ’STOPPED’
    sqlite> update hostroles set state=’STOPPED’ where state=’STARTED’;
    Error: no such column: ’STOPPED’
    sqlite> update hostroles set state=’STOPPED’ where state=’FAILED’;
    Error: no such column: ’STOPPED’
    sqlite> update serviceinfo set state=’STOPPED’ where state=’STOPPING’;
    Error: no such column: ’STOPPED’
    sqlite> update serviceinfo set state=’STOPPED’ where state=’STARTED’;
    Error: no such column: ’STOPPED’
    sqlite> update serviceinfo set state=’STOPPED’ where state=’FAILED’;
    Error: no such column: ’STOPPED’
    sqlite> update servicecomponentinfo set state=’STOPPED’ where state=’STOPPING’;
    Error: no such column: ’STOPPED’
    sqlite> update servicecomponentinfo set state=’STOPPED’ where state=’STARTED’;
    Error: no such column: ’STOPPED’
    sqlite> update servicecomponentinfo set state=’STOPPED’ where state=’FAILED’;
    Error: no such column: ’STOPPED’

    It seems that the original database /var/db/hmc/data/data.db was removed yesterday, because I ran the following steps then:
    # yum remove hmc
    # yum install hmc
    # service hmc start

    Now what else should I do?

    Is reinstalling a new cluster the only way?
    ===========================================
    database info
    ============================================
    [root@host011 /]# sqlite3 /var/db/hmc/data/data.db ".table"
    Clusters                      ServiceComponents
    ConfigHistory                 ServiceConfig
    ConfigProperties              ServiceDependencies
    HostRoleConfig                ServiceInfo
    HostRoles                     Services
    Hosts                         SubTransactionStatus
    ServiceComponentDependencies  TransactionStatus
    ServiceComponentInfo
    [root@host011 /]# sqlite3 /var/db/hmc/data/data.db "PRAGMA table_info(Hostroles)"
    0|role_id|INTEGER|0||1
    1|cluster_name|TEXT|0||0
    2|host_name|TEXT|0||0
    3|component_name|TEXT|0||0
    4|state|TEXT|0||0
    5|desired_state|TEXT|0||0

    Regards,
    Jeff
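
A note on the errors in this post: the PRAGMA output shows that HostRoles does have a state column, so “no such column: ’STOPPED’” complains about the value, not the table. One plausible explanation (an assumption, not confirmed anywhere in this thread) is that the statements were pasted with typographic quotes (’), which SQLite reads as part of an identifier rather than as string delimiters. The Python sketch below tries both quote styles on a hypothetical in-memory table whose columns follow the PRAGMA output; the sample row is invented for the demo.

```python
import sqlite3

# Hypothetical in-memory stand-in for HostRoles; columns follow the
# PRAGMA table_info output quoted in the post, the row is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HostRoles (role_id INTEGER PRIMARY KEY, cluster_name TEXT, "
    "host_name TEXT, component_name TEXT, state TEXT, desired_state TEXT)"
)
conn.execute(
    "INSERT INTO HostRoles (cluster_name, host_name, component_name, state, desired_state) "
    "VALUES ('demo', 'host001', 'DATANODE', 'STARTED', 'STARTED')"
)


def try_update(sql):
    """Run one statement and return 'ok' or the SQLite error text."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.OperationalError as exc:
        return str(exc)


# Typographic quotes (U+2019) around the values, as in the pasted commands.
curly_result = try_update(
    "UPDATE hostroles SET state=\u2019STOPPED\u2019 WHERE state=\u2019STARTED\u2019"
)

# Plain ASCII apostrophes: ordinary SQL string literals.
straight_result = try_update(
    "UPDATE hostroles SET state='STOPPED' WHERE state='STARTED'"
)

print("curly quotes:   ", curly_result)
print("straight quotes:", straight_result)
```

If the curly-quote statement fails while the straight-quote one succeeds, retyping the quotes by hand at the sqlite prompt is worth trying before concluding that the database is unusable.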

    #13187

    Larry Liu
    Moderator

    Hi, Jeff

    Thanks for trying HMC.

    Please try the following workaround:

    On the box where HMC is installed, run:
    sqlite3 /var/db/hmc/data/data.db

    From the sqlite prompt, issue the following commands:
    update hostroles set state='STOPPED' where state='STOPPING';
    update hostroles set state='STOPPED' where state='STARTED';
    update hostroles set state='STOPPED' where state='FAILED';
    update serviceinfo set state='STOPPED' where state='STOPPING';
    update serviceinfo set state='STOPPED' where state='STARTED';
    update serviceinfo set state='STOPPED' where state='FAILED';
    update servicecomponentinfo set state='STOPPED' where state='STOPPING';
    update servicecomponentinfo set state='STOPPED' where state='STARTED';
    update servicecomponentinfo set state='STOPPED' where state='FAILED';

    Hope this helps.

    Thanks

    Larry
