HDFS: recovering from failed service start

This topic contains 9 replies, has 4 voices, and was last updated by Robert 1 year, 2 months ago.

  • Creator
    Topic
  • #14316

    Erik Nor
    Member

    I have cleaned up the sqlite data and shut down everything manually, so the database and the actual status of the services are in sync. But when I try to start the services, they start successfully, yet puppet never gets a response from the client saying so and eventually times out. Comparing this environment to a working one, I noticed it has half as many httpd processes running as the puppet user and is missing the “Rack: /etc/puppet/rack” process, also owned by puppet. I cannot find any documentation on that process or how to start it. Is there some service/process still not running that is preventing the puppet clients from returning their results?
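
    A quick way to capture what is running for that comparison (assuming the puppet master is served through httpd, as the process list above suggests):

    ps aux | grep -E '[h]ttpd|[R]ack'   # compare worker counts between the two clusters
    service httpd status                # confirm Apache itself is running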

    Thanks
    Erik

  • Author
    Replies
  • #14651

    Robert
    Participant

    Hi Erik,
    Thanks for the feedback. This should help others who might be facing the same issue.

    Regards,
    Robert

    #14647

    Erik Nor
    Member

    Finally found the root of the problem. When the cluster was installed and set up, the master’s hostname was nyr4080101. At some point an admin changed the /etc/resolv.conf file in a way that tacked a .corp.company.com suffix onto the hostname whenever puppet resolved it. Reverting resolv.conf and cleaning the puppet ssl certs eventually got all the nodes talking to each other again.
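
    For anyone hitting the same thing, the recovery boils down to making the resolved name match the certificate name again. A rough sketch, assuming the default agent ssldir under /var/lib/puppet; the cert command name varies by puppet version (puppetca on 2.x, puppet cert on later releases), and <agent-hostname> is a placeholder:

    grep -E 'search|domain' /etc/resolv.conf   # confirm the extra suffix is gone
    hostname -f                                # should match the cert name again

    # on the master: remove the stale cert for the affected node
    puppetca --clean <agent-hostname>

    # on each affected agent: wipe the local certs so they are reissued on the next run
    rm -rf /var/lib/puppet/ssl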

    #14503

    tedr
    Member

    Hi Erik,

    Could you post a bit more of the puppet logs?
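
    (For gathering those: the log locations are install-dependent, so something like the following can track down the puppet_apply.log file mentioned further down the thread, plus anything puppet wrote to syslog:)

    find / -name 'puppet_apply.log' 2>/dev/null    # locate the agent log
    tail -n 200 /var/log/messages | grep -i puppet # puppet messages in syslog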

    Thanks,
    Ted.

    #14433

    Erik Nor
    Member

    Yes, I can ssh in both directions.

    #14432

    tedr
    Member

    Hi Erik,

    Can you ssh to the master from the nodes that are giving you trouble?

    Thanks,
    Ted.

    #14387

    Erik Nor
    Member

    They are disabled on all servers. I did notice that in the puppet_apply.log file it references the master server like this:
    Puppet (debug): Processing report from master01.corp.mynetwork.com with processor Puppet::Reports::Store

    whereas in the working cluster it references it this way:
    Puppet (debug): Processing report from master01 with processor Puppet::Reports::Store

    Both hostname and hostname -f return just master01. The names resolve via the hosts file, and I added the master01.corp.mynetwork.com entry to the hosts file thinking that might be the problem, since it didn’t resolve without the entry. Even with the entry it still doesn’t report its status to the master.
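
    A side-by-side check on a working and a broken node can show where the extra suffix is coming from; a minimal sketch (facter is assumed to be on the PATH, as it normally is on puppet-managed nodes):

    hostname; hostname -f                        # what the OS reports
    getent hosts master01                        # what the resolver returns
    grep -E 'search|domain' /etc/resolv.conf     # any suffix being appended
    facter fqdn                                  # the name puppet itself will use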

    #14386

    tedr
    Member

    Hi Erik,

    Check to make sure that the firewall and SELinux are still disabled.
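
    (On the RHEL/CentOS releases HMC targets, a quick way to verify both, assuming iptables-based firewalling:)

    service iptables status   # should report the firewall is not running
    getenforce                # should print Disabled or Permissive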

    Thanks,
    Ted.

    #14328

    Erik Nor
    Member

    I followed those steps. When starting HDFS from the UI, puppet eventually times out starting the NameNode and the UI marks it as failed. Checking the server, I see that the NameNode actually started successfully, and it looks like puppet tried to report that in its agent logs.
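
    Confirming the NameNode really is up, independent of what the UI reports, looks roughly like this (the port check assumes the default NameNode web UI port):

    jps | grep -i namenode                  # the NameNode JVM should be listed
    netstat -tlnp | grep 50070              # default NameNode web UI port
    su - hdfs -c 'hadoop dfsadmin -report'  # health report from HDFS itself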

    #14321

    Sasha J
    Moderator

    Erik,
    In order to run HMC correctly you should have the hmc service started on the HMC node and the hmc-agent service started on all nodes in the cluster.
    Make sure they are started prior to starting the Hadoop services.
    I believe the best approach for now will be:
    1. stop the cluster through the UI
    2. restart all the boxes
    3. start hmc and hmc-agent on the nodes (see the sketch after this list)
    4. start the Hadoop services using the UI
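
    A minimal sketch of step 3, using the init scripts the HMC packages install:

    service hmc start         # on the HMC/master node
    service hmc-agent start   # on every node in the cluster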

    Thank you!
    Sasha
