Can't restart cluster – ambari not proving useful

This topic contains 5 replies, has 3 voices, and was last updated by Alejandro Fernandez 3 months, 3 weeks ago.

  • Creator
    Topic
  • #58050

    Brian Greeson
    Participant

    Hey guys, I’m running an HDP 2.1 cluster on a set of machines running CentOS 6.5.

    I successfully installed the cluster (a few times now), but I consistently run into issues when restarting the machines that make up the cluster. I’d like to resolve the issue and get my cluster running again, as well as figure out a procedure for avoiding it in the future.

    First, I ran Stop All from the Ambari interface. Then I shut down all the systems, including the one running ambari-server. Is there anything wrong with this procedure? Should I run ambari-server stop on the ambari-server machine before shutting it down?
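    A minimal sketch of the shutdown sequence described above, with the optional ambari-server stop step included. Nothing in this thread settles whether that step prevents the problem; the sketch only makes the order explicit. It assumes Stop All has already been run from the Ambari UI, passwordless root SSH to every node, and host names passed as arguments (all hypothetical here). Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of a clean shutdown for a small Ambari-managed cluster.
# Assumptions (hypothetical): "Stop All" was already run from the Ambari UI,
# passwordless root SSH works for every node, and hosts are given as arguments.
# DRY_RUN=1 prints each command instead of executing it.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

clean_shutdown() {
    # $1 is the ambari-server host; the remaining arguments are agent hosts.
    server=$1
    shift
    for host in "$@"; do
        # The server machine is handled last, below.
        [ "$host" = "$server" ] && continue
        # Stop the Ambari agent cleanly, then power the node off.
        run ssh "root@$host" "ambari-agent stop; shutdown -h now"
    done
    # On the server machine: stop its agent (the ';' keeps going if none is
    # installed), then run ambari-server stop before powering it off.
    run ssh "root@$server" "ambari-agent stop; ambari-server stop && shutdown -h now"
}
```

    For example, DRY_RUN=1 clean_shutdown ambari-host node1 node2 lists the three ssh commands in the order they would run.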

    Anyhow, I powered on all the systems and waited to make sure they were all up. Then I opened the web interface on the ambari-server machine, logged in, and ran Start All. It consistently fails, usually after about 11 or 12 seconds.

    The odd thing is that when you look at the individual services on the nodes, a failed service sometimes has a red circle with an exclamation point and sometimes a yellow bar. It’s not really clear what either of those means, since there is no key, but I assume red is a complete failure.

    When I click on any of the failed services, I get messages saying that the service failed to install, although it was already installed and I never asked to install it again. I only wanted to start my cluster.

    I would attach a screenshot of one of the task lists for a node, but I don’t see any option to do so.

    If it would help, I can paste the stdout and stderr for one of the red (failed) services on a particular node.

    What does the yellow line mean, as compared to the red circle?

    Thanks,
    -Brian

Viewing 5 replies - 1 through 5 (of 5 total)


  • Author
    Replies
  • #58141

    Alejandro Fernandez
    Participant

    Hi Brian, could you reply with how much RAM is allocated to your agent and what the memory block size is?
    I found some related issues on Red Hat Bugzilla that may point to the RAM being less than 1 GB and/or the memory block size being less than 4 KB.

    https://bugzilla.redhat.com/show_bug.cgi?id=680508

    https://bugzilla.redhat.com/show_bug.cgi?id=923201

    https://bugzilla.redhat.com/show_bug.cgi?id=1033013
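    A sketch for gathering the two figures asked for, to be run on each agent host. It assumes "memory block size" means the kernel page size, and the 1 GB / 4 KB thresholds are simply the ones mentioned in the post above; neither check comes from Ambari itself.

```shell
#!/bin/sh
# Report total RAM and page size on a Linux host.
# Assumption: "memory block size" is read as the kernel page size.

memory_report() {
    # MemTotal in /proc/meminfo is given in kB; convert to whole MB.
    ram_mb=$(awk '/^MemTotal:/ { printf "%d", $2 / 1024 }' /proc/meminfo)
    page_size=$(getconf PAGESIZE)
    echo "RAM_MB=$ram_mb"
    echo "PAGE_SIZE_BYTES=$page_size"
    # Flag the thresholds mentioned in the Bugzilla discussion above.
    [ "$ram_mb" -lt 1024 ] && echo "WARNING: less than 1 GB of RAM"
    [ "$page_size" -lt 4096 ] && echo "WARNING: page size below 4 KB"
    return 0
}
```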

    #58106

    Brian Greeson
    Participant

    I shut down the cluster again via Stop All and then powered off the machines. When I tried to start the cluster again, I ran into the same issues with the yum databases. Any clue?

    #58093

    Brian Greeson
    Participant

    Hi Jeff,

    I was able to successfully restart the cluster after resolving an issue with the yum database that had occurred on all master and slave nodes.

    The error messages were something like "failed to install service X".
    I’m assuming what happened is this:
    1. Ambari used yum to check whether the packages existed.
    2. Yum is broken.
    3. Since yum is broken, Ambari assumes the packages are missing.
    4. Ambari attempts to use yum (which is broken) to install the missing packages, and that fails.

    Doing the following on all affected nodes fixed my yum issues:
    # rm -f /var/lib/rpm/__db*
    # rpm --rebuilddb
    # yum update
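    The repair above has to be repeated on every affected node. A sketch that loops over the hosts, assuming passwordless root SSH and host names given as arguments (hypothetical); DRY_RUN=1 only prints what would run. ssh -t keeps yum update interactive, so it still asks before upgrading anything, which is worth reviewing since yum update upgrades every package that has an update available.

```shell
#!/bin/sh
# Run the rpm/yum database repair from the post above on several nodes.
# Assumptions (hypothetical): passwordless root SSH; hosts given as arguments.
# The __db* glob is expanded by the remote shell, not locally.

REPAIR_CMD='rm -f /var/lib/rpm/__db* && rpm --rebuilddb && yum update'

repair_rpmdb_on_hosts() {
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$host: $REPAIR_CMD"
        else
            echo "== $host"
            # -t allocates a terminal so yum can prompt for confirmation.
            ssh -t "root@$host" "$REPAIR_CMD" || echo "repair failed on $host" >&2
        fi
    done
}
```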

    However, the question remains: what caused this? All I did was install the cluster, start it, and stop it. Then I shut down the nodes and powered them back up, and that’s when the issue appeared. So it seems to me that Ambari must be the cause.

    Any thoughts?

    Thanks again,
    -Brian

    #58092

    Brian Greeson
    Participant

    Hi Jeff,

    Thanks for the response. I will report back with that information. First, though, I’ve noticed that the yum database seems to be corrupted on the nodes. Any idea what could have caused this? I suspect it could be causing the errors, so I want to fix that first, and then I’ll let you know if I still have issues.

    #58053

    Jeff Sposetti
    Moderator

    Please post the stdout and stderr from the Start All. Just pick a host and a component task where you see a red exclamation failure.
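    The agent also keeps its own copy of each task's output, which can be easier to copy than the UI dialog. A sketch, assuming the agent stores them as output-<task id>.txt and errors-<task id>.txt under /var/lib/ambari-agent/data (check the path on your version); it prints the newest pair of files.

```shell
#!/bin/sh
# Print the paths of the most recent task stdout/stderr files on an agent host.
# Assumption: files are named output-<id>.txt / errors-<id>.txt in the data dir;
# pass another directory as $1 if your install differs.

latest_task_logs() {
    dir=${1:-/var/lib/ambari-agent/data}
    # Newest errors-*.txt by modification time; empty if none exist.
    latest=$(ls -t "$dir"/errors-*.txt 2>/dev/null | head -n 1)
    if [ -z "$latest" ]; then
        echo "no task logs found in $dir" >&2
        return 1
    fi
    # Strip the "errors-" prefix and ".txt" suffix to get the task id.
    id=$(basename "$latest" .txt)
    id=${id#errors-}
    echo "$dir/output-$id.txt"
    echo "$latest"
}
```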

    Red exclamation means the task failed.

    A yellow bar means the task was cancelled. If a master component start task fails (red exclamation), Ambari cancels the remaining tasks, so it doesn't bother attempting tasks that depend on the master component that failed.

    If you try to start services individually (and not Start All), how does that work? Start with HDFS > Start, and so on.
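    The same one-service-at-a-time start can also be issued through the Ambari REST API. A sketch assuming the server listens on port 8080, admin/admin credentials, and a cluster name you supply (all placeholders); DRY_RUN=1 prints the request instead of sending it.

```shell
#!/bin/sh
# Start a single service via the Ambari REST API.
# Assumptions (placeholders): port 8080, credentials from AMBARI_USER/AMBARI_PASS
# (default admin/admin), server from AMBARI_SERVER (default localhost).

start_service() {
    service=$1      # e.g. HDFS, YARN, ZOOKEEPER
    cluster=$2
    server=${AMBARI_SERVER:-localhost}
    body='{"RequestInfo":{"context":"Start '"$service"'"},"Body":{"ServiceInfo":{"state":"STARTED"}}}'
    url="http://$server:8080/api/v1/clusters/$cluster/services/$service"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "PUT $url $body"
        return 0
    fi
    # Setting state to STARTED asks Ambari to start every component of the service.
    curl -u "${AMBARI_USER:-admin}:${AMBARI_PASS:-admin}" \
        -H 'X-Requested-By: ambari' \
        -X PUT -d "$body" "$url"
}
```

    For example, AMBARI_SERVER=ambari-host start_service HDFS mycluster starts HDFS alone; going service by service makes it clearer which master component fails first.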
