HDP on Linux – Installation: nagios alerts for cpu utilization

This topic contains 16 replies, has 3 voices, and was last updated by tedr 1 year, 5 months ago.

  • Creator
    Topic
  • #29102

    Jason Morse
    Participant

    We just installed a 16 node cluster and are getting nagios alerts for cpu utilization on the hbase master, jobtracker and namenode. The alerts are saying nagios cannot get a reading. What would cause this problem? Also the regionserver process will not start on one of the hosts. It says connection refused.

Viewing 16 replies - 1 through 16 (of 16 total)


  • Author
    Replies
  • #29609

    tedr
    Moderator

    Hi Jason,

    Thanks for letting us know that this was due to a configuration file being overwritten.

    Thanks,
    Ted.

    #29606

    Jason Morse
    Participant

    Looks like our puppet scripts removed entries from the snmp.conf file that Ambari wrote in. After adding the entry recommended by Paul the alerts cleared. Thanks for the help!
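The kind of drift described here, where another tool removes lines that Ambari wrote into a config file, can be caught with a small check. This is a minimal sketch, not something Ambari or Puppet provides: `missing_entries` is a hypothetical helper, and because the entry Paul recommended is not quoted in this thread, the directive strings below are placeholders only.

```python
def missing_entries(config_text, required_entries):
    """Return the required entries that do not appear as active lines in config_text."""
    active_lines = set()
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments; only active directives count.
        if line and not line.startswith("#"):
            active_lines.add(line)
    return [entry for entry in required_entries if entry.strip() not in active_lines]


# Placeholder content: these are NOT real SNMP directives from the thread.
sample_config = """
# managed by configuration tooling
placeholder_directive_a value1
"""
required = ["placeholder_directive_a value1", "placeholder_directive_b value2"]

missing = missing_entries(sample_config, required)
print(missing)
```

In practice the text would be read from the real file on the affected host, and the required list would hold whatever lines the monitoring setup actually needs.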

    #29419

    tedr
    Moderator

    Hi Jason,

    Are these nagios notifications all CPU utilization or are they other things now? The reason I ask is that the cpu utilization bit looks like it may be a bug, as we are now getting other reports like this.

    Thanks,
    Ted.

    #29246

    Jason Morse
    Participant

    Ok the alert just went away for the HBase Service for about 20 mins and now just came back. It did not go away for the NameNode or JobTracker which are on the same host.

    #29245

    Jason Morse
    Participant

    Jumped up a little just now.

    $ uptime
    15:27:10 up 1 day, 1:03, 1 user, load average: 0.12, 0.05, 0.01

    #29244

    Jason Morse
    Participant

    From one of the other boxes this is the uptime command.

    $ uptime
    15:22:20 up 10 days, 22:04, 1 user, load average: 0.29, 0.23, 0.23

    The top line of the top command is shown below. When I run uptime right after, it reflects these same numbers. It must have just happened to be at 0 when I ran it last time. It does seem to be much lower than the other host though. The highest number I have seen is 0.07, and it is usually at 0.03 like in the output below.

    top - 15:24:30 up 1 day, 1:00, 1 user, load average: 0.03, 0.03, 0.00

    #29242

    tedr
    Moderator

    Hi Jason,

    I was aware that the uptime report was only one line. I wanted to see the numbers after the ‘load average’ to make sure that the system didn’t think that it was overloaded. From the look of what you pasted below this is definitely not the case, as it looks like it has no load at all. Though, thinking about it, maybe the numbers we see are the very reason Nagios is complaining: it sees only zeros, and with 24 CPUs per box it should have a number of some kind there. Do you get the same load average numbers in the first line of a ‘top’ command? Also, do you get the same load averages on other boxes in your cluster, or do they show a non-zero number?

    Thanks,
    Ted.

    #29240

    Jason Morse
    Participant

    Uptime is just one line. In Ambari, though, it is showing 24 CPUs per host.

    $ uptime
    14:20:48 up 23:57, 1 user, load average: 0.00, 0.00, 0.00

    #29236

    tedr
    Moderator

    Hi Jason,

    I wanted to know the tail end of the uptime report, the bit where it shows the cpu utilization as the system sees it. Basically if these numbers divided by the number of cores are larger than 1 then the system itself thinks it is overloaded.

    Thanks,
    Ted.
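The rule of thumb above (load average divided by core count, with values over 1 meaning the system considers itself overloaded) can be sketched as follows. This is a minimal illustration rather than anything Nagios or Ambari runs: the helper names `parse_load_averages` and `load_per_core` are made up for this example, and the sample line and the count of 24 CPUs are the figures quoted elsewhere in this thread.

```python
import re


def parse_load_averages(uptime_line):
    """Return the 1-, 5-, and 15-minute load averages from an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def load_per_core(uptime_line, cores):
    """Divide each load average by the core count."""
    return tuple(load / cores for load in parse_load_averages(uptime_line))


# Sample line taken from another host in this thread; 24 is the CPU count Ambari reports.
line = "15:22:20 up 10 days, 22:04, 1 user, load average: 0.29, 0.23, 0.23"
per_core = load_per_core(line, cores=24)
overloaded = any(value > 1.0 for value in per_core)
print(per_core, overloaded)
```

With numbers this small the per-core load is far below 1, so by this rule the host is nowhere near overloaded.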

    #29227

    Jason Morse
    Participant

    uptime is 23 hours since we rebooted the box yesterday. Box has 2 physical 6 core cpu’s.

    #29223

    tedr
    Moderator

    Hi Jason,

    On the box that is having the errors what is the output of ‘uptime’ and how many cpu’s does that box have?

    Thanks,
    Ted.

    #29161

    Jason Morse
    Participant

    It is only on one of the master nodes. This node is running the NameNode and JobTracker. We tried rebooting the host but it still is throwing the error.

    #29158

    tedr
    Moderator

    Hi Jason,

    Are you getting these alerts on all of the nodes? You can speed up the rechecking of the tests that resulted in the error by going to the Nagios homepage of your cluster, selecting ‘services’, then selecting one of the failing tests and choosing “reschedule the next test” from the right menu.

    Thanks,
    Ted.

    #29147

    Jason Morse
    Participant

    We resolved the issue with the regionserver process by rebooting the host. We are still having the cpu utilization notification issue from nagios though. I will continue to research.

    #29135

    Jason Morse
    Participant

    Thanks for your reply. I think the regionserver problem is due to a port issue and am looking into it. As for the cpu utilization problem, the only thing I’ve been able to find is that snmp might not be running, but it is running on the server. Can you think of anything that would cause nagios to not get a cpu reading? The error has persisted since yesterday.

    #29105

    Sasha J
    Moderator

    Check for firewalls running on the nodes.
    Also, Nagios has to run for some time until it has executed all of its checks.
    Checks run sequentially and on a schedule (like 1 minute and 5 minutes).
    So, for a 16-node cluster you have to wait about an hour for all checks to be executed and reported correctly.
    On first start Nagios may generate a bunch of “false” alarms.

    Thank you!
    Sasha
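One way to tell a firewall problem apart from a service that simply is not listening is to probe the TCP port from another node. The sketch below is an assumption-laden illustration: `probe_port` is a hypothetical helper, and the demo uses a throwaway local listener instead of a real cluster host or port. A "refused" result usually means nothing is listening on the port (or a firewall is actively rejecting), while a "timeout" often points to packets being silently dropped.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        # Covers DNS failures, unreachable networks, and similar errors.
        return "error"


# Demo against a temporary local listener so the example is self-contained.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

while_listening = probe_port("127.0.0.1", demo_port)
listener.close()
after_close = probe_port("127.0.0.1", demo_port)
print(while_listening, after_close)
```

On a real cluster the host and port would be those of the service that reports connection refused. Note that this checks TCP only, so it does not test UDP-based SNMP traffic.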
