Home Forums HDP on Windows – Non Installation issues Just who is 169.254.80.80 ?

This topic contains 6 replies, has 3 voices, and was last updated by Dave 7 months ago.

  • Creator
    Topic
  • #50838

    Toby Evans
    Participant

    Hi there

    I’ve had HDP for Windows running for about a year now, building it up into a mini-cluster. We’ve got loads of things running, and it’s great.

    The upgrade to HDP 2 has brought us a new pleasure – every so often, a job will fail because a data node is unable to reach 169.254.80.80.

    This isn’t consistent: the same nodes can run the same jobs for hours and everything is fine, then, about 5–10% of the time, a job will fail because it can’t reach 169.254.80.80. That IP address falls within the “link-local address” range – http://en.wikipedia.org/wiki/Link-local_address
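    For anyone wanting to verify that range quickly, Python’s standard ipaddress module can classify an address – a small sanity check, nothing to do with Hadoop itself (addresses below are the ones from this thread):

    ```python
    import ipaddress

    # 169.254.0.0/16 is the link-local (APIPA) block, per RFC 3927
    print(ipaddress.ip_address("169.254.80.80").is_link_local)    # True
    print(ipaddress.ip_address("10.51.130.248").is_link_local)    # False
    ```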

    It’s not one machine that does it – when it happens, it happens to all the nodes trying to run the task. It’s not one particular job either; it can be any.

    It’s very odd. I’m considering using an alias for the namenode and defining that via core-site.xml and the hosts file, but that would mean manually updating loads of machines, and I’d rather do anything than that, unless I had to and I knew for certain it would work. Any ideas?
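    For reference, that alias approach would look something like this – a sketch only, where `namenode-alias` and the 10.51.130.1 address are made up (fs.defaultFS is the real HDP 2 key, and 8020 is the default HDFS RPC port):

    ```xml
    <!-- core-site.xml: point clients at a stable alias instead of a raw IP -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://namenode-alias:8020</value>
    </property>

    <!-- and in C:\Windows\System32\drivers\etc\hosts on every node
         (the manual step Toby wants to avoid):
         10.51.130.1   namenode-alias -->
    ```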

    when it goes wrong:

    2014-04-01 11:25:37,242 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
    2014-04-01 11:25:37,257 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
    2014-04-01 11:25:37,257 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: fs.defaultFS;  Ignoring.
    2014-04-01 11:25:37,710 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
    2014-04-01 11:25:37,803 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
    2014-04-01 11:25:37,803 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
    2014-04-01 11:25:37,819 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
    2014-04-01 11:25:37,819 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1395281191997_0547, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@7cf01771)
    2014-04-01 11:25:37,928 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
    2014-04-01 11:25:44,901 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 169.254.80.80/169.254.80.80:53117. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1 SECONDS)
    
Viewing 6 replies - 1 through 6 (of 6 total)

The topic ‘Just who is 169.254.80.80 ?’ is closed to new replies.

  • Author
    Replies
  • #51115

    Dave
    Moderator

    Hi Toby,

    It’s probably your DHCP server having an issue – which would explain why all the nodes show the problem at the same time.
    Without really looking into your configuration and network it’s hard to say.

    If you use static IPs then you won’t see this issue – static IPs are also a best practice for Hadoop.
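    On Windows that would look roughly like this – the interface name, addresses, gateway and DNS server below are placeholders for your own network, and the commands need an elevated prompt:

    ```
    netsh interface ipv4 set address name="Local Area Connection" ^
        static 10.51.130.248 255.255.255.0 10.51.130.1
    netsh interface ipv4 set dnsservers name="Local Area Connection" ^
        static 10.51.130.2 primary
    ```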

    Thanks

    Dave

    #51061

    Toby Evans
    Participant

    Hi Dave,

    That makes sense, and I’m going to get some static IP addresses. But (Columbo style), just one thing: how can I be reading the live log from a datanode that’s reporting a network failure? Here’s the exception:

    java.net.ConnectException: Call From YEPS56563/10.51.130.248 to 169.254.80.80:64240 failed on connection exception: java.net.ConnectException: Connection refused:

    So the datanode, yeps56563, still has its IP address – it’s the target, the namenode, that it seems to have “forgotten”. But only temporarily. And all the datanodes do it at the same time. Run the exact same job 30 seconds later, and they can all run fine again. All the while, you can access all the logs about this network failure via the namenode/YARN console on 8088.

    I’ll get onto my network guys

    #50861

    Dave
    Moderator

    Hi Toby,

    A 169 address means the computer isn’t connected to a network.
    169.254.0.0/16 is the “link local” block. As described in RFC 3927, it is allocated for communication between hosts on a single link. Hosts obtain these addresses by auto-configuration, such as when a DHCP server cannot be found.

    You should check your network configuration and hosts files, and ensure each node can talk to every other node via its hostname.
    The machines should also have static IP addresses rather than DHCP (which it looks like you are using, given that you are getting 169 addresses).
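    A small script along these lines could be run on each node to spot that failure mode – it simply resolves a hostname and flags any link-local result (the node list is a placeholder you’d replace with your own):

    ```python
    import ipaddress
    import socket

    def flag_link_local(hostname):
        """Resolve hostname and return any IPv4 addresses in 169.254.0.0/16,
        i.e. APIPA fallback addresses handed out when DHCP is unreachable."""
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
        addrs = {info[4][0] for info in infos}
        return sorted(a for a in addrs if ipaddress.ip_address(a).is_link_local)

    # run against every node in the cluster:
    for host in ["localhost"]:          # replace with your node list
        bad = flag_link_local(host)
        if bad:
            print(f"{host} resolves to link-local {bad} - DHCP lease lost?")
    ```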

    Thanks

    Dave

    #50845

    Toby Evans
    Participant

    Do you mean doing everything via static IP addresses, rather than named machines?

    #50844

    Robert Molina
    Moderator

    Hi Toby,
    Have you checked your network to see if there are any issues? Are there any errors on your NICs? Are you not using DNS at the moment?

    Regards,
    Robert

    #50840

    Toby Evans
    Participant

    Here’s the stack trace of the exception:

    2014-04-01 11:27:25,943 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From YEPS72102/10.51.130.127 to 169.254.80.80:53117 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
        at org.apache.hadoop.ipc.Client.call(Client.java:1351)
        at org.apache.hadoop.ipc.Client.call(Client.java:1300)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
        at $Proxy6.getTask(Unknown Source)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:133)
    Caused by: java.net.ConnectException: Connection refused: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:547)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
        at org.apache.hadoop.ipc.Client$Connection.access$2600(Client.java:314)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1399)
        at org.apache.hadoop.ipc.Client.call(Client.java:1318)
        ... 4 more
    