I’ve had HDP for Windows running for about a year now, building it up into a mini-cluster. We’ve got loads of things running, and it’s great.
However, the upgrade to HDP2 has brought us a new pleasure: every so often, a job will fail because the data node is unable to reach 169.254.80.80.
This isn’t consistent. The same nodes can run the same jobs for hours without a problem. Then, about 5-10% of the time, a job will fail because it can’t reach 169.254.80.80. This IP address falls in the “link-local address” range (169.254.0.0/16): http://en.wikipedia.org/wiki/Link-local_address
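For anyone wanting to check this on their own nodes, here is a minimal sketch in Java (the class name LinkLocalCheck and the helper isLinkLocal are made up for illustration, they are not part of Hadoop or HDP). It confirms that 169.254.80.80 is link-local and lists any local network interface that currently holds an IPv4 link-local address. That might help spot an adapter that has fallen back to a self-assigned address, though I can't say whether that is the cause here.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.Enumeration;

// Hypothetical diagnostic helper, not part of Hadoop.
public class LinkLocalCheck {

    // True if the given IP literal is in a link-local range (169.254.0.0/16 for IPv4).
    static boolean isLinkLocal(String ipLiteral) {
        try {
            // For an IP literal, getByName only parses the text; no DNS lookup is made.
            return InetAddress.getByName(ipLiteral).isLinkLocalAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not a valid address: " + ipLiteral, e);
        }
    }

    public static void main(String[] args) throws SocketException {
        System.out.println("169.254.80.80 link-local? " + isLinkLocal("169.254.80.80"));

        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return; // no interfaces reported on this machine
        }
        // Print every local interface that currently holds an IPv4 link-local address.
        for (NetworkInterface nic : Collections.list(nics)) {
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                if (addr instanceof Inet4Address && addr.isLinkLocalAddress()) {
                    System.out.println(nic.getDisplayName() + " -> " + addr.getHostAddress());
                }
            }
        }
    }
}
```

Running this on each node when a job fails would show whether any adapter on that machine is actually carrying a 169.254.x.x address at that moment.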
It’s not one machine that does it: when it happens, it hits all the nodes trying to run the task. It’s not one particular job either; it can be any.
It’s very odd. I’m considering using an alias for the namenode and defining that via core-site.xml and the hosts file, but that would mean manually updating loads of machines, and I’d rather avoid that unless I had to and knew for certain it would work. Any ideas?
Here is the task log from when it goes wrong:
2014-04-01 11:25:37,242 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2014-04-01 11:25:37,257 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2014-04-01 11:25:37,257 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: fs.defaultFS; Ignoring.
2014-04-01 11:25:37,710 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-04-01 11:25:37,803 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2014-04-01 11:25:37,803 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2014-04-01 11:25:37,819 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2014-04-01 11:25:37,819 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1395281191997_0547, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@7cf01771)
2014-04-01 11:25:37,928 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2014-04-01 11:25:44,901 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 169.254.80.80/169.254.80.80:53117. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1 SECONDS)