HDP 1.3 installation from RPMs

This topic contains 6 replies, has 3 voices, and was last updated by Seth Lyubich 8 months ago.

  • Creator
    Topic
  • #30990

    Tunde Balint
    Member

    I’ve installed HDP 1.3 on a RHEL cluster:
    - 1 namenode – 192.168.100.131
    - 1 secondary namenode + jobtracker – 192.168.100.132
    - 4 datanodes/tasktrackers – 192.168.100.133-136
    I’ve disabled all the firewalls, set ulimit to 32768 for the mapred/hdfs/hadoop users, and set dfs.datanode.max.xcievers to 4096. Copying a file to HDFS or retrieving one from it works, and the file is replicated (I’ve checked the namenode log and fsck). But when I try to run a simple MR job (hadoop jar /usr/lib/hadoop/hadoop-examples.jar sleep -m 1 -r 1) I get a lot of warnings:

    13/08/06 18:15:01 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_4230015220173267878_1043
    java.io.IOException: Bad response 1 for block blk_4230015220173267878_1043 from datanode 192.168.100.134:50010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3379)

    Sometimes the job starts and finishes after a fairly long time, and sometimes it doesn’t even start, giving me the following error: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read
    In the namenode log I see:

    2013-08-06 18:15:06,243 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: PendingReplicationMonitor timed out block blk_1864448312313307039_1029
    2013-08-06 18:15:06,243 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: PendingReplicationMonitor timed out block blk_5881133419735879626_1034
    2013-08-06 18:15:09,338 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.100.135:50010 to replicate blk_5881133419735879626_1034 to datanode(s) 192.168.100.134:50010 192.168.100.133:50010

    On the datanodes I see errors like:

    ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.100.135:50010, storageID=DS-1865011095-192.168.100.135-50010-1375800118638, infoPort=50075, ipcPort=8010):DataXceiver
    org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_-7183920363775195741_1024 has already been started (though not completed), and thus cannot be created.

    My machines have multiple Ethernet interfaces, but since HDFS put/get works, I don’t think the network is the problem. When I run hadoop fsck it reports that the filesystem is healthy and shows a few under-replicated blocks in /user/hadoop/.staging/.
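    (The fsck invocation I used was along these lines; / can be narrowed to a specific path:)

    # report filesystem health, plus per-file block and location detail
    hadoop fsck / -files -blocks -locations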
    I have already uninstalled and reinstalled HDP, deleted all the datanode data directories, and reformatted the namenode, but it didn’t help.
    Could anybody tell me what I should check, or what would fix my problem?
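    For reference, the dfs.datanode.max.xcievers setting I mentioned lives in hdfs-site.xml on the datanodes; my entry looks roughly like this:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>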

    Kind regards,
    Tunde Balint


  • Author
    Replies
  • #32875

    Seth Lyubich
    Keymaster

    Hi Tunde,

    Thanks for letting us know that the issue is resolved. We will consider enhancing error logging for such issues.

    Thanks,
    Seth

  • #32702

    Tunde Balint
    Member

    Hi Sasha,
    It turned out to be a network problem: the namenode and jobtracker had MTU set to 1500 and the datanodes had MTU set to 9000. When we set everything to the same value, the error disappeared.
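    In case it helps anyone else, this is roughly how we checked and fixed it (eth0 is just an example interface):

    # show the current MTU of an interface
    ip link show eth0

    # change it on the fly
    ip link set eth0 mtu 9000

    # to make it persistent on RHEL, add this line to
    # /etc/sysconfig/network-scripts/ifcfg-eth0 and restart networking:
    MTU=9000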

    Thanks for your help!
    Tunde

  • #31390

    Tunde Balint
    Member

    Hi Sasha,

    I am not using LDAP, and I gave up on the Ambari server installation, as it was just a test to see whether I could get the cluster working properly. My goal is to install the cluster from RPMs.

    I did make some progress. I played around with the interfaces and noticed that the cluster and the MapReduce jobs work properly if I install everything using eth0. If I try to make the Hadoop traffic use eth1 or eth2, I get the error described initially.

    Do you know if there is something I need to set to force the MapReduce jobs to use a different interface?
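    The only candidates I’ve found so far are the dns.interface properties, though I haven’t verified that they actually steer the MapReduce traffic (eth1 here is just an example):

    <!-- hdfs-site.xml on the datanodes -->
    <property>
      <name>dfs.datanode.dns.interface</name>
      <value>eth1</value>
    </property>

    <!-- mapred-site.xml on the tasktrackers -->
    <property>
      <name>mapred.tasktracker.dns.interface</name>
      <value>eth1</value>
    </property>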

    Best regards,
    Tunde

  • #31255

    Sasha J
    Moderator

    I think there are some problems with your networking configuration…
    Are you using LDAP or similar?
    Could you reset the Ambari server, wipe out its logs, and then start it back up again?
    If the error still exists, please post the whole ambari-server log.
    Also, it would be very useful if you could provide your system configuration details…
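    Something like this (note that reset wipes the Ambari database, so only use it on a test setup):

    # stop the server, reset its database, then start it again
    ambari-server stop
    ambari-server reset
    ambari-server start

    # the log to post is usually /var/log/ambari-server/ambari-server.log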

    Thank you!
    Sasha

  • #31224

    Tunde Balint
    Member

    Hi Sasha,

    I have NTPD running, so I checked, and the time is OK.
    And I cannot disable the rest of the interfaces.

    I thought I would just try the automatic installation with Ambari, and then I ran into another issue: the Ambari server starts, I create an SSH tunnel to the machine, and I get the login screen, but when I try to log in with admin/admin I just get the login screen back.
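    (The tunnel itself is plain local port forwarding; the hostname here is an example:)

    # forward local port 8080 to the Ambari web UI (default port 8080)
    ssh -L 8080:localhost:8080 root@ambari-host.example.com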

    In the log files of the server I get:

    11:54:43,761 INFO AmbariLocalUserDetailsService:62 - Loading user by name: admin
    11:54:44,691 INFO AmbariLocalUserDetailsService:62 - Loading user by name: nagiosadmin
    11:54:44,693 INFO AmbariLocalUserDetailsService:67 - user not found

    Best,
    Tunde

  • #31181

    Sasha J
    Moderator

    Tunde,
    hard to say right away…
    Did you check that time is in sync on all nodes?
    Could you temporarily disable all the extra NICs on your boxes and run your test again?

    It may be related to multiple network interfaces…
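    A quick way to check both on each node (eth1 is just an example):

    # verify NTP peers and offsets (offsets should be close to zero)
    ntpq -p

    # temporarily take an extra interface down for the test
    ifdown eth1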

    Thank you!
    Sasha
