Configuring a node through a socks proxy

This topic contains 7 replies, has 2 voices, and was last updated by Roman S 1 year, 3 months ago.

  • Creator
    Topic
  • #26701

    Roman S
    Member

    Hi,

    I am trying to configure a node to access a Hadoop cluster that is located behind a firewall. The cluster is accessible via SSH, so I have set up a socks proxy and instructed Hadoop to use it. The setup works fine, except that the node is registered to the cluster using the IP of the proxy and not its own (which makes it inaccessible to the cluster).

    To illustrate the situation, here is the output of hdfs dfsadmin -report:

    # This is the node I am trying to connect to the cluster. The host name is correct, but the name and the IP are those of the proxy. The node runs on a different port; otherwise it fails to register altogether.
    Name: 192.168.254.9:51010 (narsil.xx.xx)
    Hostname: lh2-csb-09.xx.xx.xx

    # This is the node that runs the namenode and a datanode and also acts as the proxy. This entry is correct.
    Name: 192.168.254.9:50010 (narsil.xx.xx)
    Hostname: narsil.xx.xx

    Any pointers on how to fix this situation? Many thanks in advance.

Viewing 7 replies - 1 through 7 (of 7 total)

The topic ‘Configuring a node through a socks proxy’ is closed to new replies.

  • Author
    Replies
  • #26791

    Roman S
    Member

    Sasha,
    The idea is that local machines should be able to join the cluster as datanodes in a semi ad-hoc manner. Placing them inside the firewall is not feasible, and opening up ports is not practical either. Since Hadoop has built-in support for SOCKS proxies, I would assume such a setup is feasible. The current behavior of registering the IP of the proxy instead of the node's own IP looks like a bug to me.

    #26771

    Sasha J
    Moderator

    Hmmm…
    It is not entirely clear to me why you want to have a datanode outside the firewall…
    What is the practical use of it?
    Can we speak over the phone about this? Say tomorrow morning (10am PST)?
    I will send you an e-mail with the phone number.

    Thank you!

    #26750

    Roman S
    Member

    Yes, this is just what I am trying to achieve.
    There is a private network hosting a number of virtual machines (192.168.254.xx). The network can be reached via the server narsil. The server has two network interfaces: vlan (192.168.254.xx) and public eth0 (x.x.x.x). The server is behind a firewall and only ssh access is allowed.
    Hadoop is deployed on narsil (namenode, datanode, job tracker) and on the virtual machines (datanode, tasktracker). The goal is to be able to connect to this cluster as a datanode from the outside world. To achieve this I have set up a SOCKS proxy over ssh (ssh -D 4200 narsil) and configured Hadoop to use it.
    core-site.xml on the local datanode:

    fs.defaultFS
    hdfs://192.168.254.9/

    hadoop.socks.server
    localhost:4200

    hadoop.rpc.socket.factory.class.default
    org.apache.hadoop.net.SocksSocketFactory

    hdfs-site.xml

    dfs.datanode.address
    x.x.x.x:51010
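
The XML tags in the configuration above appear to have been stripped when the post was rendered. Below is a minimal sketch of what the two files presumably looked like, assuming the standard Hadoop `<configuration>`/`<property>` layout. The Python wrapper, the constant names and the helper `read_props` are illustrative and not part of the original post; `x.x.x.x` is the poster's own placeholder and is left as-is.

```python
import xml.etree.ElementTree as ET

# Presumed core-site.xml on the local datanode (values taken from the post).
CORE_SITE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.254.9/</value>
  </property>
  <property>
    <name>hadoop.socks.server</name>
    <value>localhost:4200</value>
  </property>
  <property>
    <name>hadoop.rpc.socket.factory.class.default</name>
    <value>org.apache.hadoop.net.SocksSocketFactory</value>
  </property>
</configuration>"""

# Presumed hdfs-site.xml; the address is the placeholder used in the post.
HDFS_SITE = """<configuration>
  <property>
    <name>dfs.datanode.address</name>
    <value>x.x.x.x:51010</value>
  </property>
</configuration>"""


def read_props(xml_text):
    """Return the name -> value pairs of a Hadoop-style site file."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }
```

Calling `read_props(CORE_SITE)` gives back the three properties listed in the post, which is a quick way to confirm the file is well-formed before starting the datanode.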

    With this setup the datanode joins the cluster successfully, but with the IP of narsil. The dfsadmin report is below:

    LOCAL:
    Name: 192.168.254.9:51010 (narsil.xx.xx)
    Hostname: lh2-csb-09.xx.xx.xx

    NARSIL:
    Name: 192.168.254.9:50010 (narsil.xx.xx)
    Hostname: narsil.xx.xx
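
The symptom in the report above can be checked mechanically: a datanode is suspect when the host name it reports differs from the host name the namenode resolved for its registered address. The following is a rough sketch that assumes the report contains only the two-line `Name:`/`Hostname:` pairs quoted in this thread; the function names `parse_report` and `mismatched` are hypothetical helpers, not Hadoop tools.

```python
import re

# Excerpt of the dfsadmin report exactly as quoted in the thread.
REPORT = """\
Name: 192.168.254.9:51010 (narsil.xx.xx)
Hostname: lh2-csb-09.xx.xx.xx

Name: 192.168.254.9:50010 (narsil.xx.xx)
Hostname: narsil.xx.xx
"""

NAME_RE = re.compile(r"^Name:\s*(\S+):(\d+)\s*\((\S+)\)\s*$")
HOST_RE = re.compile(r"^Hostname:\s*(\S+)\s*$")


def parse_report(text):
    """Collect one dict per datanode entry found in the report text."""
    nodes = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        name = NAME_RE.match(line)
        if name:
            current = {
                "ip": name.group(1),
                "port": int(name.group(2)),
                "resolved": name.group(3),  # host name shown in parentheses
                "hostname": None,           # host name reported by the node
            }
            nodes.append(current)
            continue
        host = HOST_RE.match(line)
        if host and current is not None:
            current["hostname"] = host.group(1)
    return nodes


def mismatched(nodes):
    """Entries whose reported host name differs from the resolved one."""
    return [n for n in nodes if n["hostname"] != n["resolved"]]
```

Applied to the excerpt, `mismatched(parse_report(REPORT))` singles out the entry on port 51010, i.e. the local node that was registered under the proxy's address.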

    As a result, the local node is unreachable from the cluster. I could create an SSH tunnel back to the local machine, but that is impractical. Is there any way to get Hadoop to report the actual IP of the local node instead of the gateway's?

    Many thanks in advance.

    #26732

    Sasha J
    Moderator

    The post you are referring to talks about using a “gateway”.
    As far as I understand, you are trying to have one of the datanodes outside the firewall…
    Or am I missing something?
    Anyway, could you give me more details on your setup?

    Sasha

    #26731

    Sasha J
    Moderator

    Roman,
    Let me take a closer look at the case you posted here.
    As a quick workaround, you can add a second interface to the virtual machines, or set up NAT on your hypervisor host.

    Thank you!
    Sasha

    #26722

    Roman S
    Member

    Sasha,
    Opening ports is not an option, as the cluster nodes are in fact virtual machines that do not have public IPs.
    The case I am trying to solve is described here, so it should be possible. Any help is appreciated.

    #26717

    Sasha J
    Moderator

    Roman,
    why do you want to use a proxy?
    Why not just open the needed ports on the firewall so that the datanode can communicate with the namenode?

    Sasha
