Simplifying data management: NFS access to HDFS

We are excited that another critical Enterprise Hadoop integration requirement – NFS Gateway access to HDFS – is making its way into the main Apache Hadoop trunk. This effort was architected and designed by Brandon Li and Suresh Srinivas, and is being delivered by the community. You can track progress in Apache JIRA HDFS-4750.

With NFS access to HDFS, you can mount an HDFS cluster as a volume on client machines and use the native command line, scripts, or a file explorer UI to view HDFS files and load data into HDFS. NFS thus enables file-based applications to read and write directly to Hadoop. This greatly simplifies data management in Hadoop and expands Hadoop's integration with existing toolsets.
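
To make this concrete, here is a minimal sketch of a file-based workflow over such a mount. It assumes the HDFS export is already mounted at /mnt/hdfs (a hypothetical mount point, not a fixed path) and uses nothing but ordinary Python file I/O, so no Hadoop-aware code is involved:

```python
# Plain file I/O against HDFS through a (hypothetical) NFS mount at /mnt/hdfs.
# The script never touches Hadoop APIs; the NFS Gateway does the translation.
import os
import shutil

MOUNT = "/mnt/hdfs"            # assumed mount point of the HDFS NFS export
STAGING = "/tmp/batch.csv"     # local file produced by some non-Hadoop tool

# Load data into HDFS by copying the local file onto the mount.
dest = os.path.join(MOUNT, "landing", "batch.csv")
os.makedirs(os.path.dirname(dest), exist_ok=True)
shutil.copyfile(STAGING, dest)

# Read it back the same way any local file would be read.
with open(dest) as f:
    print(f.readline())
```

Any tool that can read and write local files could work through the mount in the same way.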

NFS and HDFS

Network File System (NFS) is a distributed file system protocol that allows access to files on a remote computer in much the same way a local file system is accessed. With an NFS gateway for Hadoop, files can be browsed, downloaded from, and written to HDFS as if it were a local file system. These are critical enterprise requirements.

Bringing the full capability of NFS to HDFS is an important strategic initiative for us. In the first phase, we have enabled NFSv3 access to HDFS. This is done through the NFS Gateway, a stateless daemon that translates the NFS protocol into HDFS access protocols, as shown in the following diagram. Many instances of this daemon can be run to provide high-throughput read/write access to HDFS from multiple clients. As part of this work, HDFS gained a significant piece of functionality: support for inode IDs, which serve as file handles. That work was done in Apache JIRA HDFS-4489.

[Diagram: NFS clients access HDFS through the NFS Gateway, which translates the NFS protocol into HDFS access protocols]
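
One way to see why the inode-ID support matters: NFSv3 file handles must keep identifying a file even when its path changes. The toy check below (with a hypothetical file on a hypothetical mount) illustrates the idea using POSIX stat; it is an illustration of the concept, not a test of the gateway itself:

```python
# Illustration: an NFS file handle is tied to a file's inode ID, not its path,
# so the identity reported by stat() should survive a rename on the mount.
import os

path = "/mnt/hdfs/data/report.txt"         # hypothetical file on the HDFS NFS mount
renamed = "/mnt/hdfs/data/report_old.txt"

before = os.stat(path).st_ino              # identity before the rename
os.rename(path, renamed)
after = os.stat(renamed).st_ino            # identity after the rename

# With inode-based file handles (HDFS-4489), the identity is stable across renames.
print(before == after)
```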

We are excited to work with the community on a robust roadmap for NFS functionality, focusing on the following capabilities:

  • NFSv4 and other protocols for access to HDFS
  • Highly Available NFS Gateway
  • Secure Hadoop (Kerberos) integration

The first phase of development is complete and is undergoing rigorous testing and stabilization. This set of functionality is being run through our integrated HDP stack test suite to ensure enterprise readiness.

The NFS Gateway functionality is being made available in the community and can be tracked in JIRA HDFS-4750.

Categorized by: HDFS

Comments

June 25, 2014 at 7:53 am

I would like to integrate HDFS and NFS so that I can create analytic pipelines using open source tools that are not Hadoop aware (they depend on the NFS file system) and then do the translation into HDFS. Does NFS Gateway access to HDFS provide this ability? Are there drivers or modules available? Thanks!

    Brandon Li | June 25, 2014 at 11:43 am

    No extra drivers or modules are needed. After you start HDFS and the NFS gateway, you can mount the HDFS export like a regular NFS export. In terms of data ingestion, random writes to an existing file are not supported yet; file append is supported.

    Thanks,
    Brandon
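
A minimal sketch of the distinction Brandon describes, assuming the HDFS export is mounted at the hypothetical path /mnt/hdfs: appending to an existing file goes through, while rewriting earlier bytes in place is expected to be rejected.

```python
import os

path = "/mnt/hdfs/logs/events.log"   # hypothetical file on the HDFS NFS mount

# Append: supported - the gateway turns this into an HDFS append.
with open(path, "a") as f:
    f.write("new event\n")

# Random write: not supported - overwriting earlier bytes of an existing
# file is expected to fail, since HDFS files are write-once/append-only.
try:
    with open(path, "r+b") as f:
        f.seek(0)
        f.write(b"X")
        f.flush()
        os.fsync(f.fileno())         # push the write through the NFS client cache
except OSError as e:
    print("random write rejected:", e)
```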

Hari | June 11, 2014 at 11:13 pm

Can I use NFS over HDFS for running virtual machines in XenServer/VMware?

    Brandon Li | June 16, 2014 at 2:11 pm

    If you want to mount the NFS export on the virtual machine, yes, you can. It’s no different from mounting it on a physical Linux box.

    Please let me know if I misunderstood your question.

    Thanks,
    Brandon

Preetham | May 6, 2014 at 11:23 pm

Do we have to set up UIDs on the NFS server to implement AUTH_UNIX? What configuration changes are required to implement AUTH_UNIX?

    Brandon Li | May 8, 2014 at 9:58 am

    AUTH_UNIX is usually the default NFS security policy and you don’t have to do anything special.

    If the user accounts in your cluster are managed by a name service such as LDAP or NIS, the UID should be the same for the same user on both the client and the server.

    Thanks,
    Brandon
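
For a quick sanity check of that UID mapping, something like the following can be run on both the NFS client and the gateway host to confirm the numbers match (the account name is only an example):

```python
# Print the UID/GID a given account resolves to on this host; run the same
# check on the client and on the gateway host and compare the output.
import grp
import pwd
import sys

user = sys.argv[1] if len(sys.argv) > 1 else "hdfs"   # account name is just an example
entry = pwd.getpwnam(user)
print(f"{user}: uid={entry.pw_uid} gid={entry.pw_gid} "
      f"group={grp.getgrgid(entry.pw_gid).gr_name}")
```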

Eric W | May 6, 2014 at 12:34 pm

Would it be crazy to install an NFS gateway on each DataNode and use a load balancer (TCP or DNS RR) to spread NFS connections across the entire cluster?

    Brandon Li | May 6, 2014 at 2:07 pm

    Actually, starting the NFS gateway on multiple DataNodes is one way to increase throughput, for example by using each gateway to load or download a different batch of files.

    Since HDFS doesn’t support multiple writers, spreading writes of the same file across multiple NFS gateways will be a problem. However, reads should be fine.
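
A rough sketch of the read-side pattern Brandon describes, assuming several gateways are mounted at hypothetical paths /mnt/hdfs-gw1, /mnt/hdfs-gw2, and /mnt/hdfs-gw3: fan reads of different files out across the mounts, and keep all writes to any single file on one gateway.

```python
# Spread reads of many HDFS files across several NFS gateway mounts.
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

# Hypothetical mount points, each backed by a different NFS gateway
# exporting the same HDFS namespace.
MOUNTS = ["/mnt/hdfs-gw1", "/mnt/hdfs-gw2", "/mnt/hdfs-gw3"]
FILES = ["data/part-%05d.csv" % i for i in range(8)]   # paths relative to the export

def read_one(mount, rel_path):
    """Read one file through a specific gateway mount and return its size."""
    with open(f"{mount}/{rel_path}", "rb") as f:
        return rel_path, len(f.read())

# Round-robin the files across the mounts and read them in parallel.
with ThreadPoolExecutor(max_workers=len(MOUNTS)) as pool:
    jobs = [pool.submit(read_one, m, p) for m, p in zip(cycle(MOUNTS), FILES)]
    for job in jobs:
        print(*job.result())
```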

Jacinda | April 9, 2014 at 6:39 pm

How can I set up NFS access to HDFS end to end? Is it just a matter of configuring hdfs-default.xml and then starting the portmap and nfs3 services? I can’t get the portmap service to start; it stops right after it is launched. Please help me, thank you very much!

Jacinda | April 8, 2014 at 8:15 pm

How can I access HDFS via NFS? I don’t know the full procedure!

Bhaskie | March 13, 2014 at 2:43 am

Is there any performance improvement if an application uses NFS to write data to HDFS?

Ryan Gerlach | January 31, 2014 at 9:04 am

Hi, does the NFS gateway in HDP 2.0 support clusters using Kerberos?
Ryan

    Brandon Li | February 3, 2014 at 11:12 am

    Not now, but Kerberos support is on the road map and we are working on the implementation.

    Thanks,
    Brandon

      Ryan | April 9, 2014 at 4:25 am

      Thanks Brandon. I am trying to find out whether Kerberos support is included in HDP 2.1; do you know if it is? I took the code from this JIRA and applied it to our HDP 2.0.
      https://issues.apache.org/jira/browse/HDFS-5804
      Thanks,
      Ryan

        Brandon Li | April 9, 2014 at 10:20 am

        Hi Ryan, HDP 2.1 will include Kerberos support. With HDP 2.1, the NFS gateway can access a secure HDFS cluster, as discussed in HDFS-5804.

        Thanks,
        Brandon

K Francisco | June 20, 2013 at 8:30 am

Nice. A win for ease of management and ad-hoc file access/updates.

Asim Praveen | May 13, 2013 at 9:32 pm

Can the NFS gateway be collocated on each NFS client, as a special case of the multiple-gateways setup?

    Brandon Li | August 22, 2013 at 1:35 pm

    Yes. The NFS gateway machine needs everything required to run an HDFS client, such as the Hadoop core JAR files and the HADOOP_CONF directory.

    The NFS gateway can be on any DataNode, NameNode, or any HDP client machine.

      Evan Zhou | June 11, 2014 at 2:14 pm

      Can we set up multiple NFS gateway services? For example, we have several clients and they could connect to different gateways; that way the read/write throughput could be improved.

        Brandon Li | June 11, 2014 at 2:38 pm

        Yes. You can start multiple NFS gateways on DataNodes or client nodes to improve throughput.
        Also, each gateway can export a different directory by configuring “dfs.nfs3.export.point” (renamed to “nfs.export.point” in Hadoop 2.5 and later releases). By default, the only export is “/”.

        Thanks,
        Brandon

Nikhil Mulley | May 13, 2013 at 1:20 pm

Another question, just a quick thought: since NFS-like behaviour is being brought to Hadoop, can I use the semantics of exports.local to say that particular namespaces are allowed only to a particular set of named hosts, IP addresses, or groups of hosts via netgroups, and apply some user-level ACLs as well? If so, I could keep the filesystem exposure via NFS controlled and restrictive.

    Brandon Li | August 22, 2013 at 1:25 pm

    Yes. JIRA HDFS-4947 tracks the effort to support an export table in the NFS gateway.

Nikhil Mulley | May 13, 2013 at 1:15 pm

Hi Srinivas,

This is a great value addition to the Hadoop infrastructure stack. Though I have not looked at the JIRA pages yet, what happened to providing mount functionality for HDFS on client machines using FUSE? Is there a difference in functionality or approach (and possibly any gains) in using the NFS gateway rather than mounting HDFS via FUSE?

Nikhil

    Brandon Li | August 22, 2013 at 1:29 pm

    There are a few problems with using FUSE to provide an NFS mount for HDFS.
    Unlike NFSv3, FUSE is not inode based; it usually uses the path to generate the NFS file handle. That path-to-handle mapping can make the host run out of memory, and even if it works around the memory problem, it can have correctness issues: FUSE may not be aware that a file’s path has been changed by other means (e.g., the hadoop CLI). Also, if FUSE is used on the client side, each NFS client has to install a client component, which so far runs only on Linux.
