May 13, 2013

Simplifying data management: NFS access to HDFS

We are excited that another critical Enterprise Hadoop integration requirement, NFS Gateway access to HDFS, is making progress through the main Apache Hadoop trunk. The feature was architected and designed by Brandon Li and Suresh Srinivas and is being delivered by the community. You can track progress in Apache JIRA HDFS-4750.

With NFS access to HDFS, you can mount the HDFS cluster as a volume on client machines and use native command-line tools, scripts, or a file-explorer UI to view HDFS files and load data into HDFS. NFS thus enables file-based applications to perform read and write operations directly against Hadoop. This greatly simplifies data management in Hadoop and expands Hadoop's integration with existing toolsets.
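
A hedged sketch of what this looks like from a Linux client, assuming a gateway is already running on a hypothetical host nfs-gw.example.com (the mount options match the NFSv3 mount shown in the comments below):

```
# Mount the HDFS namespace exported by the NFS gateway (NFSv3 only)
sudo mkdir -p /hdfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock nfs-gw.example.com:/ /hdfs

# Ordinary file tools now operate directly on HDFS files
ls /hdfs/user
cp /tmp/data.csv /hdfs/user/hadoop/
```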

NFS and HDFS

Network File System (NFS) is a distributed file system protocol that allows access to files on a remote computer in a manner similar to how a local file system is accessed. With an NFS gateway for Hadoop, files can now be browsed, downloaded, and written to and from HDFS as if it were a local file system. These are critical enterprise requirements.

Bringing the full capability of NFS to HDFS is an important strategic initiative for us. In the first phase, we have enabled NFSv3 access to HDFS. This is done using the NFS Gateway, a stateless daemon that translates the NFS protocol into HDFS access protocols, as shown in the following diagram. Many instances of this daemon can be run to provide high-throughput read/write access to HDFS from multiple clients. As part of this work, HDFS gained a significant piece of functionality: support for inode IDs, which serve as NFS file handles. That work was done in Apache JIRA HDFS-4489.

[Diagram: the NFS Gateway daemon translating NFS protocol operations into HDFS access protocols]
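
To make the architecture concrete, here is a minimal sketch of bringing a gateway up; the exact service and script names vary by release and distribution, so treat this as an assumption-laden illustration rather than the definitive procedure:

```
# On the gateway host: stop the system NFS services so the gateway's
# own daemons can bind the standard NFS/portmap ports
sudo service nfs stop
sudo service rpcbind stop

# Start the gateway's portmap and nfs3 daemons
hadoop-daemon.sh start portmap
hadoop-daemon.sh start nfs3

# Verify the export is visible and the RPC services are registered
showmount -e localhost
rpcinfo -p localhost
```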

We are excited to work with the community on a robust roadmap for NFS functionality, focusing on the following capabilities:

  • NFSv4 and other protocols for access to HDFS
  • Highly Available NFS Gateway
  • Secure Hadoop (Kerberos) integration

The first phase of development is complete and is undergoing rigorous testing and stabilization. This set of functionality is being run through our integrated HDP stack test suite to ensure enterprise readiness.

The NFS Gateway functionality is being made available in the community and can be tracked in JIRA HDFS-4750.


Comments

  • Hi Srinivas,

    This is a great value addition to the Hadoop infrastructure stack. Though I have not looked at the JIRA pages yet: what happened to providing mount functionality for HDFS on client machines using the FUSE system? Is there a difference in functionality or approach (and possibly any gains) in using an NFS gateway versus mounting HDFS via FUSE?

    Nikhil

    • There are a few problems with using FUSE to provide an NFS mount for HDFS.
      Unlike NFSv3, FUSE is not inode based; it usually uses the file path to generate the NFS file handle. That path-to-handle mapping can make the host run out of memory, and even if the memory problem is worked around, it can have correctness issues: FUSE may not be aware that a file's path has been changed by other means (e.g., the Hadoop CLI). Also, if FUSE is used on the client side, each NFS client has to install a client component, which so far runs only on Linux.

  • Another question, just a quick thought: since NFS-like behaviour is being brought to Hadoop, can I use the semantics of exports.local and say that particular namespaces are allowed only to a particular set of named hosts, IP addresses, or groups of hosts via netgroups, and put some user-level ACLs on top as well? If yes, I could achieve controlled and restricted exposure of the filesystem via NFS.

    • Yes. The NFS gateway machine needs everything required to run an HDFS client, such as the Hadoop core JAR file and the HADOOP_CONF directory.

      The NFS gateway can be on any DataNode, NameNode, or any HDP client machine.
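
      On the host-restriction question above: later Hadoop releases expose a host-level access-control property for the gateway. A hedged sketch follows (the property name and availability depend on your release, and the hosts shown are hypothetical):

      ```
      # hdfs-site.xml fragment on the gateway host, shown as a comment;
      # entries are "host-or-network [rw|ro]" separated by ";" and the
      # default is "* rw":
      #
      #   <property>
      #     <name>dfs.nfs.exports.allowed.hosts</name>
      #     <value>192.168.0.0/22 rw ; host1.example.com ro</value>
      #   </property>

      # Confirm what the gateway actually picked up
      hdfs getconf -confKey dfs.nfs.exports.allowed.hosts
      ```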

      • Can we set up multiple NFS gateway services?
        For example, if we have several clients, they could connect to different gateways; that way, the read/write throughput could be improved.

        • Yes. You can start multiple NFS gateways on DataNodes or client nodes to improve throughput.
          Also, each gateway can export a different directory by configuring "dfs.nfs3.export.point" (renamed to "nfs.export.point" in Hadoop 2.5 and later releases). By default, the only export is "/".
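
          As a hedged illustration of that setup (hostnames and paths are hypothetical; each gateway sets its own export point in its hdfs-site.xml):

          ```
          # Per-gateway hdfs-site.xml fragment, shown as a comment
          # (dfs.nfs3.export.point before Hadoop 2.5, nfs.export.point after):
          #   <property>
          #     <name>nfs.export.point</name>
          #     <value>/data</value>
          #   </property>

          # A client then mounts each gateway's export separately
          sudo mount -t nfs -o vers=3,proto=tcp,nolock gw-a.example.com:/data /mnt/data
          sudo mount -t nfs -o vers=3,proto=tcp,nolock gw-b.example.com:/logs /mnt/logs
          ```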

          Thanks,
          Brandon

          • If I have multiple clients and multiple NFS gateways, I can mount gateway 1 on clients 1-3 and gateway 2 on clients 4-6, but that is not manageable on the client side. Does a load balancer work with multiple NFS gateways?

            Another question: Which one is faster, NFS gateway or HttpFS?

          • Currently there is no built-in load balancer for the NFS gateway. The client needs to manage the load on each mount point if multiple NFS gateways' exports are mounted on the same client.

            Regarding performance, I think it depends on the workload. The NFS gateway works well with large numbers of small-file operations.

  • How can I get the whole NFS-access-to-HDFS feature working? Do I only need to configure hdfs-default.xml and then start the portmap and nfs3 services? I can't start the portmap service, because it stops right after starting. Please help me, thank you very much!

  • Would it be crazy to install an NFS gateway on each DataNode and use a load balancer (TCP or DNS RR) to spread NFS connections across the entire cluster?

    • Actually, starting the NFS gateway on multiple DataNodes is one way to increase throughput, for example by using each gateway to load or download a different batch of files.

      Since HDFS doesn't support multiple writers, spreading writes of the same file across multiple NFS gateways will be a problem. However, reads should be fine.

  • Do we have to set up UIDs on the NFS server to implement AUTH_UNIX? What configuration changes are required to implement AUTH_UNIX?

    • AUTH_UNIX is usually the default NFS security policy and you don’t have to do anything special.

      If the user accounts in your cluster are managed by a name service such as LDAP or NIS, the UID should be the same for the same user on both the client and the server.
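
      A quick way to check that precondition (a hedged sketch; the username and host are hypothetical):

      ```
      # AUTH_UNIX identifies users by numeric UID, so the same user must
      # map to the same UID on the NFS client and on the gateway host
      id -u alice                      # on the client
      ssh gateway-host 'id -u alice'   # should print the same number
      ```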

      Thanks,
      Brandon

    • If you want to mount the NFS export on a virtual machine, yes, you can. It's no different from mounting it on a physical Linux box.

      Please let me know if I misunderstood your question.

      Thanks,
      Brandon

  • I would like to integrate HDFS and NFS so that I can create analytic pipelines using open source tools that are not Hadoop aware (they depend on the NFS file system), and then have the translation into HDFS handled for me. Does NFS Gateway access to HDFS provide this ability? Are there drivers or modules available? Thanks!

    • No extra drivers or modules are needed. After you start HDFS and the NFS gateway, you can mount the HDFS export as a regular NFS export. In terms of data ingestion, random writes to an existing file are not supported yet; file append is supported.
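
      A hedged illustration of that distinction on a mounted export (the mount point /hdfs and the paths are hypothetical):

      ```
      # Appending to an existing file: supported
      echo "new record" >> /hdfs/user/hadoop/events.log

      # Overwriting in place at a fixed offset (a random write):
      # expected to fail, since random writes are not yet supported
      dd if=/dev/zero of=/hdfs/user/hadoop/events.log \
         bs=1 seek=10 count=4 conv=notrunc
      ```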

      Thanks,
      Brandon

  • Hello,
    Can someone explain how NFS works with an HA cluster in a failover scenario? More specifically:
    I have mounted my resources like this:
    mount -t nfs -o vers=3,proto=tcp,nolock $server:/ $mount_point, where $server is my active NameNode IP.
    Everything works fine, but if some error happens and my standby NameNode becomes the active NameNode, my mount won't work anymore, am I right?
    If I'm wrong, please correct me, ideally with some docs.
    Best regards, Daniel
