Simplifying data management: NFS access to HDFS

We are excited that another critical Enterprise Hadoop integration requirement – NFS Gateway access to HDFS – is progressing through the main Apache Hadoop trunk. This effort was architected and designed by Brandon Li and Suresh Srinivas, and is being delivered by the community. You can track progress in Apache JIRA HDFS-4750.

With NFS access to HDFS, you can mount an HDFS cluster as a volume on client machines and use the native command line, scripts, or a file-explorer UI to view HDFS files and load data into HDFS. NFS thus enables file-based applications to perform file read and write operations directly against Hadoop. This greatly simplifies data management in Hadoop and expands Hadoop's integration with existing toolsets.
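
To make this concrete, here is a minimal sketch of a file-based application working against HDFS through an NFS mount, using only ordinary local-file I/O. The mount point /mnt/hdfs is a hypothetical example path chosen for illustration, not a default.

    import os

    # Hypothetical mount point where the HDFS cluster is mounted via the NFS gateway.
    HDFS_MOUNT = "/mnt/hdfs"

    data_dir = os.path.join(HDFS_MOUNT, "data")
    os.makedirs(data_dir, exist_ok=True)          # directory operations work through the mount
    report_path = os.path.join(data_dir, "report.csv")

    # Write a file into HDFS with ordinary local-file I/O (sequential writes).
    with open(report_path, "w") as f:
        f.write("id,value\n")
        f.write("1,42\n")

    # Read it back and browse the directory exactly as with a local file system.
    with open(report_path) as f:
        print(f.read())
    print(os.listdir(data_dir))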

NFS and HDFS

Network File System (NFS) is a distributed file system protocol that allows access to files on a remote computer in a manner similar to how a local file system is accessed. With an NFS gateway for Hadoop, files can now be browsed, downloaded, and written to and from HDFS as if it were a local file system. These are critical enterprise requirements.

Bringing the full capability of NFS to HDFS is an important strategic initiative for us. In the first phase, we have enabled NFSv3 access to HDFS. This is done using the NFS Gateway, a stateless daemon that translates the NFS protocol into HDFS access protocols, as shown in the following diagram. Many instances of this daemon can be run to provide high-throughput read/write access to HDFS from multiple clients. As part of this work, HDFS gained a significant piece of functionality: support for inode IDs, which serve as file handles. That work was done in Apache JIRA HDFS-4489.

[Diagram: NFS Gateway translating the NFS protocol into HDFS access protocols between NFS clients and the HDFS cluster]
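
The inode ID work matters because NFSv3 clients identify files by opaque handles that must remain valid across renames, so handles must be derived from stable inode IDs rather than paths. The following is a conceptual sketch of that idea, purely for illustration; it is not the gateway's actual implementation.

    # Conceptual sketch: file handles keyed by inode ID survive renames,
    # whereas path-based handles would dangle. Not the actual gateway code.

    class MiniNamespace:
        def __init__(self):
            self._next_id = 0
            self._inode_by_path = {}   # path -> inode ID
            self._path_by_inode = {}   # inode ID -> current path

        def create(self, path):
            self._next_id += 1
            self._inode_by_path[path] = self._next_id
            self._path_by_inode[self._next_id] = path
            return self._next_id       # the stable "file handle" given to the client

        def rename(self, old, new):
            inode = self._inode_by_path.pop(old)
            self._inode_by_path[new] = inode
            self._path_by_inode[inode] = new

        def resolve(self, handle):
            return self._path_by_inode[handle]

    ns = MiniNamespace()
    handle = ns.create("/user/alice/data.txt")
    ns.rename("/user/alice/data.txt", "/user/alice/archive.txt")
    # The handle still resolves after the rename; a path-based handle would not.
    print(ns.resolve(handle))   # -> /user/alice/archive.txt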

We are excited to work with the community to enable a robust roadmap for NFS functionality, focusing on the following capabilities:

  • NFSv4 and other protocols for access to HDFS
  • Highly Available NFS Gateway
  • Secure Hadoop (Kerberos) integration

The first phase of development is complete and is undergoing rigorous testing and stabilization. This set of functionality is being run through our integrated HDP stack test suite to ensure enterprise readiness.

The NFS Gateway functionality is being made available in the community and can be tracked in JIRA HDFS-4750.


Comments

Jacinda | April 9, 2014 at 6:39 pm

How can I set up the complete NFS access to HDFS? Do I only need to configure hdfs-default.xml and then start the portmap and nfs3 services? But I can't start the portmap service, because it stops during startup! So please help me, thank you very much!

Jacinda | April 8, 2014 at 8:15 pm

How can I access HDFS via NFS? I don't know the full procedure!

Bhaskie | March 13, 2014 at 2:43 am

Is there any performance improvement if an application uses NFS to write data to HDFS?

Ryan Gerlach | January 31, 2014 at 9:04 am

Hi, does the NFS gateway in HDP 2.0 support clusters using Kerberos?
Ryan

    Brandon Li | February 3, 2014 at 11:12 am

    Not now, but Kerberos support is on the road map and we are working on the implementation.

    Thanks,
    Brandon

      Ryan | April 9, 2014 at 4:25 am

      Thanks Brandon. I am trying to track down if Kerberos support is included in HDP 2.1, do you know if it is? I had taken code from this Jira and applied it to our HDP 2.0.
      https://issues.apache.org/jira/browse/HDFS-5804
      Thanks,
      Ryan

        Brandon Li | April 9, 2014 at 10:20 am

        Hi Ryan, HDP 2.1 will include Kerberos support. With HDP 2.1, the NFS gateway can access a secure HDFS cluster, as discussed in HDFS-5804.

        Thanks,
        Brandon

K Francisco | June 20, 2013 at 8:30 am

Nice. A win for ease of management and ad-hoc file access/updates.

Asim Praveen | May 13, 2013 at 9:32 pm

Can the NFS gateway be collocated on each NFS client, as a special case of the multiple-gateways setup?

    Brandon Li | August 22, 2013 at 1:35 pm

    Yes. The NFS gateway machine needs everything required to run an HDFS client, such as the Hadoop core JAR file and the HADOOP_CONF directory.

    The NFS gateway can be on any DataNode, NameNode, or any HDP client machine.

Nikhil Mulley | May 13, 2013 at 1:20 pm

Another question, just a quick thought: since NFS-like behaviour is being brought to Hadoop, can I use exports.local-style semantics and say that particular namespaces are allowed only to a particular set of named hosts, IP addresses, or groups of hosts via netgroups, and put some user-level ACLs on them as well? If yes, I could keep the file system exposure via NFS controlled and restrictive.

    Brandon Li | August 22, 2013 at 1:25 pm

    Yes. JIRA HDFS-4947 tracks the effort to support an export table in the NFS gateway.

Nikhil Mulley | May 13, 2013 at 1:15 pm

Hi Srinivas,

This is a great value addition to the Hadoop infrastructure stack. Though I have not looked at the JIRA pages yet, what happened to providing mount functionality for HDFS on client machines using FUSE? Is there a difference in functionality or approach (and possibly any gains?) between using the NFS gateway and mounting HDFS via FUSE?

Nikhil

    Brandon Li | August 22, 2013 at 1:29 pm

    There are a few problems with using FUSE to provide an NFS mount for HDFS.
    Unlike NFSv3, FUSE is not inode based; FUSE usually uses the path to generate the NFS file handle. Its path-to-handle mapping can make the host run out of memory, and even if it can work around the memory problem, it can have correctness issues: FUSE may not be aware that a file's path has been changed by other means (e.g., the Hadoop CLI). Also, if FUSE is used on the client side, each NFS client has to install a client component, which so far runs only on Linux.
