We are excited that another critical Enterprise Hadoop integration requirement – NFS Gateway access to HDFS – is making progress through the main Apache Hadoop trunk. This effort was architected and designed by Brandon Li and Suresh Srinivas, and is being delivered by the community. You can track progress in Apache JIRA HDFS-4750.
With NFS access to HDFS, you can mount an HDFS cluster as a volume on client machines and use the native command line, scripts, or a file-explorer UI to view HDFS files and load data into HDFS. NFS thus enables file-based applications to read and write files directly in Hadoop. This greatly simplifies data management in Hadoop and expands Hadoop's integration with existing toolsets.
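For illustration, once a gateway is running, a Linux client might mount and browse HDFS like this (the gateway hostname nfsgw.example.com and the paths are hypothetical):

    # Mount the HDFS root exported by the NFS gateway
    mount -t nfs -o vers=3,proto=tcp,nolock nfsgw.example.com:/ /mnt/hdfs
    # Browse and load data with ordinary file tools
    ls /mnt/hdfs/user
    cp /tmp/sales.csv /mnt/hdfs/user/alice/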
Network File System (NFS) is a distributed file system protocol that allows access to files on a remote computer in a manner similar to how a local file system is accessed. With an NFS gateway for Hadoop, files can now be browsed, downloaded, and written to and from HDFS as if it were a local file system. These are critical enterprise requirements.
Bringing the full capability of NFS to HDFS is an important strategic initiative for us. In the first phase, we have enabled NFSv3 interface access to HDFS. This is done using the NFS Gateway, a stateless daemon that translates the NFS protocol into HDFS access protocols. Many instances of this daemon can be run to provide high-throughput read/write access to HDFS from multiple clients. As part of this work, HDFS gained a significant new capability: support for inode IDs, which serve as NFS file handles; that work was done in Apache JIRA HDFS-4489.
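As a rough sketch, bringing up the gateway on a Hadoop 2.x node looks like the following (exact commands vary by release; see the HDFS NFS Gateway user guide for your version):

    # Stop the system NFS services so the Hadoop portmap can bind the NFS ports
    service nfs stop
    service rpcbind stop
    # Start the Hadoop-provided portmap (needs root) and the nfs3 daemon
    hadoop-daemon.sh start portmap
    hadoop-daemon.sh start nfs3
    # Verify that the gateway has registered its exports
    rpcinfo -p localhost
    showmount -e localhost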
We are excited to work with the community to enable a robust roadmap for NFS functionality.
The first phase of development is complete and is undergoing rigorous testing and stabilization. This set of functionality is being run through our integrated HDP stack test suite to ensure enterprise readiness.
The NFS Gateway functionality is being made available in the community and can be tracked in JIRA HDFS-4750.
Comments
Hi Srinivas,
This is a great value addition to the Hadoop infrastructure stack. Though I have not looked at the JIRA pages yet, what happened to providing mount functionality for HDFS on client machines via the FUSE system? Is there a difference in functionality or in approach (and possibly any gains) between using the NFS gateway and mounting HDFS via FUSE?
Nikhil
There are a few problems with using FUSE to provide a mount for HDFS.
Unlike NFSv3, FUSE is not inode based; it usually uses the file path to generate the NFS file handle. Its path-to-handle mapping can make the host run out of memory, and even if it could work around the memory problem, it could have correctness issues: FUSE may not be aware that a file's path has been changed by other means (e.g., the Hadoop CLI). Also, if FUSE is used on the client side, each NFS client has to install a client component, which so far runs only on Linux.
Another question, just a quick thought: since NFS-like behaviour is being brought to Hadoop, can I use semantics like exports.local and say that particular namespaces are allowed only to a particular set of named hosts, IP addresses, or groups of hosts via netgroups, and put some user-level ACLs on top as well? If yes, I could keep the filesystem exposure via NFS controlled and restrictive.
Yes. JIRA HDFS-4947 tracks the effort to support an export table in the NFS gateway.
Can the NFS gateway be co-located on each NFS client, as a special case of the multiple-gateway setup?
Yes. The NFS gateway machine needs everything required to run an HDFS client, such as the Hadoop core JAR files and the HADOOP_CONF directory.
The NFS gateway can be on any DataNode, NameNode, or any HDP client machine.
Can we set up multiple NFS gateway services?
For example, we have several clients, and they could connect to different gateways; that way, read/write throughput could be improved.
Yes. You can start multiple NFS gateways on DataNodes or client nodes to improve throughput.
Also, each gateway can export a different directory by configuring "dfs.nfs3.export.point" (renamed to "nfs.export.point" in Hadoop 2.5 and later releases), as sketched below. By default, the only export is "/".
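For instance, a gateway could be restricted to exporting a single directory with an entry like the following in its hdfs-site.xml (the /data path is only an example; use the dfs.nfs3.export.point name on releases before Hadoop 2.5):

    <property>
      <name>nfs.export.point</name>
      <value>/data</value>
    </property>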
Thanks,
Brandon
If I have multiple clients and multiple NFS gateways, I can mount gateway 1 on clients 1-3 and gateway 2 on clients 4-6, but that is not manageable on the client side. Does a load balancer work with multiple NFS gateways?
Another question: Which one is faster, NFS gateway or HttpFS?
Currently there is no built-in load balancer for the NFS gateway. The client needs to manage the load on each mount point if multiple NFS gateways' exports are mounted on the same client.
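For example, a client using two gateways has to mount each export on its own mount point and decide for itself which mount to use for which files (the hostnames below are hypothetical):

    mount -t nfs -o vers=3,proto=tcp,nolock gw1.example.com:/ /mnt/hdfs-gw1
    mount -t nfs -o vers=3,proto=tcp,nolock gw2.example.com:/ /mnt/hdfs-gw2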
Regarding performance, I think it depends on the workload. The NFS gateway works well with a large number of small-file manipulations.
Nice. A win for ease of management and ad-hoc file access/updates.
Hi, does the NFS gateway in HDP 2.0 support clusters using Kerberos?
Ryan
Not yet, but Kerberos support is on the roadmap and we are working on the implementation.
Thanks,
Brandon
Thanks Brandon. I am trying to track down whether Kerberos support is included in HDP 2.1; do you know if it is? I took the code from this JIRA and applied it to our HDP 2.0.
https://issues.apache.org/jira/browse/HDFS-5804
Thanks,
Ryan
Hi Ryan, HDP 2.1 will include Kerberos support. With HDP 2.1, the NFS gateway can access a secure HDFS cluster, as discussed in HDFS-5804.
Thanks,
Brandon
Is there any performance improvement if an application uses NFS to write data to HDFS?
The NFS gateway uses DFSClient to access HDFS. There have recently been some performance improvements, and a comparison with FUSE, in JIRA HDFS-6080 (https://issues.apache.org/jira/browse/HDFS-6080).
How can I access HDFS via NFS? I don't know the overall procedure!
If you are using Apache Hadoop, here is the user guide for the 2.3 release: http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
If you are using HDP 2.0, you can find the user guide here: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_user-guide/content/user-guide-hdfs-nfs.html
How can I get the whole NFS-to-HDFS setup working? Is it just a matter of configuring hdfs-default.xml and then starting the portmap and nfs3 services? I can't start the portmap service; it hangs on startup. Please help me, thank you very much!
Would it be crazy to install an NFS gateway on each DataNode and use a load-balancing mechanism (TCP or DNS RR) to spread NFS connections across the entire cluster?
Actually, starting NFS gateways on multiple DataNodes is one way to increase throughput, for example, by using each gateway to load or download a different batch of files.
Since HDFS doesn't support multiple writers, spreading writes of the same file across multiple NFS gateways will be a problem. Reads, however, should be fine.
Do we have to set up UIDs on the NFS server to implement AUTH_UNIX? What configuration changes are required to implement AUTH_UNIX?
AUTH_UNIX is usually the default NFS security policy, and you don't have to do anything special.
If the user accounts in your cluster are managed by a name service such as LDAP/NIS, the UID should be the same for the same user on both the client and the server.
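A quick sanity check is to compare the numeric UID of a user on the client and on the gateway host; the user name alice below is just an example:

    # Run on both the NFS client and the gateway machine; the numbers should match
    id -u alice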
Thanks,
Brandon
Can I use NFS over HDFS for running virtual machines in XenServer/VMware?
If you want to mount the NFS export on the virtual machine, yes, you can. It's no different from mounting it on a physical Linux box.
Please let me know if I misunderstood your question.
Thanks,
Brandon
I would like to integrate HDFS and NFS so that I can create analytic pipelines using open-source tools that are not Hadoop-aware (they depend on the NFS file system) and then do the translation into HDFS. Does NFS Gateway access to HDFS provide this ability? Are there drivers or modules available? Thanks!
No extra drivers or modules are needed. After you start HDFS and the NFS gateway, you can mount the HDFS export as a regular NFS export. In terms of data ingestion, random writes to an existing file are not supported yet, but file append is supported.
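So, for example, an ingestion script can append through the mount with ordinary tools, while in-place overwrites will fail (the paths are illustrative):

    # Append to an existing file in HDFS: supported
    cat extra-records.csv >> /mnt/hdfs/data/records.csv
    # Overwriting bytes in the middle of an existing file (e.g., dd with seek=)
    # is a random write and is not supported.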
Thanks,
Brandon
Hello,
Can someone explain to me how NFS will work with an HA cluster in a failover scenario? More specifically:
I have mounted my resources like this:
mount -t nfs -o vers=3,proto=tcp,nolock $server:/ $mount_point
where $server is my active NameNode IP.
Everything works fine, but if some kind of error happens and my standby NameNode becomes the active NameNode,
my mount won't work anymore. Am I right?
If I'm wrong, please correct me with some docs.
Best regards, Daniel
Hi Daniel,
The NFS gateway can be started on the NameNode, a DataNode, or even a DFS client node. When you mount the export, use the IP address of the machine where the gateway is running. The gateway is essentially a DFSClient, so when failover happens, the DFSClient inside the NFS gateway automatically connects to the new active NameNode.
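In other words, mount the export from the gateway host rather than from either NameNode, and NameNode failover is handled for you; the gateway hostname below is hypothetical:

    mount -t nfs -o vers=3,proto=tcp,nolock nfsgw.example.com:/ /mnt/hdfs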
Thanks,
Brandon
Thanks for the fast response, Brandon. I might be greedy, but can you support me with some example/documentation?
Let's hope the answer will satisfy my boss.
I really appreciate your help.
Daniel
Here is the most recently released NFS user guide:
https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
So it is not possible to have a proxy for NFS? If clientA mounts HDFS from serverA and serverA fails/reboots/crashes, clientA loses access to HDFS. Is that correct?