WebHDFS – HTTP REST Access to HDFS

Motivation

Apache Hadoop provides a high-performance native protocol for accessing HDFS. While this is great for Hadoop applications running inside a Hadoop cluster, users often want to connect to HDFS from the outside. For example, some applications need to load data into and out of the cluster, or to interact with the data stored in HDFS from outside the cluster. They can of course do this using the native HDFS protocol, but that means installing Hadoop and a Java binding alongside those applications. To address this, we have developed an additional protocol for accessing HDFS using an industry-standard RESTful mechanism, called WebHDFS. WebHDFS takes advantage of the parallelism that a Hadoop cluster offers and retains the security of the native Hadoop protocol. It also fits well into the overall strategy of providing web services access to all Hadoop components.

WebHDFS Features

A Complete HDFS Interface: WebHDFS supports all HDFS user operations, including reading files, writing to files, making directories, changing permissions and renaming. In contrast, HFTP (an earlier HTTP protocol heavily used at Yahoo!) supports only the read operations. Read operations are those that do not change the state of HDFS, including the namespace tree and the file contents.

HTTP REST API: WebHDFS defines a public HTTP REST API, which permits clients to access Hadoop from multiple languages without installing Hadoop. You can use common tools like curl/wget to access HDFS.
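As a quick illustration (host, port and user name are placeholders), a single curl call is enough to ask the namenode for a user's home directory:

```shell
# Ask the namenode for the home directory; it answers directly with
# a small JSON object such as {"Path":"/user/someuser"}.
curl -i "http://host:port/webhdfs/v1/?op=GETHOMEDIRECTORY&user.name=someuser"
```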

Wire Compatibility: The REST API will be maintained for wire compatibility, so WebHDFS clients can talk to clusters running different Hadoop versions.

Secure Authentication: Core Hadoop uses Kerberos and Hadoop delegation tokens for security. WebHDFS uses the same mechanisms, Kerberos (SPNEGO) and Hadoop delegation tokens, for authentication.
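On a secure cluster, the flow can be sketched as follows (assuming a valid Kerberos ticket in the local cache and a curl build with SPNEGO support; host, port and the token value are placeholders):

```shell
# Authenticate via SPNEGO using the local Kerberos ticket cache
# and obtain a delegation token from the namenode.
curl -i --negotiate -u : "http://host:port/webhdfs/v1/?op=GETDELEGATIONTOKEN"

# Subsequent calls can pass the returned token instead of
# re-authenticating with Kerberos.
curl -i -L "http://host:port/webhdfs/v1/foo/bar?op=OPEN&delegation=<token>"
```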

Data Locality: File read and file write calls are redirected to the corresponding datanodes, so WebHDFS uses the full bandwidth of the Hadoop cluster for streaming data.

An HDFS Built-in Component: WebHDFS is a first-class, built-in component of HDFS. It runs inside namenodes and datanodes and can therefore use all HDFS functionality. It is a part of HDFS; there are no additional servers to install.

Apache Open Source: All the source code and documentation have been committed to the Hadoop code base and will be released with Hadoop 1.0. For more information, read the preliminary version of the WebHDFS REST API specification.

Simple Examples

The examples below use the curl command-line tool to access HDFS via the WebHDFS REST API.

Reading a file /foo/bar

curl -i -L "http://host:port/webhdfs/v1/foo/bar?op=OPEN"

curl then follows the Temporary Redirect response to a datanode and obtains the file data:

HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://datanode:50075/webhdfs/v1/foo/bar?op=OPEN&offset=0
Content-Length: 0
 
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 22
 
Hello, webhdfs user!
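The OPEN operation also takes optional offset and length parameters, so a client can fetch a byte range instead of the whole file:

```shell
# Read 8 bytes starting at offset 7 of /foo/bar; curl again follows
# the redirect to a datanode.
curl -i -L "http://host:port/webhdfs/v1/foo/bar?op=OPEN&offset=7&length=8"
```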

Listing the status of a file

curl -i "http://host:port/webhdfs/v1/foo/bar?op=GETFILESTATUS"

The namenode responds directly with the file status as a JSON object:

HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
 
{
  "FileStatus":
  {
    "accessTime"      : 1322596581499,
    "blockSize"       : 67108864,
    "group"           : "supergroup",
    "length"          : 22,
    "modificationTime": 1322596581499,
    "owner"           : "szetszwo",
    "pathSuffix"      : "",
    "permission"      : "644",
    "replication"     : 3,
    "type"            : "FILE"
  }
}
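The response is easy to consume programmatically. As a small illustration in Python (the summarize helper is invented for this post; note that the timestamps are milliseconds since the epoch and permission is an octal string):

```python
import json
from datetime import datetime, timezone

# The GETFILESTATUS response body from the example above.
response_body = """
{ "FileStatus": { "accessTime": 1322596581499, "blockSize": 67108864,
  "group": "supergroup", "length": 22, "modificationTime": 1322596581499,
  "owner": "szetszwo", "pathSuffix": "", "permission": "644",
  "replication": 3, "type": "FILE" } }
"""

def summarize(body):
    """Turn a GETFILESTATUS response body into a one-line description."""
    status = json.loads(body)["FileStatus"]
    # modificationTime is in milliseconds since the epoch.
    mtime = datetime.fromtimestamp(status["modificationTime"] / 1000.0,
                                   tz=timezone.utc)
    return "%s, %d bytes, mode %s, owner %s, modified %s" % (
        status["type"], status["length"], status["permission"],
        status["owner"], mtime.strftime("%Y-%m-%d"))

print(summarize(response_body))
```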

The HTTP responses are omitted for the following examples.

Listing a directory /foo

curl -i "http://host:port/webhdfs/v1/foo/?op=LISTSTATUS"

Renaming the file /foo/bar to /foo/bar2

curl -i -X PUT "http://host:port/webhdfs/v1/foo/bar?op=RENAME&destination=/foo/bar2"

Making a directory /foo2

curl -i -X PUT "http://host:port/webhdfs/v1/foo2?op=MKDIRS"
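For completeness, writing a file is the same two-step dance as reading. A sketch with curl (the file names are illustrative): the namenode answers the CREATE request with a Temporary Redirect, and the data goes in a second request to the datanode named in the Location header.

```shell
# Step 1: ask the namenode where to write. It replies with a 307
# whose Location header points at a datanode.
curl -i -X PUT "http://host:port/webhdfs/v1/foo2/newfile?op=CREATE"

# Step 2: send the file content to the URL from the Location header.
curl -i -X PUT -T localfile "<URL-from-Location-header>"
```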

Related Components and a Brief History

HFTP – this was the first mechanism that provided HTTP access to HDFS. It was designed to facilitate data copying between clusters running different Hadoop versions. HFTP is a part of HDFS and redirects clients to the datanode containing the data in order to provide data locality. However, it supports only the read operations. The HFTP HTTP API is neither curl/wget friendly nor RESTful. WebHDFS is a rewrite of HFTP and is intended to replace it.

HdfsProxy – an HDFS contrib project. It runs as external servers (outside HDFS) to provide a proxy service. Common use cases of HdfsProxy are firewall tunneling and user authentication mapping.

HdfsProxy V3 – Yahoo!’s internal version, a dramatic improvement over HdfsProxy. It has an HTTP REST API and other features such as bandwidth control. Nonetheless, it is not publicly available.

Hoop – a rewrite of HdfsProxy that aims to replace it. Hoop has an HTTP REST API. Like HdfsProxy, it runs as external servers to provide a proxy service. Because it is a proxy running outside HDFS, it cannot take advantage of some features, such as redirecting clients to the corresponding datanodes to provide data locality. It has advantages, however, in that it can be extended to control and limit bandwidth like HdfsProxy V3, or to carry out authentication translation from another mechanism to HDFS’s native Kerberos authentication. It can also serve as a proxy to other file systems, such as Amazon S3, via the Hadoop FileSystem API. At the time of writing, Hoop is in the process of being committed to Hadoop as an HDFS contrib project.

What is Next?

WebHDFS opens up opportunities for many new tools. For example, tools like FUSE modules or C/C++ client libraries built on WebHDFS are fairly straightforward to write, allowing existing Unix/Linux utilities and non-Java applications to interact with HDFS. Moreover, such tools require no Java binding and no Hadoop installation.
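As an illustration of how thin such a client can be, here is a sketch in Python using only the standard library (the class and method names are invented for this post, not part of any official client); it only builds WebHDFS URLs, which any HTTP client could then issue:

```python
from urllib.parse import urlencode

class WebHdfsClient:
    """A minimal, illustrative WebHDFS URL builder."""

    def __init__(self, host, port, user=None):
        self.base = "http://%s:%s/webhdfs/v1" % (host, port)
        self.user = user  # used for the user.name query parameter

    def url(self, path, op, **params):
        """Build the REST URL for an operation on an HDFS path."""
        if self.user:
            params["user.name"] = self.user
        query = urlencode([("op", op)] + sorted(params.items()))
        return "%s%s?%s" % (self.base, path, query)

client = WebHdfsClient("host", 50070, user="someuser")
print(client.url("/foo/bar", "OPEN", offset=0))
```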

Acknowledgement

WebHDFS was developed by Hortonworks engineers. Our work was influenced by HFTP, HdfsProxy, HdfsProxy V3 and Hoop. The idea of client redirection was taken from the HFTP work done originally at Yahoo!. We thank the Apache Hadoop community for their feedback, especially Alejandro Abdelnur for feedback at various stages and for the discussions on making WebHDFS and Hoop compatible at the API level.

~Nicholas Sze

Categorized by: Apache Hadoop HDFS

Comments

March 12, 2012 at 1:25 am

It turns out to be very easy to write client libraries for WebHDFS. I found a Python version at https://github.com/drelu/webhdfs-py. And I wrote a Ruby one at https://github.com/zenja/webhdfs-ruby.


Nicholas Sze
December 27, 2011 at 10:22 pm

@Frederik,

Regarding a FUSE implementation using WebHDFS: two JIRAs, HDFS-2656 (C client based on WebHDFS) and HDFS-2631 (rewrite fuse-dfs using WebHDFS), have been filed. You may want to watch them closely.

@Andre,

There is only one conf property to enable/disable WebHDFS, and two more conf properties for SPNEGO authentication. See the section “HDFS Configuration Options” in the WebHDFS doc in Hadoop 1.0 (an unreleased version). For your convenience, I have generated and posted the doc at http://people.apache.org/~szetszwo/hadoop1.0/webhdfs.html

Thanks to both of you!
Nicholas

Andre
December 26, 2011 at 5:19 am

Thanks for this great post. Some information on how to configure Hadoop in order to use WebHDFS would be really useful (maybe this would be a good follow up post).

December 22, 2011 at 11:26 pm

Hi,

Can I have more details about the FUSE implementation? When would that be integrated into the Hortonworks Data Platform?

I need to use tools I don’t control, and the only way is to use FUSE … ;/

Regards,
Fred

