WebHDFS – HTTP REST Access to HDFS

Motivation

Apache Hadoop provides a high performance native protocol for accessing HDFS. While this is great for Hadoop applications running inside a Hadoop cluster, users often want to connect to HDFS from the outside. For example, some applications need to load data into and out of the cluster, or to interact with the data stored in HDFS from the outside. Of course they can do this using the native HDFS protocol, but that means installing Hadoop and a Java binding alongside those applications. To address this we have developed an additional protocol for accessing HDFS using an industry-standard RESTful mechanism, called WebHDFS. WebHDFS takes advantage of the parallelism that a Hadoop cluster offers and retains the security of the native Hadoop protocol. It also fits well into the overall strategy of providing web services access to all Hadoop components.

WebHDFS Features

A Complete HDFS Interface: WebHDFS supports all HDFS user operations, including reading files, writing to files, making directories, changing permissions and renaming. In contrast, HFTP (an earlier HTTP protocol heavily used at Yahoo!) supports only the read operations, not the write operations. Read operations are those that do not change the state of HDFS, including the namespace tree and the file contents.

HTTP REST API: WebHDFS defines a public HTTP REST API, which permits clients to access Hadoop from multiple languages without installing Hadoop. You can use common tools like curl/wget to access HDFS.

Wire Compatibility: The REST API will be maintained for wire compatibility, so WebHDFS clients can talk to clusters running different Hadoop versions.

Secure Authentication: Core Hadoop uses Kerberos and Hadoop delegation tokens for security. WebHDFS likewise uses Kerberos (SPNEGO) and Hadoop delegation tokens for authentication.
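
For example, on a secure cluster a client can authenticate with SPNEGO through curl's built-in support (a sketch; the host, port and path are placeholders, a valid Kerberos ticket obtained via kinit is assumed, and curl must be built with GSS-API/SPNEGO support):

curl -i --negotiate -u : "http://host:port/webhdfs/v1/foo/bar?op=GETFILESTATUS"

A previously obtained Hadoop delegation token can instead be passed with the delegation query parameter (the token value is a placeholder):

curl -i "http://host:port/webhdfs/v1/foo/bar?op=GETFILESTATUS&delegation=<token>"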

Data Locality: The file read and file write calls are redirected to the corresponding datanodes, so data streaming uses the full bandwidth of the Hadoop cluster.

An HDFS Built-in Component: WebHDFS is a first-class, built-in component of HDFS. It runs inside Namenodes and Datanodes, so it can use all HDFS functionality. It is a part of HDFS – there are no additional servers to install.
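
Because WebHDFS runs inside the Namenode and Datanodes, turning it on is only a matter of configuration; a minimal sketch of the relevant hdfs-site.xml property (assuming a Hadoop 1.0-era setup):

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>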

Apache Open Source: All the source code and documentation have been committed to the Hadoop code base. It will be released with Hadoop 1.0. For more information, read the preliminary version of the WebHDFS REST API specification.

Simple Examples

The examples below use the curl command-line tool to access HDFS via the WebHDFS REST API.

Reading a file /foo/bar

curl -i -L "http://host:port/webhdfs/v1/foo/bar?op=OPEN"

Then, curl follows the Temporary Redirect response to a datanode and obtains the file data.

HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://datanode:50075/webhdfs/v1/foo/bar?op=OPEN&offset=0
Content-Length: 0

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 22

Hello, webhdfs user!

Listing the status of a file

curl -i "http://host:port/webhdfs/v1/foo/bar?op=GETFILESTATUS"

The response contains the file status as a JSON object.

HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked

{
  "FileStatus":
  {
    "accessTime"      : 1322596581499,
    "blockSize"       : 67108864,
    "group"           : "supergroup",
    "length"          : 22,
    "modificationTime": 1322596581499,
    "owner"           : "szetszwo",
    "pathSuffix"      : "",
    "permission"      : "644",
    "replication"     : 3,
    "type"            : "FILE"
  }
}

The HTTP responses are omitted for the following examples.

Listing a directory /foo

curl -i "http://host:port/webhdfs/v1/foo/?op=LISTSTATUS"

Renaming the file /foo/bar to /foo/bar2

curl -i -X PUT "http://host:port/webhdfs/v1/foo/bar?op=RENAME&destination=/foo/bar2"

Making a directory /foo2

curl -i -X PUT "http://host:port/webhdfs/v1/foo2?op=MKDIRS"
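
Creating a file /foo2/hello.txt from a local file hello.txt (a sketch; the file names are placeholders). Like reads, writes are redirected by the namenode to a datanode, so the operation takes two steps:

curl -i -X PUT "http://host:port/webhdfs/v1/foo2/hello.txt?op=CREATE"

The namenode replies with a 307 Temporary Redirect; the Location header it returns should be used verbatim to send the file data, so the URL below is only illustrative:

curl -i -X PUT -T hello.txt "http://datanode:50075/webhdfs/v1/foo2/hello.txt?op=CREATE&overwrite=false"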

Related Components and a Brief History

HFTP – this was the first mechanism that provided HTTP access to HDFS. It was designed to facilitate data copying between clusters running different Hadoop versions. HFTP is a part of HDFS and redirects clients to the datanode containing the data in order to provide data locality. However, it supports only the read operations, and its HTTP API is neither curl/wget friendly nor RESTful. WebHDFS is a rewrite of HFTP and is intended to replace it.

HdfsProxy – an HDFS contrib project. It runs as external servers (outside HDFS) to provide a proxy service. Common use cases of HdfsProxy are firewall tunneling and user authentication mapping.

HdfsProxy V3 – Yahoo!’s internal version, a dramatic improvement over HdfsProxy. It has an HTTP REST API and other features such as bandwidth control. It is not, however, publicly available.

Hoop – a rewrite of HdfsProxy that aims to replace it. Hoop has an HTTP REST API. Like HdfsProxy, it runs as external servers to provide a proxy service. Because it is a proxy running outside HDFS, it cannot take advantage of some features, such as redirecting clients to the corresponding datanodes for data locality. It has advantages, however: it can be extended to control and limit bandwidth like HdfsProxy V3, or to translate authentication from another mechanism to HDFS’s native Kerberos authentication. It can also provide proxy service for other file systems, such as Amazon S3, via the Hadoop FileSystem API. At the time of this writing, Hoop is in the process of being committed to Hadoop as an HDFS contrib project.

What is Next?

WebHDFS opens up opportunities for many new tools. For example, FUSE modules or C/C++ client libraries using WebHDFS are fairly straightforward to write, allowing existing Unix/Linux utilities and non-Java applications to interact with HDFS. Moreover, such tools require neither a Java binding nor a Hadoop installation.

Acknowledgement

WebHDFS was developed by Hortonworks engineers. Our work was influenced by HFTP, HdfsProxy, HdfsProxy V3 and Hoop. The idea of client redirection was taken from the HFTP work done originally at Yahoo!. We thank the Apache Hadoop community for their feedback, especially Alejandro Abdelnur for his input at various stages and for the discussions on making WebHDFS and Hoop compatible at the API level.

~Nicholas Sze
