August 11, 2011

Motivations for Apache Hadoop Security

As the former technical lead for the Yahoo! team that added security to Apache Hadoop, I thought I would provide a brief history.

The motivation for adding security to Apache Hadoop actually had little to do with the traditional notion of defending against hackers, since all large Hadoop clusters sit behind corporate firewalls that only allow employees access. Instead, the motivation was simply that security would allow us to use Hadoop more effectively to pool resources between otherwise disjoint groups. Larger clusters are much cheaper to operate and require fewer copies of duplicated data.

History and Motivation
When Yahoo! started with Apache Hadoop in February 2006, it was a component of Apache Lucene and only supported small 20-machine clusters that were usable by a single small team. In that environment, a lack of security did not matter: all of the users knew and trusted each other, and the information stored in Hadoop was available from other sources.
As Yahoo!’s usage of Hadoop expanded, we were always focused on making Apache Hadoop operate more effectively on large clusters. These large clusters operate as a shared service available to thousands of Yahoo! employees working on a wide range of projects. They allow “borrowing” of unused capacity between groups and are substantially cheaper to administer. For example, Yahoo! currently staffs the Hadoop operations group at 1 operator for every 3,000 machines, which is very low relative to similar systems.

Prior to Apache Hadoop, if a Yahoo! employee had an idea that required new analysis, they had to request hardware, find space in a data center, order it, get it provisioned, get access to the raw data, write their analysis program and then ultimately run it. The whole process took months. Now, it takes only a few hours to be authorized on the shared Hadoop clusters and the analyst is up and running.

The primary limitation to sharing Hadoop clusters has always been that Apache Hadoop had no mechanisms for verifying the identity of the user. Thus, in spite of having file owners and permissions in HDFS, it was trivial to impersonate a different user. Therefore, sensitive data such as PII or financial data had to be isolated on clusters with restricted user access, which provides security at the cost of limited cluster sharing.

A software engineering precept is that you should never add security to an existing product. However, supporting large shared Hadoop clusters required us to do exactly that. It was imperative that the code ensured that each and every access to HDFS was authenticated correctly and that the user had the appropriate access. Naturally, MapReduce jobs and tasks, which are the primary agents of HDFS access, also need to be authenticated.

Distributed authentication is very difficult, so our team chose Kerberos, a well-proven authentication system. Kerberos provides user-facing commands for managing credentials and libraries that support remote authentication. Another attractive feature of Kerberos is that it supports multiple authentication authorities that only partially trust each other. Since Microsoft's Active Directory is Kerberos-compatible, most organizations already have a Kerberos key server as a critical part of their infrastructure.

Although Kerberos is convenient for most organizations that need Apache Hadoop security, Hadoop's authentication sub-system was built on extensible standards such as JAAS and SASL that support alternative authentication schemes. These standards should let Hadoop keep pace as authentication systems evolve.
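For readers wondering what "turning on" Kerberos looks like in practice: in the configuration scheme that shipped with this work, switching Hadoop from its default trusting mode to Kerberos authentication is a matter of a few properties in core-site.xml. The fragment below is a minimal illustrative sketch, not part of the original release; consult the release documentation for the full set of principal and keytab settings.

```xml
<!-- core-site.xml: switch from the default "simple" (trusted) mode
     to Kerberos-based authentication and enable service-level
     authorization checks -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```

Each daemon (NameNode, DataNode, JobTracker, TaskTracker) additionally needs its own Kerberos principal and keytab configured so it can authenticate to the KDC.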

The security work has been deployed on Yahoo!'s shared clusters since April 2010 and on production clusters since August 2010. Thanks to the planning and inter-group coordination, it was one of the smoothest major Hadoop rollouts ever at Yahoo!. More importantly, all of this work has been contributed to Apache.

The team worked extremely hard to get the project done in the required timeframe. The development team consisted of Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O’Malley, Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, and Kan Zhang.

We created an Apache Hadoop release that includes all of the security work. This is the version of Apache Hadoop that Yahoo! has run on its 42,000 computers, and it has proven to be very stable, even on large 4,500-machine clusters. The next release is in the process of being tested by different organizations and should be released as soon as it is ready.

Stay tuned for additional blogs on security, including an upcoming blog by Devaraj Das in which he will discuss delegation tokens and how they are used in enforcing security in Hadoop.

— Owen O’Malley



Shyam says:

Hi,

I have a doubt about the key exchange between the NameNode and the DataNodes. Since symmetric-key cryptography is used for the NameNode's authentication with the DataNode (which runs as a JVM instance on every slave), I was wondering whether this key in the current Hadoop implementation is a single shared key or pairwise.

Example: for a single shared key, NM->DN1, NM->DN2 … NM->DNn authenticate with a shared key Ksh,

or if it is pairwise symmetric, like NM->DN1 with K1, NM->DN2 with K2 … NM->DNn with Kn,

where DN is DataNode and NM is NameNode.

Please correct me if my question is wrong.



Owen O'Malley says:

The RPC communication between the DataNodes and the NameNode is mutually authenticated using Kerberos (via GSSAPI and SASL). Each of the DataNodes and the NameNode has a unique random key that is generated for it and known by the Kerberos KDC. So in the context of your question it is pair-wise, although it is really pairwise with the KDC rather than the NameNode.

The validation of the block tokens is managed by having the NameNode generate a secret key once a day that is distributed to each of the DataNodes. The block tokens include an HMAC based on that secret, so all of the DataNodes can validate that each block token is valid.
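The HMAC scheme described above can be sketched as follows. This is an illustrative model only: the field layout, key-rotation string, and hash choice here are hypothetical and do not reflect Hadoop's actual token serialization.

```python
import hmac
import hashlib

def sign_block_token(secret: bytes, token_fields: bytes) -> bytes:
    """NameNode side: compute an HMAC over the serialized token fields."""
    return hmac.new(secret, token_fields, hashlib.sha1).digest()

def verify_block_token(secret: bytes, token_fields: bytes, mac: bytes) -> bool:
    """DataNode side: recompute the HMAC with the shared daily secret
    and compare it to the MAC carried in the token."""
    expected = sign_block_token(secret, token_fields)
    return hmac.compare_digest(expected, mac)

# The NameNode rotates this secret daily and distributes it to DataNodes.
daily_secret = b"example-daily-secret"  # hypothetical value
fields = b"user=alice;block=blk_123;access=READ"  # hypothetical layout
mac = sign_block_token(daily_secret, fields)

# A DataNode accepts the token only if the MAC checks out.
print(verify_block_token(daily_secret, fields, mac))                      # True
print(verify_block_token(daily_secret, b"user=mallory;block=blk_123", mac))  # False
```

The design means the NameNode never has to talk to a DataNode to authorize a read: possession of a correctly signed token is sufficient proof, and tampering with any field invalidates the MAC.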
