The motivation for adding security to Apache Hadoop actually had little to do with traditional notions of security, such as defending against hackers, since large Hadoop clusters typically sit behind corporate firewalls that admit only employees. Instead, the motivation was that security would let us use Hadoop more effectively to pool resources between disjoint groups. Larger clusters are much cheaper to operate and require fewer copies of duplicated data.
History and Motivation
When Yahoo! started with Apache Hadoop in February 2006, it was a component of Apache Lucene and only supported small clusters of about 20 machines that were usable by a single small team. In that environment, a lack of security did not matter: all of the users knew and trusted each other, and the information stored in Hadoop was available from other sources.
As Yahoo!’s usage of Hadoop expanded, we were always focused on making Apache Hadoop operate more effectively on large clusters. These large clusters operate as a shared service available to thousands of Yahoo! employees working on a wide range of projects. They allow “borrowing” of unused capacity between groups and are substantially cheaper to administer. For example, Yahoo! currently staffs the Hadoop operations group at 1 operator for every 3,000 machines, which is very low relative to similar systems.
Prior to Apache Hadoop, if a Yahoo! employee had an idea that required new analysis, they had to request hardware, find space in a data center, order the machines, get them provisioned, get access to the raw data, write their analysis program, and then finally run it. The whole process took months. Now, it takes only a few hours to be authorized on the shared Hadoop clusters, and the analyst is up and running.
The primary limitation to sharing Hadoop clusters has always been that Apache Hadoop had no mechanism for verifying the identity of the user; the servers simply trusted whatever username the client reported. Thus, in spite of HDFS having file owners and permissions, it was trivial to impersonate another user. Sensitive data, such as PII or financial data, therefore had to be isolated on clusters with restricted user access, which provides security at the cost of limited cluster sharing.
A software engineering precept holds that you should never bolt security onto an existing product. However, supporting large shared Hadoop clusters required us to do exactly that. It was imperative that the code ensure that each and every access to HDFS was correctly authenticated and that the user had the appropriate permissions. Naturally, MapReduce jobs and tasks, which are the primary agents of HDFS access, also needed to be authenticated.
Distributed authentication is very difficult, so our team chose to use Kerberos, a well-proven authentication system. Kerberos provides user-facing commands for managing credentials and libraries for supporting remote authentication. Another attractive feature of Kerberos is that it supports multiple authentication authorities that only partially trust each other. Since Microsoft's Active Directory is compatible with Kerberos, most organizations already have a Kerberos key server as a critical part of their infrastructure.
Although Kerberos is convenient for most organizations that need Apache Hadoop security, Hadoop's authentication sub-system was built on extensible standards such as JAAS and SASL that support alternative authentication schemes. These standards should let Hadoop keep pace as authentication systems evolve.
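As a concrete illustration, enabling Kerberos in a secure Hadoop deployment comes down to switching the cluster's authentication mode in core-site.xml. The fragment below is a minimal sketch using the property names from Hadoop's security configuration, shown as an assumption rather than quoted from this announcement:

```xml
<!-- core-site.xml: minimal sketch of enabling Hadoop security.
     Property names are assumptions drawn from Hadoop's security
     configuration, not from this post. -->
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <!-- "simple" (the default) trusts the client-reported username;
         "kerberos" requires real Kerberos credentials -->
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <!-- also enable service-level authorization checks -->
    <value>true</value>
  </property>
</configuration>
```

A real deployment additionally needs Kerberos principals and keytabs configured for each daemon; this fragment only shows the switch from the trusting default to authenticated operation.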
The security work has been deployed on Yahoo!'s shared clusters since April 2010 and on production clusters since August 2010. Thanks to careful planning and inter-group coordination, it was one of the smoothest major Hadoop rollouts ever at Yahoo!. More importantly, all of this work has been contributed back to Apache.
The team worked extremely hard to get the project done in the required timeframe. The development team consisted of Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O’Malley, Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, and Kan Zhang.
We created an Apache Hadoop release, 0.20.203.0, that includes all of the security work. This is the version of Apache Hadoop that Yahoo! runs on its 42,000 computers, and it has proven very stable, even on large 4,500-machine clusters. The next release, 0.20.204.0, is being tested by different organizations and should be released as soon as it is ready.
Stay tuned for additional blogs on security, including an upcoming blog by Devaraj Das in which he will discuss delegation tokens and how they are used in enforcing security in Hadoop.
— Owen O’Malley