Security for Enterprise Hadoop

Authentication, Authorization, Accounting and Data Protection

Hadoop has become a business-critical data platform at many of the world's largest enterprises. These corporations require four layers of security: authentication, authorization, accounting and data protection. Hortonworks continues to innovate in each of these areas, along with other members of the Apache open source community.

Securing a Hadoop Cluster

Authentication verifies the identity of a system or user accessing the cluster

Hadoop provides two modes of authentication: simple authentication and Kerberos authentication.
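The mode is selected in core-site.xml. A minimal sketch using the stock Apache Hadoop property names (values shown are illustrative; "simple" is the default):

```
<!-- core-site.xml: switch the cluster from simple to Kerberos authentication -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>  <!-- "simple" (default) or "kerberos" -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>  <!-- also enable service-level authorization checks -->
</property>
```

With Kerberos enabled, every daemon and user must present a valid Kerberos ticket before the cluster will accept a request.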

Hadoop provides these capabilities while integrating with widely adopted enterprise user stores (such as LDAP or Active Directory), so a single source of identity can serve the entire Hadoop stack.

Authorization specifies access privileges for a user or system

Knox Gateway 0.4.0 introduces the features that enterprise security officers expect for perimeter security of a Hadoop cluster: lookup of enterprise group permissions, service-level access control, protection from common web application vulnerabilities, and a pluggable auditing facility.

The various Apache projects in a Hadoop distribution also include their own access control features. HDFS enforces POSIX-style file permissions for authorization. MapReduce includes resource-level access control via ACLs. For data, Apache HBase provides authorization with ACLs on tables and column families, and Apache Accumulo extends this further with cell-level access control. Apache Hive provides coarse-grained access control on tables.
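To make the layers above concrete, the following illustrative transcript shows one command per layer; the paths, table names and user names are hypothetical, and each command runs against its respective service (HDFS shell, HBase shell, Hive):

```
# HDFS: POSIX-style permissions on a directory (hypothetical path)
hdfs dfs -chmod 750 /data/sales

# HBase shell: grant read access on a table to a user (hypothetical names)
grant 'analyst', 'R', 'sales'

# Hive: SQL-style grant on a table (hypothetical names)
GRANT SELECT ON TABLE sales TO USER analyst;
```

Each layer enforces its own policy independently, which is why a request can pass HDFS permissions yet still be rejected by a table-level ACL.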

Accounting tracks resource use within a Hadoop system

For security compliance or forensics, insight into historical data access events is critical. HDFS and MapReduce provide base audit support. The Apache Hive metastore records audit information about who interacts with Hive and when those interactions occur. Finally, Apache Oozie, the workflow engine, provides an audit trail for services.
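HDFS audit events are emitted through a dedicated logger that can be routed to its own appender. A sketch of the relevant log4j.properties lines, assuming an appender named RFAAUDIT is defined elsewhere in the same file:

```
# log4j.properties: route HDFS audit events to a dedicated audit appender
hdfs.audit.logger=INFO,RFAAUDIT
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
# keep audit records out of the general NameNode log
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
```

The resulting audit log records the user, source address, operation and target path for each file-system access, which is the raw material for the audit event correlation planned in Phase 3.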

Data protection ensures privacy and confidentiality of information

Hadoop and HDP allow you to protect data in motion. HDP provides encryption capability for various channels such as Remote Procedure Call (RPC), HTTP, JDBC/ODBC, and Data Transfer Protocol (DTP) to protect data in motion. For data at rest, encryption can be applied at the operating system level beneath HDFS.
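Two of these channels are switched on directly in the Hadoop configuration files. A minimal sketch using the stock property names:

```
<!-- core-site.xml: protect RPC traffic
     ("privacy" adds encryption on top of authentication and integrity) -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt the block transfer (DTP) channel
     between clients and DataNodes -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```

HTTP channels (web UIs, WebHDFS, Shuffle) are protected separately by enabling SSL, as listed in the Phase 1 and Phase 2 items of the timeline below.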

Initiative Goals

Improve authentication choices and provide granular access controls for the Hadoop platform, services and data.
Enhance Hadoop’s accounting and data protection capabilities in support of broader enterprise reporting, auditing, billing, and compliance needs.
Integrate with existing enterprise security and identity management systems.


Owen O’Malley, Devaraj Das, and Sanjay Radia co-wrote the original Hadoop security specification in 2011. Since then, Hortonworks developers and contributors from the open source community have delivered core Kerberos functions in Hadoop, and then augmented this work with delegation tokens, capability-like access tokens and the notion of trust for auxiliary services.

Continuing this leadership, the team at Hortonworks incubated the Apache Knox Gateway project in February 2013 to create a security perimeter for REST/HTTP access to Hadoop. Apache Knox version 0.4.0 will ship as a fully supported and certified component of HDP 2.1.

Phase 3 of the Hadoop security roadmap will deliver:

  • Audit event correlation & audit viewer
  • Support for other token-based authentication protocols (aka “NotOnlyKerberos”)
  • Data encryption for HDFS, HBase & Hive

Essential Timeline

    Phase 1 (shipped Q4 2013)
    • Strong authentication via Kerberos
    • HBase, Hive, HDFS basic auth
    • Encryption with SSL for NameNode, JobTracker, etc.
    • Wire encryption for Shuffle, HDFS & JDBC

    Phase 2 (Knox 0.4.0, HDP 2.1)
    • ACLs for HDFS
    • Knox: Hadoop REST API security
    • SQL-style Hive authorization
    • SSL support for HiveServer2
    • SSL for DN/NN UI & WebHDFS
    • PAM support for Hive

    Phase 3
    • Audit event correlation & audit viewer
    • Support for token-based authentication beyond Kerberos
    • Data encryption in HDFS, HBase & Hive
    • Knox for HDFS HA, Ambari & Falcon

