Security for Enterprise Hadoop
For many, Hadoop has been elevated to a business-critical data platform, and that elevation brings with it the requirement to secure it to enterprise standards.
Securing a Hadoop cluster today
Securing a system requires a redundant and layered approach. With the benefit of hundreds of production deployments running securely and at scale, best practices have emerged for securing Hadoop clusters. Let’s review some of the tools available across the four pillars of security below:
Authentication verifies the identity of a system or user accessing the system. Hadoop provides two modes of authentication. The first, simple or pseudo authentication, essentially places trust in a user’s assertion about who they are. The second, Kerberos, provides a fully secure Hadoop cluster. In line with best practice, Hadoop provides these capabilities while relying on widely accepted corporate user stores (such as LDAP or Active Directory), so that a single credential store can serve both Hadoop and existing systems.
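As an illustrative sketch of how the second mode is switched on (the property names below are part of core Apache Hadoop; the values shown are typical, not prescriptive), Kerberos authentication is enabled in core-site.xml:

```xml
<!-- core-site.xml: switch the cluster from "simple" to Kerberos authentication -->
<property>
  <name>hadoop.security.authentication</name>
  <!-- default is "simple"; "kerberos" enables fully secure mode -->
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <!-- also enforce service-level authorization checks -->
  <value>true</value>
</property>
```

With these set, every client and daemon must present valid Kerberos credentials before being admitted to the cluster.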
Authorization specifies access privileges for a user or system. Hadoop provides fine-grained authorization via file permissions in HDFS, resource-level access control (via ACLs) for MapReduce, and coarser-grained access control at the service level. For data, HBase provides authorization with ACLs on tables and column families, and Apache Accumulo extends this even further to cell-level control. Also, Apache Hive provides coarse-grained access control on tables.
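Two of these layers can be sketched in configuration (the properties are standard Hadoop settings; the user and group names are hypothetical):

```xml
<!-- hdfs-site.xml: enforce HDFS file permissions (this is the default) -->
<property>
  <name>dfs.permissions.enabled</name>
  <value>true</value>
</property>

<!-- hadoop-policy.xml: service-level ACL restricting which users/groups
     may connect to HDFS as clients at all.
     Value format: comma-separated users, a space, comma-separated groups. -->
<property>
  <name>security.client.protocol.acl</name>
  <value>hdfsusers analysts</value>
</property>
```

File permissions then apply per path on top of the service-level gate, in the familiar POSIX owner/group/other model.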
Accounting provides the ability to track resource use within a system. Within Hadoop, insight into usage and data access is critical for compliance and forensics. As part of core Apache Hadoop, HDFS and MapReduce provide base audit support. Additionally, the Apache Hive metastore records audit (who/when) information for Hive interactions. Finally, Apache Oozie, the workflow engine, provides an audit trail for its services.
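For example, the HDFS audit trail is emitted through a dedicated log4j logger. A minimal sketch (the logger name is part of core Hadoop; the appender setup follows the stock log4j.properties shipped with Hadoop, and the file path is a hypothetical placeholder):

```properties
# Route HDFS audit events (who accessed what, when) to a rolling file
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO,RFAAUDIT
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=/var/log/hadoop/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
```

Each line in the resulting log records the user, command, and path, which is the raw material for compliance reporting or forensic reconstruction.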
Data Protection ensures the privacy and confidentiality of information. Hadoop and HDP allow you to protect data in motion: HDP provides encryption for channels such as Remote Procedure Call (RPC), HTTP, JDBC/ODBC, and the Data Transfer Protocol (DTP). For data at rest, HDFS and Hadoop support operating-system-level encryption.
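Two of the in-motion channels can be sketched in configuration (both property names are standard Hadoop settings; treat the snippet as illustrative rather than a complete hardening guide):

```xml
<!-- core-site.xml: protect RPC traffic.
     "privacy" = authentication + integrity + encryption
     (the other levels are "authentication" and "integrity") -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt block data on the HDFS Data Transfer Protocol -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```

HTTP channels are protected separately by enabling SSL on the relevant web endpoints.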
Securing a Hadoop cluster tomorrow
As Hadoop evolves, so do the solutions that support enterprise security requirements. Much of the focus is centered on weaving the security frameworks together and making them simple to manage. To this end, we present a roadmap for enterprise security in Hadoop defined by the following goals.
Security in Hadoop starts with strong authentication. Owen O’Malley and the Hortonworks team wrote the original specification and delivered the core Kerberos functionality in Hadoop, augmenting this work with delegation tokens, capability-like access tokens, and a notion of trust for auxiliary services.
Continuing this leadership, the team at Hortonworks incubated the Apache Knox project in February 2013 to create a security perimeter for REST/HTTP access to Hadoop. In October 2013, version 0.3.0 of Apache Knox was released, and we anticipate the community will declare Knox generally available in Q1 2014.
In October 2013, wire encryption enhancements were released with HDP 2.0, and additional encryption enhancements across HDFS, Hive, and HBase are coming in 2014.
In Q1 of 2014, we plan to deliver SQL-style authorization for Hive and Access Control Lists (ACLs) for HDFS.
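Once delivered, both features are expected to follow familiar idioms. A hypothetical sketch of what usage might look like (table, user, and path names are invented for illustration):

```
# Hive: SQL-style authorization via GRANT/REVOKE
hive -e "GRANT SELECT ON TABLE sales TO USER alice;"
hive -e "REVOKE SELECT ON TABLE sales FROM USER alice;"

# HDFS: per-user ACL entries beyond the owner/group/other permission bits
hdfs dfs -setfacl -m user:alice:r-x /data/sales
hdfs dfs -getfacl /data/sales
```

The appeal of both is that they extend models administrators already know: SQL grants from relational databases, and POSIX-style ACLs from traditional filesystems.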
We are committed to publishing much more over the next few months, including a security reference architecture for Hadoop; in the meantime, you can tune your Hadoop security settings using some of these suggested tips:
- Strong authentication via Kerberos
- HBase and Hive authorization improvements
- Encryption with SSL for NameNode, JobTracker, etc.
- Wire encryption for Shuffle, HDFS Data Transfer and JDBC/ODBC access to Hive
- Perimeter security for Hadoop via Apache Knox
- SQL-style authorization for Hive (GRANT, REVOKE)
- Access Control List for HDFS
- SSL support for HiveServer2
- Pluggable Authentication Module (PAM) support for Hive
- Audit event correlation & audit viewer
- Support other token-based Authentication protocols (aka “NotOnlyKerberos”)
- Data encryption for HDFS, HBase & Hive