Apache Hadoop is equipped with a robust and scalable security infrastructure. It is being used at some of the biggest cluster installations in the world, where hundreds of terabytes of sensitive and critical data are processed every day.
Owen O’Malley provided a nice overview of Apache Hadoop security in his blog post “Motivations for Apache Hadoop Security”. Devaraj Das also covered some of the core pieces of Apache Hadoop’s security architecture in his post “The Role of Delegation Tokens in Apache Hadoop Security”.
The intent of this blog is to cover some of the features of the Apache Hadoop security infrastructure that will help cluster administrators fine-tune the security settings of their clusters.
Security infrastructure for Hadoop RPC uses Java SASL APIs. Quality of Protection (QOP) settings can be used to enable encryption for Hadoop RPC protocols.
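Java SASL provides the following QOP settings:
“auth”: authentication only
“auth-int”: authentication plus integrity protection of the exchanged data
“auth-conf”: authentication, integrity protection, and confidentiality (encryption) of the exchanged data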
Hadoop lets cluster administrators control the quality of protection via the configuration parameter “hadoop.rpc.protection” in core-site.xml. The parameter is optional; if it is not present, the default QOP setting of “auth” is used, which means “authentication only”. The valid values for this parameter are:
“authentication”: corresponds to “auth”
“integrity”: corresponds to “auth-int”
“privacy”: corresponds to “auth-conf”
The default setting is kept as authentication only because integrity checks and encryption have a cost in terms of performance.
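The relationship between the Hadoop parameter and the underlying SASL setting can be sketched as a simple lookup. The snippet below is an illustration only, not Hadoop's own code: the function name `sasl_qop_for` is hypothetical, and the case-insensitive handling of the value is an assumption made for convenience.

```python
# Illustrative lookup from hadoop.rpc.protection values to SASL QOP strings.
# The function name and the lenient (case-insensitive) parsing are assumptions.
RPC_PROTECTION_TO_QOP = {
    "authentication": "auth",       # authentication only (the default)
    "integrity": "auth-int",        # authentication + integrity checks
    "privacy": "auth-conf",         # authentication + integrity + encryption
}


def sasl_qop_for(rpc_protection=None):
    """Return the SASL QOP string for a hadoop.rpc.protection value.

    A missing value falls back to "authentication", mirroring the
    documented default of authentication only.
    """
    value = (rpc_protection or "authentication").strip().lower()
    if value not in RPC_PROTECTION_TO_QOP:
        raise ValueError("unknown hadoop.rpc.protection value: %r" % rpc_protection)
    return RPC_PROTECTION_TO_QOP[value]
```

For example, `sasl_qop_for("privacy")` yields the QOP that enables encryption, while calling it with no argument yields the authentication-only default.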
Each Apache Hadoop daemon process (Datanode, Namenode, Tasktracker, Jobtracker) in a secure Hadoop installation has its own Kerberos principal. For example, a datanode principal could look like datanode/datanode-hostname@realm. It is common practice to put the hostname in the middle because it makes the principal name unique for each datanode or tasktracker. There are two main reasons why it is important to use unique principal names.
However, putting the hostname in the principal means that the datanode principal must be configured separately for each datanode in the cluster, which could mean several hundred machines. Hadoop simplifies this: in hdfs-site.xml (or mapred-site.xml for tasktrackers), the principal can be specified with the string _HOST in place of the hostname. The datanode principal in the example above can be written as datanode/_HOST@realm in the configuration file. Note that the actual principal is still datanode/datanode-hostname@realm; _HOST is only a placeholder for datanode-hostname, which Hadoop replaces with the appropriate hostname wherever needed. Thus, every datanode has the same value for dfs.datanode.kerberos.principal in its configuration even though the principals themselves differ.
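The _HOST substitution can be pictured with a small sketch. This is not Hadoop's implementation: the function name `resolve_principal` is hypothetical, and the choices to use the machine's fully qualified name and to lowercase it are assumptions made for this illustration.

```python
import socket

HOST_PLACEHOLDER = "_HOST"


def resolve_principal(principal_pattern, hostname=None):
    """Replace a _HOST instance component with a concrete hostname.

    principal_pattern looks like "service/_HOST@REALM". Patterns that do
    not use the placeholder are returned unchanged.
    """
    if hostname is None:
        # Assumption: fall back to this machine's fully qualified name.
        hostname = socket.getfqdn()
    name, separator, realm = principal_pattern.partition("@")
    components = name.split("/")
    if len(components) == 2 and components[1] == HOST_PLACEHOLDER:
        # Assumption: hostnames in principals are written in lowercase.
        components[1] = hostname.lower()
    return "/".join(components) + separator + realm
```

With this sketch, the same configured value `datanode/_HOST@realm` resolves to a different principal on each machine, which is the effect described above.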
Hadoop uses group memberships of users at various places, such as to determine group ownership for files or for access control. A user is mapped to the groups it belongs to using an implementation of the GroupMappingServiceProvider interface. The implementation is pluggable and can be configured in core-site.xml.
Hadoop by default uses ShellBasedUnixGroupsMapping, which is an implementation of GroupMappingServiceProvider. It fetches the group membership for a user name by executing a UNIX shell command.
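The idea of a shell-based group lookup can be sketched as follows. This is an illustration rather than the code of ShellBasedUnixGroupsMapping: the function names are hypothetical, and the use of the `id -Gn` command is an assumption about which UNIX command is run.

```python
import subprocess


def parse_groups(output):
    """Split the whitespace-separated group names printed by the command."""
    return output.split()


def get_groups(user, run_command=None):
    """Return the group names for a user, or an empty list on failure.

    run_command is injectable so that the lookup can be exercised without
    touching the real system; by default it executes the shell command.
    """
    if run_command is None:
        run_command = lambda argv: subprocess.check_output(argv, text=True)
    try:
        # Assumption: "id -Gn <user>" prints all group names on one line.
        output = run_command(["id", "-Gn", user])
    except (subprocess.CalledProcessError, OSError):
        # Unknown user or missing command: report no groups.
        return []
    return parse_groups(output)
```

Because the lookup is keyed on a UNIX user name, it only succeeds when the name passed in actually exists on the host.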
In secure clusters, since the user names are actually Kerberos principals, ShellBasedUnixGroupsMapping will work only if the Kerberos principals map to valid UNIX user names.
Hadoop provides a feature that lets administrators specify mapping rules to map a Kerberos principal to a local UNIX user name.
The rules are specified in core-site.xml with configuration key “hadoop.security.auth_to_local”. For example:
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@YOUR.REALM)s/@.*//
    RULE:[2:$1@$0](hdfs@.*YOUR.REALM)s/.*/hdfs/
    DEFAULT
  </value>
</property>
The rest of this section explains how these rules are interpreted and specified.
The default rule is simply “DEFAULT”, which maps every principal in your default realm to its first component. For example, if your default realm is APACHE.ORG, both “username@APACHE.ORG” and “username/admin@APACHE.ORG” are mapped to “username”.
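The behaviour of the default rule can be sketched in a few lines. This is a simplified illustration, not Hadoop's code: the function name `apply_default_rule` is hypothetical, and treating a principal without a realm as belonging to the default realm is an assumption.

```python
def apply_default_rule(principal, default_realm):
    """Map a principal in the default realm to its first component.

    Returns None when the principal belongs to another realm, meaning the
    default rule does not apply to it.
    """
    name, _, realm = principal.partition("@")
    # Assumption: a principal with no realm is treated as the default realm.
    if realm and realm != default_realm:
        return None
    return name.split("/")[0]
```

With a default realm of APACHE.ORG, both “username@APACHE.ORG” and “username/admin@APACHE.ORG” come out as “username”, while “joe@SOME.DOMAIN” is left for other rules to handle.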
The translation rules have three sections: base, filter, and substitution.
The base consists of the number of components in the principal name (excluding the realm), followed by a colon and the pattern for building the generated string from the parts of the principal name. In the pattern, $0 stands for the realm, $1 for the first component, and $2 for the second component. For example:
[1:$1@$0] translates “username@APACHE.ORG” to “username@APACHE.ORG”
[2:$1] translates “username/admin@APACHE.ORG” to “username”
[2:$1%$2] translates “username/admin@APACHE.ORG” to “username%admin”
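The three base examples above can be reproduced with a short sketch. It is a simplified model, not Hadoop's parser: the function names are hypothetical, and returning None when the component count differs is an assumption about how a non-applicable rule is signalled.

```python
import re


def parse_principal(principal):
    """Split "first/second@REALM" into (list of components, realm)."""
    name, _, realm = principal.partition("@")
    return name.split("/"), realm


def apply_base(base, principal):
    """Build the generated string for a base such as "[2:$1%$2]".

    Returns None when the principal does not have the number of
    components that the base asks for.
    """
    match = re.fullmatch(r"\[(\d+):(.*)\]", base)
    if match is None:
        raise ValueError("malformed base: %r" % base)
    expected_count, pattern = int(match.group(1)), match.group(2)
    components, realm = parse_principal(principal)
    if len(components) != expected_count:
        return None
    # Index 0 is the realm ($0); indexes 1.. are the components ($1, $2, ...).
    values = [realm] + components
    return re.sub(r"\$(\d)", lambda m: values[int(m.group(1))], pattern)
```

Calling `apply_base("[2:$1%$2]", "username/admin@APACHE.ORG")` gives “username%admin”, matching the last example above.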
The filter is a regex in parentheses that must match the generated string for the rule to apply. For example:
“(.*%admin)” will take any string that ends in “%admin”
“(.*@SOME.DOMAIN)” will take any string that ends in “@SOME.DOMAIN”
Finally, the substitution is a sed-style rule that replaces text matching a regex with a fixed string. For example:
“s/@ACME\.COM//” removes the first instance of “@ACME.COM”.
“s/@[A-Z]*\.COM//” removes the first instance of “@” followed by a name followed by “.COM”.
“s/X/Y/g” replaces all of the “X” in the name with “Y”
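The filter and substitution steps can be sketched together, taking the string generated by the base as input. This is an illustration under stated assumptions, not Hadoop's implementation: the function name is hypothetical, the filter is assumed to have to match the whole generated string, and the replacement is treated as a literal fixed string.

```python
import re

# Matches "s/<regex>/<replacement>/" with an optional trailing "g" flag.
SED_RULE = re.compile(r"s/((?:[^/\\]|\\.)*)/((?:[^/\\]|\\.)*)/(g?)")


def apply_filter_and_substitution(generated, filter_regex, sed_rule):
    """Apply a filter and a sed-style substitution to a generated string.

    Returns None when the filter does not match, meaning the rule does
    not apply and the next rule should be tried.
    """
    # Assumption: the filter must match the entire generated string.
    if re.fullmatch(filter_regex, generated) is None:
        return None
    parsed = SED_RULE.fullmatch(sed_rule)
    if parsed is None:
        raise ValueError("malformed substitution: %r" % sed_rule)
    pattern, replacement, flag = parsed.groups()
    # Without "g" only the first match is replaced; count=0 means all.
    count = 0 if flag == "g" else 1
    # A function replacement keeps the replacement text literal.
    return re.sub(pattern, lambda _m: replacement, generated, count=count)
```

For instance, the generated string “joe@SOME.DOMAIN” with the filter `.*@SOME.DOMAIN` and the substitution `s/@.*//` comes out as “joe”.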
So, if your default realm was APACHE.ORG, but you also wanted to take all principals from SOME.DOMAIN that had a single component, such as “joe@SOME.DOMAIN”, you would use:
RULE:[1:$1@$0](.*@SOME.DOMAIN)s/@.*//
DEFAULT
To also translate the names with a second component, you would make the rules:
RULE:[1:$1@$0](.*@SOME.DOMAIN)s/@.*//
RULE:[2:$1@$0](.*@SOME.DOMAIN)s/@.*//
DEFAULT
If you want to treat all principals from APACHE.ORG with /admin as “admin”, your rules would look like:
RULE:[2:$1%$2@$0](.*%admin@APACHE.ORG)s/.*/admin/
DEFAULT
Apache Hadoop security was a collaborative effort of a team of engineers. I credit the content of this article to their outstanding work and extend my special thanks to Owen for the detailed explanation of auth_to_local rules and to Devaraj for his valuable suggestions.
— Jitendra Pandey