The Role of Delegation Tokens in Apache Hadoop Security
Delegation tokens play a critical part in Apache Hadoop security, and understanding their design and use is important for comprehending Hadoop’s security model.
Authentication in Apache Hadoop
Apache Hadoop provides strong authentication for HDFS data. All HDFS accesses must be authenticated:
1. Access from users logged in on cluster gateways
2. Access from any other service or daemon (e.g. HCatalog server)
3. Access from MapReduce tasks
Hadoop relies on Kerberos, a three-party authentication protocol, to do the authentication for #1 and #2 above. Users and services use their Kerberos credentials to talk securely with the NameNode. For #3, one of the conscious decisions was not to use Kerberos. I won’t cover all of the reasons, but they had to do with secure delegation of credentials, performance and reliability. A technical report on security design in Apache Hadoop will be published soon with more detail.
Delegation token authentication is a two-party authentication protocol based on Java SASL Digest-MD5. The tokens are obtained during job submission and passed to the JobTracker as part of that submission. The typical steps are:
1. User authenticates herself to the JobTracker using Kerberos.
2. User authenticates herself (using Kerberos) to the NameNode(s) that the tasks would interact with at runtime. User then gets a delegation token from each of the NameNodes.
3. User passes the tokens to the JobTracker as part of the job submission.
All TaskTrackers running the job’s tasks get a copy of the tokens (via an HDFS location that is private to the user that the MapReduce daemons run as). The tokens are written to a file in a private area (visible to the job-owner user) on the TaskTracker machine.
As part of launching the task, the TaskTracker exports the location of the token file as an environment variable. The task process loads the tokens in memory (the file is read as a part of the static initialization of the UserGroupInformation class). This information is useful for the RPC client.
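To make the two-party idea concrete, here is a minimal sketch of how a token's secret can be derived and checked by the issuing server alone, with no third party involved. It assumes the token password is an HMAC of the token identifier under a server-side master key. The class name, method names and identifier format are hypothetical, and the direct password comparison stands in for the Digest-MD5 challenge-response, in which the password itself is never sent.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch; not Hadoop's actual classes.
class TokenSecretSketch {
    // Master key known only to the issuing server (for example a NameNode).
    private final SecretKeySpec masterKey;

    TokenSecretSketch(byte[] masterKeyBytes) {
        this.masterKey = new SecretKeySpec(masterKeyBytes, "HmacSHA1");
    }

    // Assumption: password = HMAC-SHA1(masterKey, identifier).
    byte[] createPassword(byte[] identifier) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(masterKey);
            return mac.doFinal(identifier);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // The server recomputes the password from the identifier, so it needs
    // no contact with a third party to validate a token.
    // Simplification: a real digest exchange proves knowledge of the
    // password without transmitting it.
    boolean verify(byte[] identifier, byte[] presentedPassword) {
        return MessageDigest.isEqual(createPassword(identifier), presentedPassword);
    }
}
```

Because only the issuer and the token holder take part, validation is a local computation on the server.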
In the mode where security is enabled, the Apache Hadoop RPC client can talk securely with a server using either tokens or Kerberos. The RPC client is programmed in such a way that if a token exists for a service, it will be used for secure communication. If a token doesn’t exist, Kerberos is used.
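The fallback rule just described can be sketched in a few lines. This is an illustration only: the class, the enum and the in-memory map are hypothetical stand-ins, not the real RPC client code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "token first, Kerberos otherwise" rule.
class AuthChooser {
    enum AuthMethod { TOKEN, KERBEROS }

    // Assumed in-memory credential store: service name -> token bytes.
    private final Map<String, byte[]> tokensByService = new HashMap<>();

    void addToken(String service, byte[] token) {
        tokensByService.put(service, token);
    }

    // Use a token when one exists for the target service; fall back to Kerberos.
    AuthMethod choose(String service) {
        return tokensByService.containsKey(service)
                ? AuthMethod.TOKEN
                : AuthMethod.KERBEROS;
    }
}
```

A task that loaded a token for one NameNode would authenticate to it with the token, and would fall back to Kerberos for any service it holds no token for.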
Any Apache Hadoop process that is launched from the task (for example, a Hadoop Streaming process running a standard HDFS CLI command) can get access to those tokens since the environment variable is also visible to the child processes. Using the tokens, these processes can transparently authenticate themselves with the NameNodes (since the CLI implementation uses the standard security aware UserGroupInformation and RPC client classes).
Lifecycle of a Token
A token has a current life, and a maximum renewable life (similar to Kerberos tickets). By default, tokens must be renewed once every 24 hours for up to 7 days. Tokens can also be cancelled explicitly.
The NameNode permits only designated renewers to renew a token. For MapReduce jobs, the renewer is specified as the JobTracker. The JobTracker keeps track of the tokens’ lifetimes and, when they are about to expire, renews them with the respective NameNodes. When a job is done, the JobTracker cancels the tokens associated with the job.
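The lifecycle rules can be modelled roughly as follows. This is a sketch under stated assumptions: the class and its methods are hypothetical, the 24-hour and 7-day values are the defaults mentioned above, and the real server-side bookkeeping is more involved.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of a token's lifetime; not Hadoop's actual classes.
class TokenLifecycleSketch {
    static final Duration RENEW_INTERVAL = Duration.ofHours(24); // default current life
    static final Duration MAX_LIFETIME = Duration.ofDays(7);     // default maximum life

    private final String renewer;
    private final Instant maxDate;
    private Instant expiry;
    private boolean cancelled;

    TokenLifecycleSketch(Instant issueDate, String renewer) {
        this.renewer = renewer;
        this.maxDate = issueDate.plus(MAX_LIFETIME);
        this.expiry = issueDate.plus(RENEW_INTERVAL);
    }

    // Only the designated renewer may extend the token, and never past maxDate.
    Instant renew(String caller, Instant now) {
        if (!renewer.equals(caller)) {
            throw new SecurityException(caller + " is not the designated renewer");
        }
        if (cancelled || now.isAfter(expiry)) {
            throw new IllegalStateException("token is cancelled or expired");
        }
        Instant next = now.plus(RENEW_INTERVAL);
        expiry = next.isAfter(maxDate) ? maxDate : next;
        return expiry;
    }

    // Explicit cancellation, for example when the job finishes.
    void cancel() {
        cancelled = true;
    }

    boolean isValid(Instant now) {
        return !cancelled && !now.isAfter(expiry);
    }
}
```

In this model a renewer that renews shortly before each expiry keeps the token alive until the 7-day cap, after which a new token has to be obtained.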
Other Uses of Tokens
Tokens have uses beyond HDFS. The token abstractions are fairly open, so the token-related classes can easily be reused and overridden for other purposes. For example, the token selector can be defined on a per-token-type basis. A selector is associated with an RPC protocol via an annotation (as an example, ClientProtocol is annotated with DelegationTokenSelector). The selector usually has logic that looks at all the passed tokens and returns the one that is meant for the protocol/service in question. Every token contains information to indicate its intended service (IP-address:port, or some application-defined data appropriate to the context in which the token is used).
One use of delegation tokens outside HDFS is the MapReduce delegation token. For jobs that in turn submit other jobs (for example, the launcher job in Oozie), the tasks of the first job need to talk securely with the JobTracker. Oozie uses MapReduce delegation tokens for this authentication. Before the launcher job is submitted, a request is made to the JobTracker for a MapReduce delegation token, and that token is added to the list of tokens passed to the JobTracker during the launcher job submission.
Another use case is HCatalog. MapReduce tasks that use the HCatalog service require tokens (issued by the HCatalog server during job submission) to talk securely with the HCatalog server.
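A selector of the kind described above might look like the following sketch. The types here are simplified, hypothetical stand-ins (a token reduced to a kind and a service string), and the annotation-based wiring to an RPC protocol is left out.

```java
import java.util.Collection;
import java.util.Optional;

// Hypothetical, simplified selector; not Hadoop's actual interfaces.
class SelectorSketch {
    // A token reduced to the two fields a selector typically inspects.
    static final class SimpleToken {
        final String kind;
        final String service; // for example "IP-address:port"

        SimpleToken(String kind, String service) {
            this.kind = kind;
            this.service = service;
        }
    }

    interface SimpleTokenSelector {
        Optional<SimpleToken> select(String service, Collection<SimpleToken> tokens);
    }

    // Builds a selector for one token type: it scans all passed tokens and
    // returns the first whose kind and intended service both match.
    static SimpleTokenSelector forKind(String kind) {
        return (service, tokens) -> tokens.stream()
                .filter(t -> t.kind.equals(kind) && t.service.equals(service))
                .findFirst();
    }
}
```

Keeping the selector per token type is what lets other services plug in their own token kinds without changing the RPC client.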
Some Notes on Issuing Delegation Tokens
The delegation tokens should be issued on Kerberos authenticated channels only. The problem that we are trying to avert is this: if a delegation token is compromised, the compromised delegation token can then be used to get more delegation tokens. Plus, the malicious user can stay connected to the server for a long period of time.
For services such as Oozie that act on behalf of other users, more security checks are in place in Apache Hadoop. An Oozie request for a delegation token is honored only when Oozie asks for a delegation token for a user belonging to one of a certain set of user groups. Furthermore, the Hadoop service checks whether the request for the delegation token came from a host that it trusts. Both of these can be configured in Apache Hadoop.
Delegation tokens are a fundamental building block of the Apache Hadoop security design and architecture. The design of the delegation token abstractions defined in Apache Hadoop has proven to be robust and easily extensible to use cases such as HDFS, MapReduce and HCatalog.
Using delegation tokens in the core Hadoop infrastructure has significantly reduced Kerberos traffic, which puts less load on the Kerberos infrastructure. It has also kept the performance regressions of MapReduce jobs under control: token-based authentication, which accounts for the majority of the authentication in MapReduce jobs, is a much simpler two-party exchange than the three-party authentication found in Kerberos.
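The two checks can be sketched as below. The class and method names are hypothetical; in a Hadoop deployment the allowed groups and trusted hosts are commonly set through the `hadoop.proxyuser.<name>.groups` and `hadoop.proxyuser.<name>.hosts` properties rather than passed in code.

```java
import java.util.Set;

// Hypothetical sketch of the group and host checks for a proxying service.
class ProxyUserCheckSketch {
    private final Set<String> allowedGroups; // groups whose members may be impersonated
    private final Set<String> trustedHosts;  // hosts the service may connect from

    ProxyUserCheckSketch(Set<String> allowedGroups, Set<String> trustedHosts) {
        this.allowedGroups = allowedGroups;
        this.trustedHosts = trustedHosts;
    }

    // Honor the request only if the target user belongs to an allowed group
    // AND the request originates from a trusted host.
    boolean mayRequestTokenFor(Set<String> groupsOfTargetUser, String requestHost) {
        boolean groupOk = groupsOfTargetUser.stream().anyMatch(allowedGroups::contains);
        boolean hostOk = trustedHosts.contains(requestHost);
        return groupOk && hostOk;
    }
}
```

Requiring both conditions means that a compromised service account is of limited use from an untrusted machine, or for users outside the configured groups.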
I should note that Kan Zhang, who was a part of the original Apache Hadoop security development team, was responsible for coming up with the delegation token concept.
— Devaraj Das