Encryption is applied to electronic information to ensure its privacy and confidentiality. Typically, we think of protecting data at rest or in motion. Wire encryption protects the latter, as data moves through Hadoop over RPC, HTTP, Data Transfer Protocol (DTP), and JDBC.
Let’s cover the configuration required to encrypt each of these protocols. For step-by-step instructions, see the HDP 2.0 documentation.
RPC Encryption
The most common way for a client to interact with a Hadoop cluster is through RPC. A client connects to a NameNode (NN) over the RPC protocol to read or write a file. RPC connections in Hadoop use Java’s Simple Authentication & Security Layer (SASL), which supports encryption. When the hadoop.rpc.protection property is set to privacy, the data sent over RPC is encrypted with symmetric keys. Please refer to this post for more details on the hadoop.rpc.protection setting.
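As a minimal sketch, the setting could look like the fragment below. Placing it in core-site.xml is an assumption here, since the text above does not name the file. The other values this property accepts are authentication and integrity.

[xml]
<!-- core-site.xml (assumed location) -->
<!-- "privacy" asks SASL for authentication, integrity checks, and encryption on RPC -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>
[/xml]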
Data Transfer Protocol
The NN gives the client the address of the first DataNode (DN) to read or write the block. The actual data transfer between the client and a DN is over Data Transfer Protocol. To encrypt data transfer you must set dfs.encrypt.data.transfer=true on the NN and all DNs. The encryption algorithm can be customized by setting dfs.encrypt.data.transfer.algorithm to either 3des or rc4. If nothing is set, the system default is used (usually 3DES). While 3DES is more cryptographically secure, RC4 is substantially faster.
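The two settings described above might be combined as in the following sketch. Putting them in hdfs-site.xml is an assumption based on the dfs. prefix, and the choice of 3des is only illustrative.

[xml]
<!-- hdfs-site.xml (assumed location), applied on the NN and every DN -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<!-- Optional: pick the cipher explicitly; 3des or rc4 -->
<property>
  <name>dfs.encrypt.data.transfer.algorithm</name>
  <value>3des</value>
</property>
[/xml]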
HTTPS Encryption
Encryption over the HTTP protocol is implemented with support for SSL across a Hadoop cluster. For example, to enable the NN UI to listen for HTTP over SSL, you must configure SSL on the NN and all the DNs by setting dfs.https.enable=true in hdfs-site.xml. Typically SSL is configured to authenticate only the server; this is called 1-way SSL. SSL can also be configured to authenticate the client; this is called mutual authentication or 2-way SSL. To configure 2-way SSL, set dfs.client.https.need-auth=true in hdfs-site.xml on each NN and DN. For 1-way SSL, only the keystore needs to be configured on the NN and DN. The keystore and truststore configuration goes in the ssl-server.xml and ssl-client.xml files on the NN and each DN. The truststore configuration is only needed when using a self-signed certificate or a certificate that is not in the JVM’s truststore.
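A minimal sketch of the two hdfs-site.xml switches named above follows. It assumes a cluster that wants 2-way SSL; for 1-way SSL the second property would stay false.

[xml]
<!-- hdfs-site.xml on the NN and every DN -->
<!-- Serve the web UIs over SSL -->
<property>
  <name>dfs.https.enable</name>
  <value>true</value>
</property>
<!-- Set to true only for 2-way SSL (client certificates required) -->
<property>
  <name>dfs.client.https.need-auth</name>
  <value>true</value>
</property>
[/xml]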
The following configuration properties need to be specified in ssl-server.xml.
Property | Default Value | Description
ssl.server.keystore.type | JKS | The type of the keystore; JKS = Java Keystore, the de facto standard in Java
ssl.server.keystore.location | None | The location of the keystore file
ssl.server.keystore.password | None | The password to open the keystore file
ssl.server.truststore.type | JKS | The type of the truststore
ssl.server.truststore.location | None | The location of the truststore file
ssl.server.truststore.password | None | The password to open the truststore
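For illustration, an ssl-server.xml fragment using the properties in the table might look like the sketch below. The file paths and passwords are hypothetical placeholders, not values from any real cluster, and the truststore entries matter only in the cases described earlier.

[xml]
<!-- ssl-server.xml: all paths and passwords below are hypothetical placeholders -->
<property>
  <name>ssl.server.keystore.type</name>
  <value>jks</value>
</property>
<property>
  <name>ssl.server.keystore.location</name>
  <value>/etc/security/serverKeys/keystore.jks</value>
</property>
<property>
  <name>ssl.server.keystore.password</name>
  <value>changeit</value>
</property>
<!-- Truststore: needed for 2-way SSL or for certificates the JVM does not already trust -->
<property>
  <name>ssl.server.truststore.type</name>
  <value>jks</value>
</property>
<property>
  <name>ssl.server.truststore.location</name>
  <value>/etc/security/serverKeys/truststore.jks</value>
</property>
<property>
  <name>ssl.server.truststore.password</name>
  <value>changeit</value>
</property>
[/xml]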
Encryption during Shuffle
Starting with HDP 2.0, encryption during shuffle is supported.
Data moves between the Mappers and the Reducers over the HTTP protocol; this step is called shuffle. The Reducer initiates the connection to the Mapper to ask for data and acts as the SSL client. Enabling HTTPS to encrypt shuffle traffic involves the following steps.
- Set mapreduce.shuffle.ssl.enabled to true in mapred-site.xml
- Set the keystore properties and, optionally, the truststore properties (for 2-way SSL) listed in the table above.
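The first step can be sketched as a single property. This fragment is only an illustration of the switch named in the list above.

[xml]
<!-- mapred-site.xml: turn on SSL for shuffle traffic between Mappers and Reducers -->
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>
[/xml]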
Here is an example configuration from mapred-site.xml:
[xml]
<!-- Enable SSL for Hadoop's HTTP endpoints -->
<property>
  <name>hadoop.ssl.enabled</name>
  <value>true</value>
</property>
<!-- false = 1-way SSL; set to true to require client certificates (2-way SSL) -->
<property>
  <name>hadoop.ssl.require.client.cert</name>
  <value>false</value>
  <final>true</final>
</property>
<property>
  <name>hadoop.ssl.hostname.verifier</name>
  <value>DEFAULT</value>
  <final>true</final>
</property>
<property>
  <name>hadoop.ssl.keystores.factory.class</name>
  <value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
  <final>true</final>
</property>
<!-- Files that hold the server-side and client-side keystore/truststore settings -->
<property>
  <name>hadoop.ssl.server.conf</name>
  <value>ssl-server.xml</value>
  <final>true</final>
</property>
<property>
  <name>hadoop.ssl.client.conf</name>
  <value>ssl-client.xml</value>
  <final>true</final>
</property>
[/xml]
The above configuration refers to ssl-server.xml and ssl-client.xml. These files contain the properties specified in the table above. Make sure to put ssl-server.xml and ssl-client.xml in the default ${HADOOP_CONF_DIR}.
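The table earlier lists the ssl.server.* names. On the client side, ssl-client.xml uses the ssl.client.* counterparts. A minimal sketch for a Reducer that only needs to trust the server certificate (1-way SSL) might look like this; the path and password are hypothetical placeholders.

[xml]
<!-- ssl-client.xml: path and password below are hypothetical placeholders -->
<property>
  <name>ssl.client.truststore.type</name>
  <value>jks</value>
</property>
<property>
  <name>ssl.client.truststore.location</name>
  <value>/etc/security/clientKeys/truststore.jks</value>
</property>
<property>
  <name>ssl.client.truststore.password</name>
  <value>changeit</value>
</property>
[/xml]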
JDBC
HiveServer2 implements encryption with the Java SASL protocol’s quality of protection (QOP) setting. With this, the data moving between HiveServer2 and a JDBC client can be encrypted. On HiveServer2, set hive.server2.thrift.sasl.qop in hive-site.xml, and on the JDBC client specify sasl.qop as part of the JDBC Hive connection string, e.g. jdbc:hive2://hostname/dbname;sasl.qop=auth-int. The SASL QOP levels are auth (authentication only), auth-int (adds integrity checking), and auth-conf (adds encryption), so use auth-conf when the traffic must be encrypted. HIVE-4911 provides more details on this enhancement.
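On the server side, the setting might look like the sketch below. The value auth-conf is chosen here as an assumption about what the cluster wants: it is the SASL level that adds confidentiality (encryption) on top of authentication and integrity, and the client's sasl.qop would need to match it.

[xml]
<!-- hive-site.xml on HiveServer2 -->
<!-- auth-conf = authentication + integrity + encryption -->
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth-conf</value>
</property>
[/xml]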
Closing Thoughts
Ensuring the confidentiality of data flowing in and out of a Hadoop cluster requires configuring encryption on each channel used to move the data. This post describes the encryption configuration required for each of those channels.
Please send me any comments about this post and any topic you would like me to cover. Stay tuned for the next post about authorization in Hadoop. And you can stay up-to-date on Security innovation in Hadoop via our Labs Page.