August 29, 2014

Deploying HTTPS in HDFS

Haohui Mai is a member of technical staff at Hortonworks in the HDFS group and a core Hadoop committer. In this blog, he explains how to set up HTTPS for HDFS in a Hadoop cluster.

1. Introduction

The HTTP protocol is one of the most widely used protocols on the Internet. Today, Hadoop clusters exchange internal data such as file system images, the quorum journals, and user data over HTTP. Because HTTP transfers data in clear text, an attacker able to tap into the network can put your valuable data at risk.

To protect your data, we have implemented full HTTPS support for HDFS in HDP 2.1 (thanks to Hortonworks’ Haohui Mai, Suresh Srinivas, and Jing Zhao). At a very high level, HTTPS is the HTTP protocol transported over Secure Sockets Layer (SSL/TLS), which prevents wiretapping and man-in-the-middle attacks. The rest of this blog post describes how HTTPS works, and how to set it up for HDFS in a Hadoop cluster.

2. Background

Figure 1 describes the basic HTTP / HTTPS communication workflow. In the figure, Alice and Bob want to exchange information. From Alice’s perspective, there are two types of security threats in the communication. First, a malicious third-party, Eve, can tap into the network and sniff all the data passing between Alice and Bob. Second, a malicious party like Charlie can pretend to be Bob and intercept all communication between Alice and Bob.

The HTTPS protocol addresses these security threats with two techniques. First, HTTPS encrypts all communication between Alice and Bob to prevent Eve from wiretapping it. Second, HTTPS requires the participants to prove their identities by presenting certificates. A certificate works like a government-issued passport: it includes the full name of the participant, the organization, and so on. A trusted certificate authority (CA) signs the certificate to ensure that it is authentic. In our example, Alice verifies the certificate from the remote participant to ensure that she is indeed talking to Bob.

Behind the scenes, public-key cryptography is the key technology that enables HTTPS. In public-key cryptography, each party has a paired private key and public key. Public-key cryptography has an interesting property: one can use either the public or the private key for encryption, but decrypting the data requires the other key of the pair. It is easy to see how this implements encryption. Moreover, public keys can serve as proof of identity because they are notarized by the CA. An in-depth technical introduction to public-key cryptography can be found here.
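As a small aside, you can see this property in action with openssl; the following illustration is not part of the HDFS setup, and the file names are arbitrary. A message encrypted with the public key can only be recovered with the matching private key:

$ openssl genrsa -out demo-private.pem 2048
$ openssl rsa -in demo-private.pem -pubout -out demo-public.pem
$ echo "secret" | openssl rsautl -encrypt -pubin -inkey demo-public.pem -out message.enc
$ openssl rsautl -decrypt -inkey demo-private.pem -in message.enc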

The next section describes how to deploy HTTPS in your Apache Hadoop cluster.


Figure 1: Basic workflow for HTTPS communication. Alice (left) and Bob (right) communicate through an insecure channel. The HTTPS protocol specifies how to secure the communication through cryptography to verify the identities of the participants (i.e., Alice is indeed talking to Bob) and to prevent wiretapping.

3. Deploying HTTPS in HDFS

3.1 Generating the key and the certificate for each machine

The first step of deploying HTTPS is to generate the key and the certificate for each machine in the cluster. You can use Java’s keytool utility to accomplish this task:
$ keytool -keystore {keystore} -alias localhost -validity {validity} -genkey

You need to specify two parameters in the above command:

  • keystore: the keystore file that stores the certificate. The keystore file also contains the private key of the certificate, so it must be kept secure.
  • validity: the number of days for which the certificate is valid.
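For example, on a host whose FQDN is namenode.example.com, the command might look like this (the keystore file name and the 365-day validity are illustrative choices):

$ keytool -keystore namenode.keystore.jks -alias localhost -validity 365 -genkey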

keytool then prompts for further details of the certificate.
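The interactive prompts look roughly like the following (keytool also asks for keystore and key passwords); the host name, organization, and location shown here are placeholder answers:

What is your first and last name?
  [Unknown]:  namenode.example.com
What is the name of your organizational unit?
  [Unknown]:  IT
What is the name of your organization?
  [Unknown]:  Example Corp
What is the name of your City or Locality?
  [Unknown]:  Palo Alto
What is the name of your State or Province?
  [Unknown]:  CA
What is the two-letter country code for this unit?
  [Unknown]:  US
Is CN=namenode.example.com, OU=IT, O=Example Corp, L=Palo Alto, ST=CA, C=US correct?
  [no]:  yes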

Ensure that the common name (CN) matches the fully qualified domain name (FQDN) of the server exactly. The client compares the CN with the server’s DNS name to ensure that it is indeed connecting to the desired server, not a malicious one.

3.2 Creating your own CA

After the first step, each machine in the cluster has a public-private key pair and a certificate that identifies the machine. The certificate, however, is unsigned, which means that an attacker can create such a certificate and pretend to be any machine.

Therefore, it is important to prevent forged certificates by signing the certificate of each machine in the cluster. A certificate authority (CA) is responsible for signing certificates. A CA works like a government that issues passports: the government stamps (signs) each passport so that the passport becomes difficult to forge, and other governments verify the stamps to ensure the passport is authentic. Similarly, the CA signs the certificates, and the cryptography guarantees that a signed certificate is computationally difficult to forge. Thus, as long as the CA is a genuine and trusted authority, clients have high assurance that they are connecting to authentic machines.

In this blog we use openssl to generate a new CA certificate.

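A typical invocation looks like the following (the output file names ca-key and ca-cert are placeholders; openssl will prompt for a passphrase to protect the CA key and for the CA’s identity fields):

$ openssl req -new -x509 -keyout ca-key -out ca-cert -days {validity}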

The generated CA is simply a public-private key pair and a certificate, and it is intended to sign other certificates.

The next step is to add the generated CA to the clients’ truststore so that the clients can trust this CA:

$ keytool -keystore {truststore} -alias CARoot -import -file {ca-cert}

In contrast to the keystore in step 3.1, which stores each machine’s own identity, the truststore of a client stores all the certificates that the client should trust. Importing a certificate into one’s truststore means trusting all certificates that are signed by that certificate. As in the analogy above, trusting the government (CA) also means trusting all passports (certificates) that it has issued. This attribute is called the chain of trust, and it is particularly useful when deploying HTTPS on a large Hadoop cluster. You can sign all certificates in the cluster with a single CA, and have all machines share the same truststore that trusts the CA. That way all machines can authenticate all other machines.
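For instance, if a single CA signs every certificate in the cluster, all machines can share one truststore file (the file name below is illustrative). You can import the CA certificate once and verify the truststore’s contents with keytool:

$ keytool -keystore cluster.truststore.jks -alias CARoot -import -file ca-cert
$ keytool -keystore cluster.truststore.jks -list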

3.3 Signing the certificate

The next step is to sign all certificates generated by step 3.1 with the CA generated in step 3.2. First, you need to export the certificate from the keystore:

$ keytool -keystore {keystore} -alias localhost -certreq -file {cert-file}

Then sign it with the CA:

$ openssl x509 -req -CA {ca-cert} -CAkey {ca-key} -in {cert-file} -out {cert-signed} -days {validity} -CAcreateserial -passin pass:{ca-password}

Finally, you need to import both the certificate of the CA and the signed certificate into the keystore:

$ keytool -keystore {keystore} -alias CARoot -import -file {ca-cert}
$ keytool -keystore {keystore} -alias localhost -import -file {cert-signed}

The definitions of the parameters are the following:

  • keystore: the location of the keystore
  • ca-cert: the certificate of the CA
  • ca-key: the private key of the CA
  • ca-password: the passphrase of the CA
  • cert-file: the exported, unsigned certificate of the server
  • cert-signed: the signed certificate of the server
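For concreteness, the whole signing flow for a single host might look like this (the file names, validity period, and CA password are illustrative placeholders):

$ keytool -keystore namenode.keystore.jks -alias localhost -certreq -file namenode.csr
$ openssl x509 -req -CA ca-cert -CAkey ca-key -in namenode.csr -out namenode-signed.crt -days 365 -CAcreateserial -passin pass:capassword
$ keytool -keystore namenode.keystore.jks -alias CARoot -import -file ca-cert
$ keytool -keystore namenode.keystore.jks -alias localhost -import -file namenode-signed.crt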

3.5 Configuring HDFS

The final step is to configure HDFS to use HTTPS. First, you need to specify dfs.http.policy in hdfs-site.xml to start the HTTPS server in the HDFS daemons.


<property>
<name>dfs.http.policy</name>
<value>HTTP_AND_HTTPS</value>
</property>

Three values are possible:

  • HTTP_ONLY: Only the HTTP server is started
  • HTTPS_ONLY: Only the HTTPS server is started
  • HTTP_AND_HTTPS: Both the HTTP and HTTPS servers are started

One thing worth noting is that WebHDFS over plain HTTP is no longer available when dfs.http.policy is set to HTTPS_ONLY. In this configuration you need to use WebHDFS over HTTPS (swebhdfs, contributed by Hortonworks’ Jing Zhao and me), which protects the data transferred through WebHDFS.
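For example, assuming the NameNode serves HTTPS on the default port 50470 and the client’s truststore (configured in ssl-client.xml) trusts the cluster CA, you can list a directory through swebhdfs like this (the host name is a placeholder):

$ hdfs dfs -ls swebhdfs://namenode.example.com:50470/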

Second, you need to edit ssl-server.xml and ssl-client.xml to tell HDFS about the keystore and the truststore.

ssl-server.xml


<property>
<name>ssl.server.keystore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.keystore.password</name>
<value><password of keystore></value>
</property>
<property>
<name>ssl.server.keystore.keypassword</name>
<value><password of the private key in the keystore></value>
</property>
<property>
<name>ssl.server.keystore.location</name>
<value><location of keystore.jks></value>
</property>
<property>
<name>ssl.server.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.truststore.location</name>
<value><location of truststore.jks></value>
</property>
<property>
<name>ssl.server.truststore.password</name>
<value><password of truststore></value>
</property>

ssl-client.xml


<property>
<name>ssl.client.truststore.password</name>
<value><password of truststore></value>
</property>

<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.client.truststore.location</name>
<value><location of truststore.jks></value>
</property>

The names of the configuration properties are self-explanatory. You can read more information about the configuration here. After restarting the HDFS daemons (NameNode, DataNode and JournalNode), you should have successfully deployed HTTPS in your HDFS cluster.
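As a quick sanity check, you can fetch a page from the NameNode’s HTTPS endpoint while validating it against your CA certificate (the host name is a placeholder, and the port assumes the default dfs.namenode.https-address):

$ curl --cacert ca-cert https://namenode.example.com:50470/jmx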

Closing Thoughts

Deploying HTTPS can improve the security of your Hadoop cluster. This blog describes how HTTPS works and how to set it up for HDFS in your Hadoop cluster. It is our mission to improve the security of Hadoop and to protect your valuable data.


Comments

  • To achieve this, can we ignore the SPNEGO-related and HTTP authentication filters, or do we need to configure them as well? I am asking because I ran into problems when I tried to set all of these up.

  • Hi,
    Is this topic similar to enabling SSL encryption (one of the methods of wire encryption) on HDP? If not, how is it different?

  • Thanks for this helpful article. Here are some suggestions on improvements.

    Step or Section 3.3 gives several examples of the keytool command with the -keystore option but none of them have a filename argument for the keystore file. I don’t think that works, does it? It should be -keystore {keystore}.

    It would also be helpful to know the existing or default keystore used by HDFS. This can be configured in ssl-server.xml, I believe. Section 3.5 mentions that users should edit the settings in ssl-server.xml and ssl-client.xml to point at the keystores created with the other commands, but users should be careful to look at any existing settings in those files and their corresponding keystores before changing them. I think it would be better to point users to find their existing settings in case they have already been configured or partially configured for SSL use.

    It would be useful to have some comments on the aliases used for the stored certs. If users can set their own aliases, then comments in this article may not be needed, but if the alias for a CA cert or a host cert need to follow a convention or names specified elsewhere, that should be mentioned. If both the CA cert and the host cert are in the same keystore, they must have different aliases and perhaps those aliases are used to find the different certs for their different roles.

    In Section 3.5, Configuring HDFS, you provide a sample of the ssl-server.xml file. This has an entry for ssl.server.keystore.keypassword but not one for
    ssl.server.keystore.password. Aren’t both needed for certs with encrypted keys?

  • Hello, I would like to ask whether there are any Hadoop version requirements for configuring SSL on Hadoop, and also whether configuring SSL and Kerberos together causes any conflict?
