Securing your Hadoop Infrastructure with Apache Knox
In this tutorial we will walk through the process of:
- Configuring Apache Knox and LDAP services on the HDP Sandbox
- Running a MapReduce program through the Apache Knox Gateway
What is Apache Knox?
The Apache Knox Gateway is a system that provides a single point of authentication and access for Apache™ Hadoop® services. It provides the following features:
- Single REST API Access Point
- Centralized authentication, authorization and audit for Hadoop REST/HTTP services
- LDAP/AD Authentication, Service Authorization and Audit
- Eliminates SSH edge node risks
- Hides Network Topology
Prerequisite: a working HDP cluster – the easiest way to get one is to download the Hortonworks Sandbox.
Layers of Defense for a Hadoop Cluster
- Perimeter Level Security – Network Security, Apache Knox (gateway)
- Authentication: Kerberos
- OS Security: encryption of data on the network and in HDFS
Apache Knox accesses the Hadoop cluster over HTTP/HTTPS.
Current Features of Apache Knox
- Authentication: via LDAP or a cloud SSO provider
- Provides services for Hive, HBase, HDFS, Oozie, and HCatalog
- HTTP access for Hive over JDBC is supported (ODBC driver support is planned for the future)
Installation and Setup:
HDP Sandbox 2.1 comes with Apache Knox installed.
You can run the following to check whether any Knox processes are running:
ps -ef | grep knox
You will not see any Knox process running yet, since the gateway has not been started.
You can inspect the Knox libraries, configuration files, and other artifacts in the following directory:
ls -ltr /usr/lib/knox
Let’s see if LDAP processes are up and running.
ps -ef | grep ldap
You can start the LDAP and Gateway servers with the scripts shipped under the Knox home directory, as shown in the sketch below.
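A minimal sketch, assuming the Knox home directory on the Sandbox is /usr/lib/knox and using the start mode of the same scripts that the stop commands below rely on:
cd /usr/lib/knox
sudo -u knox bin/ldap.sh start
sudo -u knox bin/gateway.sh start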
Here is another way of starting the LDAP and Gateway servers (run from the Knox home directory):
java -jar bin/ldap.jar conf &
java -jar bin/gateway.jar &
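To confirm that both servers came up, you can re-run the process check and probe the gateway port (a quick sanity check, not part of the original setup steps):
ps -ef | grep knox       # should now show the ldap.jar and gateway.jar processes
netstat -tlnp | grep 8443   # the gateway listens on port 8443 by default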
If you want to stop these services, you could use the following commands:
sudo -u knox bin/gateway.sh stop
sudo -u knox bin/ldap.sh stop
With installation, a user named knox is created; the gateway and LDAP services are meant to run as this user, which is why the stop commands above use sudo -u knox.
Let’s check whether the Hadoop cluster is accessible directly via WebHDFS (bypassing Knox).
curl -iku guest:guest-password -X GET 'http://sandbox:50070/webhdfs/v1/?op=LISTSTATUS'
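As an aside (not from the original tutorial): on an unsecured cluster, WebHDFS itself authenticates via the user.name query parameter rather than HTTP basic auth, so an equivalent direct call would be:
curl -i -X GET 'http://sandbox:50070/webhdfs/v1/?op=LISTSTATUS&user.name=guest'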
Now let’s check whether we can access the Hadoop cluster via the Apache Knox gateway. In the curl flags, -i includes the response headers, -k accepts the gateway's self-signed SSL certificate, and -u supplies the LDAP credentials of the demo guest user.
curl -iku guest:guest-password -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS'
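The response body is JSON; if you want it pretty-printed, you can pipe it through Python's built-in JSON tool (a convenience, assuming Python is available on the Sandbox, which it is by default):
curl -sku guest:guest-password 'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS' | python -m json.tool
Note that -s replaces -i here, since the response headers would otherwise break the JSON parsing.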
Let’s work on an end-to-end implementation use case using the Apache Knox service. We will take a simple example you may already be familiar with: the WordCount MapReduce program. We will first create the needed directories, upload the data file into HDFS, and upload the MapReduce jar file into HDFS. Once these steps are done, we will run the jar through the Apache Knox service and process the data to produce the output result.
You can create the directories (knox-sample, knox-sample/input, knox-sample/lib, and knox-sample/output) as follows:
cd /usr/lib/knox
curl -iku guest:guest-password -X PUT 'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample?op=MKDIRS&permission=777'
curl -iku guest:guest-password -X PUT 'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample/input?op=MKDIRS&permission=777'
curl -iku guest:guest-password -X PUT 'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample/lib?op=MKDIRS&permission=777'
curl -iku guest:guest-password -X PUT 'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample/output?op=MKDIRS&permission=777'
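The four calls differ only in the path, so an equivalent loop (a convenience sketch, not from the original) is:
for d in knox-sample knox-sample/input knox-sample/lib knox-sample/output; do
  curl -iku guest:guest-password -X PUT \
    "https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/$d?op=MKDIRS&permission=777"
done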
Let’s upload the data and the MapReduce jar files:
curl -iku guest:guest-password -L -T samples/hadoop-examples.jar -X PUT \
  'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample/lib/hadoop-examples.jar?op=CREATE'
curl -iku guest:guest-password -L -T README -X PUT \
  'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample/input/README?op=CREATE'
The -T flag uploads the local file, and -L follows the redirect that WebHDFS issues for CREATE operations.
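You can verify that both files landed (an extra check, not in the original steps):
curl -iku guest:guest-password -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample/lib?op=LISTSTATUS'
curl -iku guest:guest-password -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample/input?op=LISTSTATUS'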
Let’s run the MapReduce program:
curl -iku guest:guest-password --connect-timeout 60 -X POST \
  -d arg=/user/guest/knox-sample/input \
  -d arg=/user/guest/knox-sample/output \
  -d jar=/user/guest/knox-sample/lib/hadoop-examples.jar \
  -d class=org.apache.hadoop.examples.WordCount \
  'https://localhost:8443/gateway/sandbox/templeton/v1/mapreduce/jar'
Here is the same command saved in a shell script for easier execution; a minimal sketch is shown below.
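The script wraps the exact curl command above (the file name run-wordcount.sh is hypothetical; pick any name you like):
#!/bin/sh
# run-wordcount.sh: submit the WordCount job through the Knox gateway's
# Templeton (WebHCat) service.
curl -iku guest:guest-password --connect-timeout 60 -X POST \
  -d arg=/user/guest/knox-sample/input \
  -d arg=/user/guest/knox-sample/output \
  -d jar=/user/guest/knox-sample/lib/hadoop-examples.jar \
  -d class=org.apache.hadoop.examples.WordCount \
  'https://localhost:8443/gateway/sandbox/templeton/v1/mapreduce/jar'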
When you run the MapReduce submission step, the response is a small JSON document containing the job ID. Please note down the job ID; you will use it to check the status of the job in the next step.
You can check the status of the job ID above as follows:
curl -iku guest:guest-password 'https://localhost:8443/gateway/sandbox/templeton/v1/queue/job_1394770200462_004'
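The status response is also JSON; piping it through Python's JSON tool makes the job state easier to read (a convenience, not part of the original steps, and the exact fields vary by WebHCat version):
curl -sku guest:guest-password 'https://localhost:8443/gateway/sandbox/templeton/v1/queue/job_1394770200462_004' | python -m json.tool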
Let’s look at the list of directories under the knox-sample directory in HDFS:
curl -iku guest:guest-password -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample?op=LISTSTATUS'
These are the directories which we created earlier.
Let’s list the files in the output directory:
curl -iku guest:guest-password -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample/output?op=LISTSTATUS'
You should see a _SUCCESS marker and a part-r-00000 file in the listing.
Let’s look at the contents of the output file:
curl -iku guest:guest-password -L -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample/output/part-r-00000?op=OPEN'
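Since the full word-count output can be long, you can also look at just the beginning of it (an optional convenience, not from the original steps):
curl -sku guest:guest-password -L 'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/knox-sample/output/part-r-00000?op=OPEN' | head -20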
You just ran a MapReduce program on Hadoop through the Apache Knox service.
Try this tutorial with the Hortonworks Sandbox: these tutorials are designed to work with the Sandbox, a simple and easy way to get started with Hadoop. The Sandbox offers a full HDP environment that runs in a virtual machine.