New in HDP2: Encrypted communication with Hive between Hadoop and Analytics Tools
Security is one of the biggest topics in Hadoop right now. Historically Hadoop has been a back-end system accessed only by a few specialists, but the clear trend is for companies to put data from Hadoop clusters in the hands of analysts, marketers, product managers or call center employees whose numbers could be in the hundreds or thousands. Data security and privacy controls are necessary before this transformation can occur. HDP2, through the next release of Apache Hive, introduces a very important new security feature that allows you to encrypt the traffic that flows between Hadoop and popular analytics tools like MicroStrategy, Tableau, Excel and others.
This blog will explore this topic in more detail, as well as show you how you can configure this feature and try it out for yourself today.
Architecture of Hadoop Usage
Analytics tools like Tableau execute queries on Hadoop through a component called HiveServer2. HiveServer2 provides ODBC and JDBC connectivity to Hadoop and effectively serves as a gateway through which SQL queries are routed to Hadoop. This makes HiveServer2 a convenient single point that, when secured, ensures data privacy to analytics users.
Since its beginning, HiveServer2 has offered options for authentication to secure clusters via Kerberos, as well as the ability to run Hive queries as the authenticated user (the so-called doAs feature). Until now, however, all communication between HiveServer2 and its clients has been unencrypted. This was a problem for anyone who needed to expose sensitive data outside their secure environment.
One customer that felt this pain very keenly was Yahoo, who are busily deploying BI tools to their analysts. The lack of encryption was enough of a show-stopper at Yahoo that Arup Malakar and Chris Drome from Yahoo implemented HIVE-4991, which adds SASL QoP (quality of protection) support to HiveServer2 and allows encryption to be required through a server-side variable.
Try it yourself
If your interest is piqued, you might be surprised to find that you can be up and running in just a few minutes. HiveServer2 encryption is included as part of our HDP 2.0 Beta.
Here’s how you can try it out for yourself:
Step 1: Install HDP in secure mode.
A kerberized cluster is required for this feature. Visit hortonworks.com/download to get started with HDP 2.0 Beta. Of course, the feature will also be part of HDP 2.0 GA and future 2.x versions of HDP.
Step 2: Configure HDP to negotiate encrypted connections.
We’ll use Ambari to make this configuration change, so start by logging in to Ambari.
Step 2.1: Select the Hive/HCat service under Services.
Step 2.2: Stop Hive. This is necessary to make configuration changes.
Step 2.3: Confirm Stopping Hive.
Step 2.4: When Hive is stopped, select OK.
Step 2.5: Select the Configs tab.
Step 2.6: Select Custom hive-site.xml
Step 2.7: Select Add Property…
Step 2.8: Enter Key hive.server2.thrift.sasl.qop, Value auth-conf.
Step 2.9: Save the new property.
Step 2.10: Start Hive.
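If you would like to see what this setting amounts to outside of Ambari, here is a minimal sketch in Python that renders the same property as a hive-site.xml entry. The property name and the auth-conf value come from the steps above; auth, auth-int and auth-conf are the standard SASL quality-of-protection levels. The helper name qop_property_xml is made up for this illustration, and on an Ambari-managed cluster you should still make the change through Ambari rather than by editing the file by hand.

```python
import xml.etree.ElementTree as ET

# Standard SASL quality-of-protection levels, weakest to strongest.
SASL_QOP_LEVELS = {
    "auth": "authentication only",
    "auth-int": "authentication plus integrity checking",
    "auth-conf": "authentication, integrity checking and encryption",
}


def qop_property_xml(qop="auth-conf"):
    """Render hive.server2.thrift.sasl.qop as a hive-site.xml <property>.

    Hypothetical helper for illustration; it only builds the XML text.
    """
    if qop not in SASL_QOP_LEVELS:
        raise ValueError(f"unknown SASL QoP level: {qop!r}")
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "hive.server2.thrift.sasl.qop"
    ET.SubElement(prop, "value").text = qop
    return ET.tostring(prop, encoding="unicode")


print(qop_property_xml())
```

Only auth-conf encrypts the traffic; the other two levels leave query results readable on the wire.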
Step 3: Install and configure the Hortonworks Hive ODBC Driver.
Download the Hortonworks Hive ODBC Driver from our add-ons page and follow the installation instructions for your platform. If you are using a Mac, the Hortonworks Sandbox 1.3 has a tutorial that shows exactly how to install the ODBC driver, which I recommend following because the install is fairly involved. The Sandbox helps with ODBC setup, but you will need a full cluster to try the encryption feature because the Sandbox doesn’t support Kerberos. Once you have installed the driver, define an ODBC data source that points to your cluster.
Step 4: Configure Kerberos authentication for your client.
Your client will need a Kerberos ticket to continue. How you obtain the ticket depends on your OS. If you are using Windows, Appendix A of the ODBC user guide walks you through the process. Other systems will usually have a kinit program pre-installed.
Step 5: Securely connect your favorite analytics tool to Hadoop.
At this point, all that remains is to use the ODBC connection you’ve configured from your analytics tool of choice. Because HiveServer2 is now configured to require it, all communication over that connection will be encrypted.
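If the connection is refused, the first thing worth checking is whether the client actually holds a valid Kerberos ticket. Here is a small sketch of that check in Python, assuming an MIT Kerberos client whose klist command accepts the -s flag (run silently and report the result through the exit status). The function name has_kerberos_ticket is made up for this illustration, and the check does not apply to Windows clients set up through the ODBC user guide.

```python
import shutil
import subprocess


def has_kerberos_ticket(klist="klist"):
    """Return True if `klist -s` reports a readable, unexpired ticket cache.

    Assumes MIT Kerberos, where `klist -s` exits with status 0 when the
    credentials cache is usable and a non-zero status otherwise.
    """
    if shutil.which(klist) is None:
        return False  # no klist command found on PATH
    result = subprocess.run(
        [klist, "-s"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


print("Kerberos ticket present:", has_kerberos_ticket())
```

If this reports False, run kinit to obtain a ticket and try connecting again.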
More Security Goodness
We’re also excited to announce that, in addition to Kerberos authentication, HDP 2 will support LDAP authentication in HiveServer2. Many customers who don’t want to go through the process of fully “Kerberizing” their clusters find this an easier alternative that still meets their authentication needs.
HDP 2 brings critical improvements in authentication and privacy that are essential to enabling broad-based consumption of Hadoop. We welcome you to try it out today in the HDP Beta or in the HDP Beta Sandbox and give us your feedback.
Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.