March 10, 2015

Best Practices for Hive Authorization Using Apache Ranger in HDP 2.2

Apache Hive is the de facto standard for SQL in Hadoop, with more enterprises relying on this open source project than on any alternative. Stinger.next, a community-based effort, is delivering true enterprise SQL at Hadoop scale and speed.

With Hive’s prominence in the enterprise, security within Hive has come under greater focus from enterprise users, who now expect fine-grained access control and auditing. Apache Ranger provides centralized security administration for Hadoop, and it enables fine-grained access control and deep auditing for Apache components such as Hive, HBase, HDFS, Storm and Knox.

This blog covers the best practices for configuring security for Hive with Apache Ranger and focuses on the use cases of data analysts accessing Hive, covering three scenarios:

  • Data analysts accessing only HiveServer2, with limited access to HDFS files
  • Data analysts accessing both HiveServer2 and HDFS files through Pig/MR jobs
  • Data analysts accessing Hive CLI

For each scenario, we will illustrate how to configure Hive and Ranger and discuss how security is handled. You can use either deployment: the Sandbox or an HDP 2.2 cluster installed using Apache Ambari. Note the prerequisites below.

Prerequisites

  • HDP 2.2 Sandbox: If you are using the HDP 2.2 Sandbox, ensure that you disable the global “allow policies” in Ranger before configuring any security policies. The global “allow policy” is the default in the sandbox, to let users access Hive and HDFS without any permission checks.

  OR

  • HDP 2.2 cluster: Ranger plugins for HDFS and Hive as well as Ranger admin installed manually (documentation for Ranger install can be found here).

Scenario 1 – HiveServer2 access with limited HDFS access

In this scenario, many analysts access data through HiveServer2, though specific administrators may have direct access to HDFS files.

Column level access control over Hive data is a major requirement. You can enable column level security access by following these steps:

Step 1. Hive Configuration

In Ambari → Hive → Config, ensure that hive.server2.enable.doAs is set to “false”. With this setting, HiveServer2 runs MR jobs in HDFS as the “hive” user. Permissions on the HDFS files related to Hive then need to be granted only to the “hive” user, and no analyst can access those files directly.

hive_ranger_1
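To double-check the setting outside Ambari, the value can be read from hive-site.xml. The following is a minimal sketch in Python, assuming the usual Hadoop configuration layout of property elements with a name and a value. The helper name get_property and the inline sample XML are illustrative and not part of HDP.

```python
import xml.etree.ElementTree as ET

# Sample hive-site.xml content (illustrative; a real file has many more properties).
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
  </property>
</configuration>"""


def get_property(xml_text, name):
    """Return the value of a Hadoop-style property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


do_as = get_property(SAMPLE_HIVE_SITE, "hive.server2.enable.doAs")
print("hive.server2.enable.doAs =", do_as)
```

On a real cluster you would pass the contents of the deployed hive-site.xml instead of the sample string.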

Step 2. Ranger configuration

With Ranger installed, you can configure a policy at a column level as shown below:

hive_ranger_2

In this example, the marketing group has access only to the “phone number”, “plan” and “date” columns in the “customer_details” table.
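The effect of such a policy can be pictured as an allow-list lookup. The sketch below is a toy model in Python, not Ranger's implementation: the dictionary layout, the function name denied_columns, the column identifiers (phone_number, plan, date) and the extra column “name” are all assumptions made for illustration.

```python
# Toy model of a column-level allow policy. This is NOT how Ranger is implemented;
# the data layout and column identifiers are assumptions for illustration only.
POLICY = {
    "group": "marketing",
    "table": "customer_details",
    "columns": {"phone_number", "plan", "date"},
}


def denied_columns(policy, group, table, requested):
    """Return the requested columns that the policy does not allow, sorted."""
    if group != policy["group"] or table != policy["table"]:
        # The policy does not cover this group/table, so nothing is allowed by it.
        return sorted(requested)
    return sorted(set(requested) - policy["columns"])


# A query touching only an allowed column passes the check.
print(denied_columns(POLICY, "marketing", "customer_details", ["phone_number"]))
# A query that also touches a hypothetical "name" column is rejected for that column.
print(denied_columns(POLICY, "marketing", "customer_details", ["phone_number", "name"]))
```

Under this model, a query is allowed only when the returned list is empty, which is why selecting every column of the table would fail for the marketing group.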

Step 3. Run a query

You can use Hue or Beeline to run a query against this table. In this example from the sandbox, we have used user “mktg1” to run the query against this table.

hive_ranger_3

After successfully running the query, check the audit logs in Ranger.

hive_ranger_4

You will see the query running in Hive as the original user (“mktg1” in this case), while the related tasks in HDFS will be executed as the “hive” user.

With Ranger enabled, the only way data analysts can view data would be through Hive and the access in Hive would be controlled at the column level. Administrators who need access at HDFS level can be given permissions through Ranger policies for HDFS or through HDFS ACLs.

Scenario 2 – HiveServer2 and HDFS access

In this scenario, analysts use HiveServer2 to run SQL queries while also running Pig/MR jobs directly on HDFS data. In this case, we need to enable permissions within Hive as well as HDFS.

As in the previous scenario, ensure that Hive and Ranger are installed and that Ambari is up and running. If you are using the sandbox, ensure that any global policies in Ranger have been disabled.

Step 1. Configuration Changes: hive-site.xml or in Ambari → Hive → Config

In Ambari → Hive → Config, ensure that hive.server2.enable.doAs is set to “true”. With this setting, HiveServer2 runs MR jobs in HDFS as the original user.

hive_ranger_5

Make sure to restart the Hive service in Ambari after changing any configuration.
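The difference between the two doAs settings comes down to which identity HDFS sees. The following Python sketch is a deliberately simplified model of that rule as described in this post. The function name hdfs_user is hypothetical, and the sketch ignores everything else that HiveServer2 does.

```python
def hdfs_user(do_as, end_user, service_user="hive"):
    """Return the identity HDFS sees for work launched by HiveServer2.

    Simplified model: with doAs enabled the end user is impersonated,
    otherwise the work runs as the Hive service user.
    """
    return end_user if do_as else service_user


# Scenario 1 (doAs = false): HDFS permissions are checked against "hive".
print(hdfs_user(False, "mktg1"))
# Scenario 2 (doAs = true): HDFS permissions are checked against the analyst.
print(hdfs_user(True, "mktg1"))
```

This is why Scenario 2 needs HDFS policies for the analysts themselves, while Scenario 1 only needs them for the “hive” user.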

Step 2. In Ranger, within HDFS, create permissions for files pertaining to Hive tables

In the example below, we give the marketing team “read” permission on the files corresponding to the Hive table “customer_details”.

hive_ranger_6

The users can access data through HDFS commands as well.

hive_ranger_7

Step 3. Check the audit logs in Ranger

You will see audit entries in Hive and HDFS with the original user’s ID.

hive_ranger_8

Scenario 3 – Hive CLI access

If the analysts use Hive CLI as the predominant method for running queries, we need to configure security differently.

Hive CLI loads the Hive configuration into the client and gets data directly from HDFS or through MapReduce/Tez tasks. The best way to protect Hive CLI is to enable permissions for the HDFS files/folders mapped to the Hive database and tables. To secure the metastore, it is also recommended to turn on storage-based authorization.

Please note that the Ranger Hive plugin applies only to HiveServer2. Hive CLI should be protected using permissions at the HDFS folder/file level, through Ranger or HDFS ACLs.
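To know which HDFS folder to protect, it helps to see how a table name maps to a folder under the warehouse directory. The Python sketch below assumes the default layout for managed tables, with /apps/hive/warehouse as the warehouse root used in this post. The function name table_path is hypothetical. External tables, or tables created with a custom location, live elsewhere and are not covered by this sketch.

```python
import posixpath

WAREHOUSE_DIR = "/apps/hive/warehouse"  # warehouse root used in this post


def table_path(database, table, warehouse_dir=WAREHOUSE_DIR):
    """Return the default HDFS folder for a managed Hive table.

    Assumes the default layout: tables of the "default" database sit directly
    under the warehouse root, other databases get a "<database>.db" folder.
    """
    if database == "default":
        return posixpath.join(warehouse_dir, table)
    return posixpath.join(warehouse_dir, database + ".db", table)


print(table_path("xademo", "customer_details"))
```

When in doubt about where a specific table lives, running DESCRIBE FORMATTED on the table in Hive shows its actual location.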

  1. First, identify the files corresponding to tables in Hive. You can look through the directory /apps/hive/warehouse.
  2. Set permissions for this folder in Ranger -> HDFS Policies.

    hive_ranger_9

  3. Run queries through Hive CLI

    [root@sandbox ~]# su - mktg1
    [mktg1@sandbox ~]$ hive
    hive> use xademo;
    OK
    Time taken: 9.855 seconds
    hive> select phone_number from customer_details;
    OK
    PHONE_NUM
    5553947406
    7622112093
    5092111043
    9392254909
    7783343634

  4. Check audit entries in Ranger

    hive_ranger_10

  5. Run any DDL commands through Hive CLI.

    [root@sandbox ~]# su - it1
    [it1@sandbox ~]$ hive
    hive> use xademo;
    OK
    Time taken: 12.175 seconds
    hive> drop table customer_details;

    FAILED: SemanticException Unable to fetch table customer_details. java.security.AccessControlException: Permission denied: user=it1, access=READ, inode="/apps/hive/warehouse/xademo.db/customer_details":hive:hdfs:drwx------

  6. The drop table action is denied due to lack of permission at the HDFS level. This can be verified in the Ranger audit logs:

    hive_ranger_11
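When troubleshooting denials like the one in step 5, it can help to pull the user, path, owner and mode out of the message. The Python sketch below parses a message of that shape, assuming straight quotes and plain hyphens as they would appear on a console. The regular expression and the helper name parse_denial are illustrative and cover only this message format.

```python
import re

# Pattern for an HDFS "Permission denied" message of the shape seen in step 5.
DENIAL = re.compile(
    r'user=(?P<user>[^,]+), access=(?P<access>\w+), '
    r'inode="(?P<path>[^"]+)":(?P<owner>[^:]+):(?P<group>[^:]+):'
    r'(?P<mode>[d-][rwx-]{9})'
)

SAMPLE = (
    'Permission denied: user=it1, access=READ, '
    'inode="/apps/hive/warehouse/xademo.db/customer_details":hive:hdfs:drwx------'
)


def parse_denial(message):
    """Return the fields of a denial message as a dict, or None if it does not match."""
    match = DENIAL.search(message)
    return match.groupdict() if match else None


info = parse_denial(SAMPLE)
print(info["user"], info["access"], info["path"])
print(info["owner"], info["group"], info["mode"])
```

Here the mode drwx------ shows that only the owner, “hive”, can read the table folder, which is why user “it1” is rejected.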

Summary

Hive will continue to evolve as the predominant application for accessing data within Hadoop. With Apache Ranger, you can configure policies to support fine-grained access control in Hive and HDFS and secure your data from unauthorized access. Use this blog as a guide to configure security policies that best support your data access needs and use cases.

Comments

  • When I create Hive tables from HDFS files through Waterline Data Inventory and Beeline (JDBC connections), I get a permission error that implies that the Hive superuser needs to have write access to the directory where the source HDFS file resides. That’s even if I make the Hive call with a user with access to both the HDFS files/directory and the Hive database. Using Ranger, I have to explicitly give ‘hive’ write access to the relevant directories in HDFS. Is that what you’d expect? or is there some configuration parameter somewhere that stops Hive from expecting to write in the source file’s directory? Yes, I’ve configured Hive to use impersonation/doAs. I’m surprised that the Hive superuser would have to have such broad write access.

  • You seem to suggest that either use impersonation+HDFS policies or turn off impersonation+Hive Policies.

    What should be my approach if I want to use hive impersonation (because the underlying data may be accessed via Hive and SparkSQL) but I’d also like to define policies for Hive (because it gives me a higher granularity at a table/column level instead of having to contend with files.) Can I not use impersonation+HDFS Policies+Hive Policies in Ranger?
