
Securing your Data Lake Resource & Auditing User Access with HDP Advanced Security

Lab 1: Securing HDFS, Hive and HBase Data using Apache Ranger

Introduction

In this tutorial we will explore how you can use policies in Apache Ranger to protect your enterprise data lake and audit access by users to resources on HDFS, Hive and HBase from a centralized Ranger Administration Console.

Prerequisites

Outline

1. Start HBase and Ambari Infra Services

Go to Ambari and log in with the credentials raj_ops/raj_ops. If HBase is switched off, select it, open the Service Actions menu at the top right, and click Start.

start_hbase

Check the box for Maintenance Mode.

hbase_maintenance_mode

Next, click Confirm Start. After about 30 seconds, HBase will be running.
Similarly, start the Ambari Infra service, which Ranger uses to record all audits. Your Ambari dashboard should look like this:

ambari_dashboard_rajops_infra
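If you prefer the command line, the same services can be started through Ambari's REST API. A minimal sketch, assuming the Sandbox's default cluster name Sandbox and the raj_ops credentials (both are assumptions; adjust to your environment):

# Start HBase via the Ambari REST API; check your cluster name first with:
#   curl -u raj_ops:raj_ops http://localhost:8080/api/v1/clusters
curl -u raj_ops:raj_ops -H "X-Requested-By: ambari" -X PUT \
  -d '{"RequestInfo":{"context":"Start HBase"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
  http://localhost:8080/api/v1/clusters/Sandbox/services/HBASE

# The same call against .../services/AMBARI_INFRA (the service name on HDP 2.x)
# starts Ambari Infra.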

2. Log in to the Ranger Administration Console

Once the VM is running in VirtualBox, log in to the Ranger Administration console at http://localhost:6080/ from your host machine. The username is raj_ops and the password is raj_ops.

ranger_login_rajops

As soon as you log in, you should see a list of repositories, as shown below:

list_repositories

3. Review Existing HDFS Policies

Click the Sandbox_hadoop link under the HDFS section:

sandbox_hadoop_policies

You can review a policy's details with a single click on the box to the right of the policy. Click the HDFS Global Allow policy, then click the slider so it is in the disabled position:

click_hdfs_global_allow_disable

Then click Save.
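Everything the console does is also exposed through Ranger's public REST API, which is handy for scripting policy reviews. A sketch for listing the policies of the Sandbox_hadoop service seen above:

# List all policies of the HDFS service "Sandbox_hadoop" as JSON:
curl -u raj_ops:raj_ops \
  "http://localhost:6080/service/public/v2/api/service/Sandbox_hadoop/policy"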

4. Exercise HDFS Access Scenarios

Log in to Ambari with the following credentials:

Username – raj_ops
Password – raj_ops

Click the nine-square menu icon and select the Files view:

select_files_view

You will see a home page like the one below. Click on the demo folder:

files_view_home_page

Next, click on data. You will see a message like this:

demo_data_error

Click on Details; this takes you to a page showing that permission was denied for the user raj_ops:

demo_data_message
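The same denial can be reproduced from the Sandbox command line (SSH access to the Sandbox is described in step 8 below); a quick sketch:

su raj_ops
hdfs dfs -ls /demo/data

# With the global policy disabled, this fails with something like:
#   ls: Permission denied: user=raj_ops, access=READ_EXECUTE, inode="/demo/data" ...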

Go back to Ranger, open the Audit tab, and check that the denied event was audited. You can filter the results by setting Result to Denied:

audit_results_hdfs

Now, go back to the HDFS Global Allow policy and click the slider to re-enable it:

click_hdfs_global_allow_enable

Click Save.
Now let us go back to the Files view and navigate back to /demo/data. Thanks to the re-enabled HDFS global policy, you will see three folders under data.

Now head back to the Audit tab in Ranger and search by User: raj_ops. Here you can see that the request was allowed through:

audit_result_hdfs_allowed

5. Review Hive Policies

Click Access Manager => Resource Based Policies in the top menu, then click the Sandbox_hive link under the HIVE section to view the list of Hive policies:

sandbox_hive_policies

As before, you can review a policy's details with a single click on the box to the right of the policy.
Disable the Hive Global Tables Allow policy:

click_hive_global_allow_disable

Also disable the policy for raj_ops, holger_gov, maria_dev and amy_ds.
You should see a page like this:

sandbox_hive_policies_disabled

6. Exercise Hive Access Scenarios

Go back to Ambari, click the nine-square menu icon, and select the Hive view:

select_hive_view

Run the following query:

select * from foodmart.product;

You will see an error message stating that permission is denied for raj_ops because the user does not have the SELECT privilege:

foodmart_product_message
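The same check also works from the command line with Beeline, Hive's JDBC client; a sketch assuming HiveServer2 on the Sandbox's default port 10000:

beeline -u jdbc:hive2://localhost:10000 -n raj_ops \
  -e "select * from foodmart.product limit 5;"

# While the policies are disabled this fails with a HiveAccessControlException:
#   Permission denied: user [raj_ops] does not have [SELECT] privilege ...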

Next, go back to Ranger and open the Audit tab to see the denied access being audited. As with the HDFS check, just search the audit log by user:

audit_results_hive

Re-enable the Hive Global Tables Allow policy and the policy for raj_ops, holger_gov, maria_dev and amy_ds.
Go back to the Hive view and run the same query again:

select * from foodmart.product;

foodmart_product_successful

This time, the query runs successfully and you can see all the data in the product table. Go back to the Ranger Audit tab to see the granted access being audited:

audit_results_hive_allowed

7. Review HBase Policies

Click Access Manager => Resource Based Policies in the top menu, then click Sandbox_hbase to view the list of HBase policies:

sandbox_hbase_policies

You can review a policy's details with a single click on the box to the right of the policy. Disable the HBase Global Allow policy in the same manner as before.

8. Exercise HBase Access Scenarios

First, you need to log in to your Sandbox via SSH. If you’re using VirtualBox, you can log in with the command:

ssh root@127.0.0.1 -p 2222

The password for the first login is hadoop.

sshTerminal

Switch to the raj_ops user and start the HBase shell:

su raj_ops
hbase shell

hbase_shell_rajops

Run the following command in the HBase shell to check whether raj_ops can read data from the iemployee table:

get 'iemployee', '1'

You should get an AccessDeniedException like this:

iemployee_message

Let us check the audit log in Ranger too:

audit_results_hbase

Next, re-enable the HBase Global Allow policy.
After saving the change, go back to the HBase shell and run the same command again:

get 'iemployee','1'

iemployee_successful

Now you can view all the data in the iemployee table under row key 1. Go to Ranger to check the audit logs:

audit_results_hbase_allowed

9. Summary

Hopefully by following this tutorial, you got a taste of the power and ease of securing your key enterprise resources using Apache Ranger.

Happy Hadooping!!!

Lab 2: Row Level Filtering and Dynamic Column Masking in Apache Hive using Apache Ranger

Introduction

In this lab of the Apache Ranger tutorial, we will use Ranger to set access policies for row level filtering in Apache Hive tables. We will also cover Ranger masking capabilities to protect sensitive data like SSN or salary.

Prerequisites

Outline

1. Download the sample data

Download the driver data file from here.
Once you have the file, unzip it into a directory. We will be uploading two CSV files: drivers.csv and truck_event_text_partition.csv.

2. Upload the data files

Log in to Ambari with the following credentials:

Username – raj_ops
Password – raj_ops

Click the nine-square menu icon and select the Files view:

select_files_view

Navigate to /user/raj_ops and click on the Upload button to select the files we want to upload into the Hortonworks Sandbox environment.

upload_button_filesview

Click on the browse button to open a dialog box. Navigate to where you stored drivers.csv on your local disk, select it, and click Upload again. Do the same for truck_event_text_partition.csv. When you are done, you will see two new files in your directory.

uploaded_files
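Alternatively, if you have copied the files onto the Sandbox itself, the upload can be done from the shell; a minimal sketch:

su raj_ops
hdfs dfs -put drivers.csv truck_event_text_partition.csv /user/raj_ops/
hdfs dfs -ls /user/raj_ops    # verify that both files landed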

3. Create the tables in Hive

Let’s open the Hive view by clicking the Hive button in the top bar, as we did previously when we selected the HDFS Files view.

select_hive_view

Next, run the following query to create the drivers table:

create table drivers
(driverId int,
 name string,
 ssn bigint,
 location string,
 certified string,
 wageplan string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");

Click the green Execute button. After a few seconds, you will see the SUCCEEDED message:

create_table_drivers

Time to load the data into the newly created table. Type:

LOAD DATA INPATH '/user/raj_ops/drivers.csv' OVERWRITE INTO TABLE drivers;

And then click Execute:

load_data_drivers

Similarly, let us create the truck_events table and then load its data using the following commands:

create table truck_events
(driverId int,
truckId int,
eventTime string,
eventType string,
longitude double,
latitude double,
eventKey string,
correlationId bigint,
driverName string,
routeId int,
routeName string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");

Next, load the data:

LOAD DATA INPATH '/user/raj_ops/truck_event_text_partition.csv' OVERWRITE INTO TABLE truck_events;

Next, click the refresh icon next to the Database Explorer; you will see the two newly created tables:

database_explorer

Click on the box next to the table name to view the data:

select_drivers_data
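From the shell, a quick sanity check that both loads succeeded; a sketch using Beeline (HiveServer2 on the Sandbox's default port 10000 assumed):

beeline -u jdbc:hive2://localhost:10000 -n raj_ops -e "select count(*) from drivers;"
beeline -u jdbc:hive2://localhost:10000 -n raj_ops -e "select count(*) from truck_events;"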

Now that we have the tables ready, let us explore the row-level filtering and masking capabilities of Apache Ranger.

4. Row level filtering in Hive

Row-level filtering lets you restrict access to a Hive table to specific rows, based on the user's role in the organization. Row-level filter policies are similar to other Ranger access policies: you can set filters for specific users, groups, and conditions, and the filter expression must be a valid WHERE clause. Let us start with some use cases of row-level filtering.
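For example, once the policy built in section 4.1 below is in place, a filtered user's queries behave as if the policy expression were appended as a WHERE clause; a sketch of the effect using Beeline (HiveServer2 on the default port 10000 assumed):

beeline -u jdbc:hive2://localhost:10000 -n maria_dev -e "select * from drivers;"

# Returns only the rows matching the policy expression, as if maria_dev had run:
#   select * from drivers where wageplan = 'hours';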

4.1 Restrict access as per different types of wage plan in drivers table

There are only two values for the wageplan attribute in the drivers table: miles and hours. We are going to use two users for this use case, maria_dev and amy_ds, and restrict which records each can access based on the wage plan: maria_dev should only see records with the hours wage plan, and amy_ds only those with the miles wage plan.

Go back to the Ranger UI (credentials raj_ops/raj_ops). Click Access Manager => Resource Based Policies => Sandbox_hive, then switch to the Row Level Filter tab:

click_row_level_filter

Click the Add New Policy button on the right and enter the following details:

Policy Name - Row Filter to access drivers data
Hive Database - default
Hive Table - drivers

In Row filter Conditions:

Select Group - Leave it blank
Select User - maria_dev
Access Type - select
Row level filter - wageplan='hours'

NOTE: Do not forget to click the tick mark to save the row filter expression.

Next, click the + button to add one more row for the amy_ds user, and enter the following details in the row filter conditions:

Select Group - Leave it blank
Select User - amy_ds
Access Type - select
Row level filter - wageplan='miles'

Your Row Filter Conditions table should look like this:

row_filter_conditions

Verify that you have entered the correct values:

row_level_filter_policy1

Click Add. Next, log in to Ambari as the maria_dev user (credentials: maria_dev/maria_dev).

Go to the Hive view and let us try to see the data in the drivers table. Click on the default database and then the square next to the drivers table:

select_data_drivers_maria_dev

You can clearly see that only records with the hours wage plan are shown. Next, sign out of Ambari and log back in as the amy_ds user (credentials: amy_ds/amy_ds).

Repeat the same operation as above:

select_data_drivers_amy_ds

Only records with the miles wage plan are shown.
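You can verify the same split from the shell; a sketch counting rows per wage plan as each user:

beeline -u jdbc:hive2://localhost:10000 -n maria_dev \
  -e "select wageplan, count(*) from drivers group by wageplan;"    # only 'hours'
beeline -u jdbc:hive2://localhost:10000 -n amy_ds \
  -e "select wageplan, count(*) from drivers group by wageplan;"    # only 'miles'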

4.2 Restrict Access on the drivers table data based on truck_events table data

Let us create one more row-level filter condition to ensure that maria_dev sees records only for drivers whose route runs from Saint Louis to Memphis. The route name information is in another table, truck_events.

We will edit the same policy we created in the previous use case. Go back to the policy Row Filter to access drivers data and edit the first row of its Row Filter Conditions like this:

Row level filter expression - wageplan='hours' AND driverid in (select t.driverid from truck_events t where t.routename = 'Saint Louis to Memphis')

This ensures that maria_dev only sees records whose wage plan is hours and whose route name is 'Saint Louis to Memphis'. Your condition section should look like this:

row_level_conditions_1
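Before clicking Add, you can sanity-check the combined expression as an ordinary query; a sketch run as raj_ops, who is not row-filtered:

beeline -u jdbc:hive2://localhost:10000 -n raj_ops -e \
  "select * from drivers
   where wageplan = 'hours'
     and driverid in (select t.driverid from truck_events t
                      where t.routename = 'Saint Louis to Memphis');"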

Click Add and go back to the Hive view as the maria_dev user. Run the query to view the records in the drivers table:

select_data_drivers_maria_dev_1

5. Dynamic Column Masking in Hive

A column masking policy lets Ranger apply a masking condition in a Hive policy so that sensitive data is masked for specific users. A variety of masking types are available, such as show last 4 characters, show first 4 characters, hash, nullify, and date masks (show only year). Let us use the drivers table data for the data masking use cases as well.
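These transformers are implemented on top of Hive's built-in masking UDFs (mask_show_last_n, mask_hash, and friends); assuming your Hive build ships them, you can preview what masked output will look like with an ordinary query. A sketch (the replacement characters use the UDF defaults, which may differ from Ranger's transformer settings):

beeline -u jdbc:hive2://localhost:10000 -n raj_ops -e \
  "select mask_show_last_n(cast(ssn as string), 4) as ssn_preview,
          mask_hash(location)                      as location_preview
   from drivers limit 5;"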

5.1 Show only last 4 digits of SSN column

Go back to Ranger and click on Masking tab:

click_masking

Click Add New Policy on the right and enter the following details:

Policy Name - Masking in ssn column of drivers data
Hive Database - default
Hive Tables - drivers
Hive Column - ssn

In the mask conditions:

Select Group - Leave blank
Select User - maria_dev
Access Type - select
Select Masking Options - Partial mask: show last 4

Your mask conditions section should look like this:

mask_conditions_ssn

And the entire policy should look like:

mask_policy_1

Click Add. Now wait about 20 seconds for the policy to take effect, then go back to the Hive view to see the records of the drivers table:

select_data_drivers_mask_1

Ranger replaced the first five digits of the SSN with the value 1 and retained the last four.

NOTE: If the SSN had been stored in the usual format (111-22-3333), the masked value would have looked like xxx-xx-3333.

5.2 Convert location column values to hash values

In this use case, we will see how Ranger hashes the values of the location column through a masking policy.

Go back to the Masking tab of the Ranger console and click the Add New Policy button. Enter the following details:

Policy Name - Masking in location column of drivers data
Hive Database - default
Hive Tables - drivers
Hive Column - location

In the mask conditions:

Select Group - Leave blank
Select User - maria_dev
Access Type - select
Select Masking Options - Hash

Your mask conditions section should look like this:

mask_conditions_location

And the overall policy should look like this:

mask_policy_2

Click Add. Again, wait about 20 seconds, then go back to the Hive view to see the records of the drivers table:

select_data_drivers_mask2

Note that the location column values have been replaced with hashes, which appear as random alphanumeric strings.

Next, let us view the data as another user, amy_ds. Sign out of Ambari and log back in as amy_ds (credentials: amy_ds/amy_ds).

select_data_drivers_amy_ds

Neither the ssn nor the location column is masked, because amy_ds is not covered by either policy and therefore gets unmasked results.

6. Summary

In this tutorial, we learned how to create row-level filter and masking policies in Apache Ranger to restrict access to Hive tables and columns.