January 05, 2018

4 essential steps for managing sensitive data in your data lake

By: Balaji Ganesan, CEO of Privacera

How to leverage data discovery, control, anonymization and monitoring using Privacera, Apache Atlas and Ranger

As data grows in data lakes, so do security and compliance risks. These risks stem from storing and processing sensitive data.

Forrester defines toxic data (its term for sensitive data) as 3P + IP: the three Ps are PII, PHI and PCI data, and IP is intellectual property. Sensitive data carries the greatest risk if it is compromised, leaked or accessed inappropriately.

So how do companies manage sensitive data in their growing data lakes?
Option A – Do not bring any sensitive data into the data lake. This option limits exposure but also limits the use cases that can be built on the data lake.
Option B – Bring any kind of data into the data lake but institute rigorous standards for managing data risks. This option unlocks the power of big data but requires teams to spend time building security and governance standards.

This blog is intended for companies that plan to ingest any kind of data into the data lake and enable business teams to use it. Here are the four essential steps data teams should follow to manage sensitive data and the associated security and compliance risks:

  1. Automated data classification. Without proper data classification, security and governance teams cannot institute proper controls or get visibility into risks. Data classification is the foundation for the modern data lake.
  2. Access control. Companies need to put controls in place to restrict access to sensitive data and ensure access is granted only on an as-needed basis.
  3. Anonymization. To reduce exposure, institute policies to anonymize data as it is ingested into the data lake. Different data protection methods are available for different use cases.
  4. Monitoring. Big data gives users the power to combine, transform and move data. Institute monitoring to detect potential data loss or behavior that leads to a compliance or security violation.

Privacera is a fast-growing data security and governance startup and a leading Hortonworks partner. The Privacera platform integrates with Apache Atlas and Apache Ranger and extends the security controls available in HDP to give data teams comprehensive functionality for managing sensitive data.


Here is how Privacera + HDP can help with the four steps outlined above to manage data-related risks effectively.

Automated Data Classification

Privacera incorporates machine learning and NLP along with built-in rules to discover and classify sensitive data precisely. It connects to HDFS, Hive and other data stores, and analyzes content, context and metadata to identify and classify data. Privacera can scan structured and unstructured data, either as it lands in the data lake or once it is stored in HDFS.
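To make the idea of rule-based discovery concrete, here is a minimal sketch of a content scanner. It is not Privacera's implementation: the tag names, the regular expressions and the `classify` function are all assumptions made for illustration, and a production classifier would also weigh context, metadata and ML signals.

```python
import re

# Hypothetical rule set: tag name -> pattern. Illustrative only.
RULES = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def classify(text: str) -> set[str]:
    """Return the set of sensitive-data tags whose rules match `text`."""
    tags = set()
    for tag, pattern in RULES.items():
        for match in pattern.finditer(text):
            # Card-like digit runs are only tagged if the checksum holds,
            # which cuts false positives from order or tracking numbers.
            if tag == "CREDIT_CARD" and not luhn_valid(match.group()):
                continue
            tags.add(tag)
    return tags
```

Calling `classify` on a record that contains an email address and a US Social Security number yields both tags, while a 16-digit number that fails the Luhn check is left untagged.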

Privacera then pushes the resulting tags and metadata into Apache Atlas, where they can be searched and queried through the Atlas UI or APIs.
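As an illustration of querying tags through the API, the sketch below builds a request URL for the Atlas v2 basic search endpoint, filtered by classification. The host name and the `PII` tag are assumptions, and the endpoint and parameter names should be checked against the Atlas version you run. The function only constructs the URL; sending it (with authentication) is left to the caller.

```python
from urllib.parse import urlencode


def atlas_tag_search_url(base_url, classification, type_name=None, limit=25):
    """Build a basic-search URL that lists entities carrying a given tag."""
    params = {"classification": classification, "limit": limit}
    if type_name:
        # Optionally narrow the search to one entity type, e.g. hive_table.
        params["typeName"] = type_name
    return f"{base_url.rstrip('/')}/api/atlas/v2/search/basic?{urlencode(params)}"


# Hypothetical host; 21000 is the usual non-SSL Atlas port.
url = atlas_tag_search_url("http://atlas.example.com:21000", "PII", "hive_table")
```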

Access Control

Once data is ingested into the data lake, fine-grained access control policies need to be implemented to ensure users get access to data only on an as-needed basis.
Using Apache Ranger, data teams can construct policies based on data sensitivity levels. Through the Apache Atlas and Ranger integration, metadata discovered by Privacera is pushed into Apache Atlas and Ranger. Administrators can then construct tag-based policies that enable or restrict access to any sensitive data.
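The logic behind a tag-based policy can be sketched in a few lines. This is a deliberate simplification and not Ranger's policy engine or API: the `policies` mapping, the group names and `is_allowed` are hypothetical, and real Ranger policies also cover users, access types, deny conditions and resource-based rules.

```python
def is_allowed(user_groups: set, resource_tags: set, policies: dict) -> bool:
    """Allow access only if every governed tag on the resource permits
    at least one of the user's groups."""
    for tag in resource_tags:
        allowed_groups = policies.get(tag)
        if allowed_groups is None:
            continue  # tag is not governed by any policy in this sketch
        if not (user_groups & allowed_groups):
            return False  # one restricted tag is enough to deny
    return True


# Hypothetical policies: tag -> groups that may read data carrying the tag.
policies = {
    "PII": {"compliance"},
    "PCI": {"finance", "compliance"},
}
```

The point of the design is that a policy is written once per tag rather than once per table or column, so newly classified data is covered as soon as it is tagged.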


Anonymization

Compliance and privacy regulations mandate that personal information be anonymized and encrypted at rest and while being accessed. As data lakes grow, sensitive data may need to be anonymized to reduce exposure and manage compliance and security risks.

Privacera extends the dynamic anonymization feature available in Ranger with format-preserving encryption and tokenization capabilities. Privacera can help with:

  1. Anonymizing or tokenizing data as it is ingested or while it is stored within the data lake. Privacera can preserve the format of the data so that it remains usable for analytics while its confidentiality and privacy are protected.
  2. Anonymizing or de-anonymizing data only for specific users, depending on the business need.
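The two capabilities above can be illustrated with a small sketch. Note what it is and is not: it is a keyed, format-preserving pseudonymization built on HMAC-SHA256, not standardized format-preserving encryption (such as NIST FF1) and not Privacera's method, and it is one-way. Reversible tokenization would need a real FPE cipher or a token vault. The function names, key and group names are assumptions.

```python
import hashlib
import hmac
import string


def pseudonymize(value: str, key: bytes) -> str:
    """Replace digits with digits and ASCII letters with letters of the
    same case, leaving separators untouched. Deterministic per key."""
    stream = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = stream[i % len(stream)]  # reuse the 32-byte digest cyclically
        if ch in string.digits:
            out.append(string.digits[b % 10])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[b % 26])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[b % 26])
        else:
            out.append(ch)  # keep separators and other characters as-is
    return "".join(out)


def view_for(value: str, key: bytes, user_groups: set, cleared_groups: set) -> str:
    """Return the clear value only to users in a cleared group."""
    return value if user_groups & cleared_groups else pseudonymize(value, key)
```

Because the output keeps the original length and layout (an SSN still looks like `ddd-dd-dddd`) and the same input always maps to the same output under a given key, masked columns can still be joined and grouped in analytics.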


Monitoring

Beyond access policy enforcement, auditing all user activity for compliance and legal purposes is a recommended step. Compliance and security teams often use audit data to analyze how users are using data. As data and user bases grow, it becomes challenging for compliance teams to manually analyze reports and audit logs to measure adherence to compliance and legal regulations.

Privacera collects and analyzes audit data and monitors data use across various parameters, so security risks and compliance violations can be detected proactively. The Privacera monitoring module stitches together user information and detects data movements or unusual user behavior. The end result is a set of alerts that administrators can view in the Privacera portal and act on.
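A toy version of this kind of audit analysis is sketched below. It is an assumption-laden illustration, not Privacera's detection logic: the event shape (`user`, `tags`), the median baseline and the `factor` threshold are all hypothetical choices, and a real system would use richer features and time windows.

```python
from collections import Counter
from statistics import median


def flag_unusual_users(events, sensitive_tags, factor=3.0):
    """Flag users whose count of sensitive-data accesses exceeds
    `factor` times the median count across all users."""
    counts = Counter(
        e["user"] for e in events if sensitive_tags & set(e["tags"])
    )
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(u for u, n in counts.items() if n > factor * baseline)


# Hypothetical audit events: three users with light PII access, one heavy.
events = (
    [{"user": u, "tags": ["PII"]} for u in ("alice", "bob", "carol") for _ in range(2)]
    + [{"user": "mallory", "tags": ["PII"]}] * 20
    + [{"user": "alice", "tags": []}] * 50  # untagged reads are ignored
)
```

Here only accesses to data tagged as sensitive count toward the baseline, which ties the monitoring step back to the classification step: without tags there is no way to tell a heavy but harmless workload from a risky one.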


Data lakes are growing and data teams are embracing new use cases. Enterprises should embrace security and governance best practices while building the data lake. Data teams must look at automated data classification, building controls based on data content, and implementing data protection and monitoring to ensure sensitive data is protected at all times.

Join Hortonworks and Privacera for a webinar expanding on securing data lakes on January 24, at 11am PST. Register Now

For more information, please visit us at or reach out through email at
