April 14, 2015

Apache Atlas Project Proposed for Hadoop Governance

Enterprises across all major industries adopt Apache Hadoop for its ability to store and process an abundance of new types of data in a modern data architecture. This “Any Data” capability has always been a hallmark of Hadoop, unlocking insight from new data sources such as clickstream, web and social, geo-location, IoT, and server logs, as well as from traditional data sets in ERP, CRM, SCM, and other existing systems.

But this means that enterprises adopting a modern data architecture with Hadoop must reconcile data management realities when they bring existing and new data from disparate platforms under management. As customers deploy Hadoop into corporate data and processing environments, metadata and data governance must be vital parts of any enterprise-ready data lake.

For these reasons, we established the Data Governance Initiative (DGI) with Aetna, Merck, Target, and SAS to introduce a common approach to Hadoop data governance into the open source community. Since then, this co-development effort has grown to include Schlumberger. Together we work on this shared framework to shed light on how users access data within Hadoop while interoperating with and extending existing third-party data governance and management tools.


A New Project Proposed to the Apache Software Foundation: Apache Atlas

I am proud to announce that engineers from Aetna, Hortonworks, Merck, SAS, Schlumberger, Target and others have submitted a proposal for a new project called Apache Atlas to the Apache Software Foundation. The founding members of the project include all the members of the DGI and others from the Hadoop community.

Apache Atlas proposes to provide governance capabilities in Hadoop that use both prescriptive and forensic models enriched by business taxonomical metadata. Atlas, at its core, is designed to exchange metadata with other tools and processes within and outside of the Hadoop stack, thereby enabling platform-agnostic governance controls that effectively address compliance requirements.


The core capabilities defined by the project include the following:

  • Data Classification – to create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources
  • Centralized Auditing – to provide a framework for capturing and reporting on access to and modifications of data within Hadoop
  • Search and Lineage – to allow pre-defined and ad-hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed
  • Security and Policy Engine – to protect data and rationalize data access according to compliance policy
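As an illustration of the data classification capability above, the sketch below builds a request payload that attaches a business-taxonomy tag to a dataset registered in a metadata store. Atlas was only a proposal at the time of this post, so the field names, the tag name `PII`, and the payload shape are assumptions for illustration, not a published Atlas API:

```python
import json

def build_tag_request(entity_guid, tag_name, attributes=None):
    """Build an illustrative JSON payload associating a business
    classification (tag) with an already-registered dataset.
    Field names are assumptions for this sketch, not an actual API."""
    return {
        "entityGuid": entity_guid,
        "classification": {
            "typeName": tag_name,           # e.g. a business-taxonomy term
            "attributes": attributes or {}  # extra metadata about the tag
        },
    }

# Tag a hypothetical Hive table as containing personally identifiable data.
payload = build_tag_request(
    entity_guid="hive_table-1234",
    tag_name="PII",
    attributes={"owner": "compliance-team"},
)
print(json.dumps(payload, indent=2))
```

A metadata consumer sharing the common store could then discover every dataset carrying the `PII` tag, which is the interoperability the capability list describes.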

The Atlas community plans to deliver those requirements with the following components:

  1. Flexible Knowledge Store,
  2. Advanced Policy Rules Engine,
  3. Agile Auditing,
  4. Support for specific data lifecycle management workflows built on the Apache Falcon framework, and
  5. Integration and extension of Apache Ranger to add real-time, attribute-based access control to Ranger’s already strong role-based access control capabilities.
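To make the distinction in item 5 concrete, here is a minimal, hypothetical access decision that layers an attribute check on top of a role check. The attribute names (`roles`, `clearance`, `tags`) and the policy shape are assumptions for this sketch, not Ranger's actual policy model:

```python
def is_access_allowed(user, resource):
    """Illustrative access decision combining a role check (RBAC)
    with attribute checks (ABAC). Attribute names are assumptions."""
    # Role-based: the user must hold a role that grants read access.
    if "analyst" not in user["roles"]:
        return False
    # Attribute-based: data tagged as sensitive additionally
    # requires a matching user attribute (clearance).
    if "PII" in resource["tags"] and user.get("clearance") != "high":
        return False
    return True

analyst = {"roles": ["analyst"], "clearance": "high"}
junior = {"roles": ["analyst"]}          # same role, no clearance
table = {"tags": ["PII"]}

print(is_access_allowed(analyst, table))  # → True
print(is_access_allowed(junior, table))   # → False
```

The point of the sketch: two users with identical roles get different decisions once the resource's classification attributes enter the evaluation, which is what attribute-based control adds over roles alone.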

Why Atlas?

Atlas targets a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop while ensuring integration with the whole data ecosystem. Apache Atlas is organized around two guiding principles:

  • Metadata Truth in Hadoop: Atlas should provide true visibility in Hadoop. By using both a prescriptive and forensic model, Atlas provides technical and operational audit as well as lineage enriched by business taxonomical metadata. Atlas facilitates easy exchange of metadata by enabling any metadata consumer to share a common metadata store that facilitates interoperability across many metadata producers.
  • Developed in the Open: Engineers from Aetna, Merck, SAS, Schlumberger, and Target are working together to help ensure Atlas is built to solve real data governance problems across a wide range of industries that use Hadoop. This approach is an example of open source community innovation that helps accelerate product maturity and time-to-value for the data-first enterprise.

Stay Tuned for More to Come

The proposal of Apache Atlas represents a significant step on the journey to addressing data governance needs for Hadoop completely in the open.

Seetharam Venkatesh and I are talking about the Data Governance Initiative and the Apache Atlas project proposal at Hadoop Summit Europe in Brussels.

After the event, we will be back with more on the exciting progress of Apache Atlas!


  • This is such an important and ambitious component for open source – we have discussions about once a week in my bank around trying to solve exactly these sorts of data governance problems around Hadoop as well as other systems. If Hadoop gets stronger in this area I’m sure the gravity of the platform will increase as well.

    I hope it comes quickly but even more importantly I hope it’s really ready, robust and well coded by the time it’s released.

  • Fantastic initiative! I echo Hari’s comments above. This is a crucial component in preventing the ‘data-lake’ becoming a swamp.

  • We tried and tested Atlas; there are a lot of simple but effective solutions that can be built quickly. I am open to sharing and contributing.

    Through my years as a consultant I have built and experienced a ton of data management solutions and have custom built some frameworks and solutions. Would be happy to contribute.

    • Hi Kashi,

      I am a lead developer looking for some documentation and sample solutions implemented in Atlas. Would you be able to help?

  • Not sure this is the right place:

    We would like to know the answers to the following questions:

    1. Does Atlas support tags for Spark, Pig, and Sqoop?

       ◦ If Atlas supports Pig, Sqoop, and Spark, is there any customization we have to perform to make them work with Atlas?

       ◦ Where can we find the exact process for integrating these Hadoop components with Atlas?

    2. Where does Atlas store its metadata?

       ◦ How can we access the schema of the Atlas metadata?

    3. Does Atlas provide a feature to fetch metadata from other tools (e.g. Informatica) and integrate it with its own metadata?

       ◦ If so, how can we integrate Informatica metadata with Atlas?
