Data Governance Initiative

Centralized, Comprehensive Approach to Governance in Hadoop

More than ever, businesses rely on the quality of their data for day-to-day decision-making, business insights, and regulatory reporting. While Hadoop and the Modern Data Architecture have made it easier for organizations to scale and to speed time to insight, they have also made governance concerns more urgent: never before has data been so diverse and so complex to consume or manage.

At Hortonworks, we are acutely aware of the governance challenges within the Hadoop ecosystem of projects. Each project (Hive, HBase, Pig, etc.) has its own approach to process and metadata. There is no consistent, centralized standard to tie things together. Further, there is no common conduit for Hadoop to work within the data governance frameworks already found in the traditional enterprise environment.

Defining Data Governance in the Modern Data Architecture

Together with Target, Merck, Aetna and SAS, we have spearheaded the Data Governance Initiative for Hadoop to address these key challenges and define a way forward for implementing a centralized, comprehensive approach to governance within Hadoop and for integrating Hadoop with current governance frameworks.

Initiative Goals

Governance standards & protocols must be clearly defined and available to all.
Reproducible & Auditable
The relevant data landscape must be recreatable at a point in time and traceable with appropriate historical lineage.
Compliance practices must be consistent within Hadoop and across the larger ecosystem of tools.

Defining the goals

Together, the members of the initiative have outlined key deliverables to support these goals, which will be executed in a three-phase project.

Phase 1 – Key Deliverables

In phase 1, we have collected the baseline requirements that will allow governance to be applied across the Hadoop stack. Some of these key capabilities include:

  • Knowledge store
  • Integration with Ranger for security enforcement
  • Integration with Falcon for data life cycle
  • Deep, searchable and taggable Audit Store
  • Advanced rules-based policy engine
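
To make these capabilities a little more concrete, here is a minimal sketch of how tagged assets, a rules-based policy check, and a searchable audit store could fit together. It is illustrative only: the names Asset, Rule, AuditEvent and PolicyEngine are hypothetical and do not correspond to any Ranger, Falcon or other Hadoop API, and the rule semantics (every rule matching a tag must permit the role) are an assumption made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Asset:
    """A governed data asset (hypothetical model), e.g. a Hive table."""
    name: str
    tags: set = field(default_factory=set)


@dataclass
class Rule:
    """A policy rule: only the listed roles may access assets with this tag."""
    tag: str
    allowed_roles: set


@dataclass
class AuditEvent:
    """One record in the audit store."""
    timestamp: datetime
    user: str
    asset: str
    tags: frozenset
    allowed: bool


class PolicyEngine:
    """Evaluates tag-based rules and records every decision for audit."""

    def __init__(self, rules):
        self.rules = list(rules)
        self.audit_log = []

    def check(self, user, role, asset):
        # Assumed semantics: access is allowed only if every rule that
        # matches one of the asset's tags permits the caller's role.
        matching = [r for r in self.rules if r.tag in asset.tags]
        allowed = all(role in r.allowed_roles for r in matching)
        self.audit_log.append(
            AuditEvent(datetime.now(timezone.utc), user, asset.name,
                       frozenset(asset.tags), allowed)
        )
        return allowed

    def search_audit(self, tag):
        # Return all audit events that involved an asset carrying the tag.
        return [e for e in self.audit_log if tag in e.tags]


engine = PolicyEngine([Rule("PII", {"compliance"})])
customers = Asset("sales.customers", {"PII"})
engine.check("alice", "analyst", customers)     # denied by the PII rule
engine.check("bob", "compliance", customers)    # permitted
```

In this sketch the tag, not the individual asset, carries the policy, so newly tagged data is covered without writing new rules, and every decision is recorded so it can later be searched by tag.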

Future Phases: Key Deliverables
(order and delivery schedule TBD)

  • Automated ingestion and tagging
  • Policy Engine Enhancements
  • Hive schema lineage at column level

Essential Timeline

Phase 1 (Q4 2013)
  • Incubate Apache Falcon
  • Dataset Replication
  • Dataset Retention
  • Falcon Tech Preview
Phase 2 (Delivered, HDP 2.1)
  • Basic Pipeline Dashboard
  • Kerberos Security Support
  • Support for Windows Platform
  • Ambari Integration for Management
Phase 3
  • Advanced Pipeline Management Dashboard
  • Centralized Audit & Lineage
  • Dataset Lineage
  • Improved User Interface
  • Replicate to Cloud: Azure & S3
  • Hive/HCat Metastore Replication
  • HDFS Snapshots & Hive ACID Support
Phase 4
  • Visual Pipeline Design
  • File Import: SSH & SCP