Data Governance & Integration

Simplified Data Processing for Enterprise Hadoop

Hand-coding data processing pipelines for Hadoop can be tedious and time-consuming. A processing application needs to handle the data transformation logic, the replication logic and the retention logic, not to mention the orchestration, scheduling and retry logic across workflows. Pipeline processing often involves datasets that span clusters, and sometimes even data centers, which adds to the complexity.

The solution to this problem goes beyond providing a simple SDK or a new Java library. Certainly, those items can help to improve developer efficiency when writing MapReduce code. But we believe in tackling the Hadoop pipeline challenge in a way that promotes reuse and consistency. This requires a more declarative approach. And any solution must work with components of Hadoop that are already known and trusted.

Our objective is to provide a data governance solution centered on Apache Falcon that makes it easier to build and automate the execution of complex pipelines. Falcon puts reuse and consistency at its core, which in turn enables tracing and data provenance. And while Falcon leverages existing components of Hadoop (such as Apache Sqoop and Apache Flume for data integration), it is also flexible enough to support new ecosystem projects in the future.
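
As a rough illustration of this declarative style, the sketch below shows what a Falcon feed (dataset) entity can look like: a retention policy on the source cluster and replication to a target cluster, all described in one definition rather than hand-written code. The cluster names, paths, dates and retention periods are illustrative assumptions, not values from this page; the authoritative entity schema ships with the Falcon documentation.

    <!-- Hypothetical hourly click-stream feed: kept 90 days on the source
         cluster and replicated to a backup cluster, where it is kept 36 months. -->
    <feed name="clicksFeed" description="hourly click stream" xmlns="uri:falcon:feed:0.1">
        <frequency>hours(1)</frequency>
        <timezone>UTC</timezone>
        <clusters>
            <cluster name="primaryCluster" type="source">
                <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
                <retention limit="days(90)" action="delete"/>
            </cluster>
            <cluster name="backupCluster" type="target">
                <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
                <retention limit="months(36)" action="delete"/>
            </cluster>
        </clusters>
        <locations>
            <!-- Dated directory layout; Falcon resolves the variables per feed instance. -->
            <location type="data" path="/data/clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
        </locations>
        <ACL owner="etl-user" group="etl" permission="0755"/>
        <schema location="/none" provider="none"/>
    </feed>

Once a definition like this is submitted to Falcon (for example via the falcon command-line client), Falcon generates and schedules the underlying replication and eviction jobs, so that logic no longer has to be hand-coded into each application.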

Initiative Goals

Modular
Provide data processing “building blocks” for describing data pipelines. Define a dataset and process once, then use it again and again (see the process sketch after this list).
Automated
Automate processing across datasets, clusters and data centers and handle process orchestration and scheduling in a consistent way.
Traceable
Follow a dataset path through processing pipelines and across clusters. Consistency and reuse promote traceability.
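
To make the “define once, use again and again” idea concrete, here is a hedged sketch of a process entity that consumes the feed sketched above, runs an existing Oozie workflow every hour on the primary cluster, and retries on failure. The entity, feed and workflow names, the paths and the schedule are assumptions for illustration (including the output feed sessionsFeed, which would be defined the same way as the feed above).

    <!-- Hypothetical hourly process wired to the clicksFeed defined earlier. -->
    <process name="clickSessionization" xmlns="uri:falcon:process:0.1">
        <clusters>
            <cluster name="primaryCluster">
                <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            </cluster>
        </clusters>
        <parallel>1</parallel>
        <order>FIFO</order>
        <frequency>hours(1)</frequency>
        <timezone>UTC</timezone>
        <inputs>
            <!-- Reuse the feed by name; no paths are repeated here. -->
            <input name="clicks" feed="clicksFeed" start="now(0,0)" end="now(0,0)"/>
        </inputs>
        <outputs>
            <output name="sessions" feed="sessionsFeed" instance="now(0,0)"/>
        </outputs>
        <!-- Orchestration: Falcon hands the actual work to an Oozie workflow. -->
        <workflow engine="oozie" path="/apps/clicks/sessionize-workflow"/>
        <retry policy="periodic" delay="minutes(10)" attempts="3"/>
    </process>

Because processes refer to feeds and clusters only by name, the same dataset definition can be shared by many pipelines, which is what keeps scheduling, retries and traceability consistent across them.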

Status

The team at InMobi and engineers from Hortonworks initiated the Apache Falcon incubation project in April 2013. Since then, Hortonworks has worked with InMobi and the community to make Falcon a deeply integrated component of Hadoop.

Beginning with Hortonworks Data Platform (HDP) version 2.1, Apache Falcon is a fully certified component of HDP, providing centralized monitoring of data pipelines.

Apache Falcon version 0.5 will capture data pipeline lineage information and provide access to it through the user interface and API. It will also allow users to throttle bandwidth used by data replication jobs and more closely monitor data pipelines.

Future releases will include data pipeline audits, providing cluster administrators with information about who modified a dataset and when. The community will also continue to build out data pipeline lineage, which helps analyze how a dataset reached a particular state.

Essential Timeline

Phase 1 (Q4 2013)
  • Incubate Apache Falcon
  • Dataset Replication
  • Dataset Retention
  • Falcon Tech Preview
Phase 2 (Preview Available: Falcon 0.5.0 in HDP 2.1)
  • Basic Pipeline Dashboard
  • Kerberos Security Support
  • Hive/HCatalog Integration
  • Ambari Integration for Management
Phase 3 (Future)
  • Advanced Pipeline Management Dashboard
  • Audit
  • Dataset Lineage
  • Data Tagging
  • File Import – SSH & SCP

