Data Governance & Integration
Hand-coding data processing pipelines for Hadoop can be tedious and time-consuming. A processing application needs to handle the data transformation logic, the replication logic, and the retention logic, not to mention the orchestration, scheduling, and retry logic across workflows. Pipeline processing often involves datasets that span clusters and sometimes even data centers, which adds to the complexity.
The solution to this problem goes beyond providing a simple SDK or a new Java library. Certainly, those items can help to improve developer efficiency when writing MapReduce code. But we believe in tackling the Hadoop pipeline challenge in a way that promotes reuse and consistency. This requires a more declarative approach. And any solution must work with components of Hadoop that are already known and trusted.
Our objective is to provide a data governance solution centered around Apache Falcon that makes it easier to build and automate the execution of complex pipelines. Falcon enforces reuse and consistency at its core to enable tracing and data provenance. And while Falcon leverages the existing components of Hadoop (such as Apache Sqoop and Apache Flume for data integration), it is also flexible enough to support new ecosystem projects in the future.
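To make the contrast between hand-coded and declarative pipelines concrete, here is a minimal Python sketch. It is not Falcon's entity format or API: the `DatasetSpec` class, its fields, the helper functions, and the cluster name `backup-cluster` are all hypothetical, chosen only to illustrate the idea. The point is that retention and replication are stated once as data, and generic code (which every pipeline can reuse) derives the work to be done.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical declarative description of a dataset (not Falcon's schema)."""
    name: str
    retention: timedelta      # how long each dataset instance is kept
    replicate_to: tuple = ()  # names of target clusters (illustrative only)


def expired_instances(spec, instances, now):
    """Return the instance timestamps older than the declared retention window."""
    cutoff = now - spec.retention
    return [ts for ts in instances if ts < cutoff]


def replication_plan(spec, instances, now):
    """Pair every retained instance with every declared target cluster."""
    cutoff = now - spec.retention
    return [
        (ts, cluster)
        for ts in instances
        if ts >= cutoff
        for cluster in spec.replicate_to
    ]


# The pipeline author declares *what* should hold; no per-pipeline logic is written.
clicks = DatasetSpec(
    name="clicks",
    retention=timedelta(days=2),
    replicate_to=("backup-cluster",),
)
```

Because the policy lives in the declaration rather than in each job, the same two helper functions could serve any number of datasets, which is the kind of reuse and consistency the declarative approach is meant to promote.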
The team at InMobi and engineers from Hortonworks initiated the Apache Falcon incubation project in April 2013. Since then, Hortonworks has worked with InMobi and the community to make Falcon a deeply integrated component of Hadoop.
Apache Falcon is a fully certified component of HDP that provides centralized monitoring of data pipelines.
The Apache Falcon community has already delivered these features:
Apache Falcon version 0.5 will capture data pipeline lineage information and provide access to it through the user interface and API. It will also allow users to throttle bandwidth used by data replication jobs and more closely monitor data pipelines.
Future releases will include data pipeline audits, providing cluster administrators information about who modified a dataset and when. The community will also add data pipeline lineage, which will help to analyze how a dataset reached a particular state.
- Incubate Apache Falcon
- Dataset Replication
- Dataset Retention
- Falcon Tech Preview
- Basic Pipeline Dashboard
- Kerberos Security Support
- Support for Windows Platform
- Ambari Integration for Management
- Advanced Pipeline Management Dashboard
- Centralized Audit & Lineage
- Dataset Lineage
- Improved User Interface
- Replicate to Cloud: Azure & S3
- Hive/HCat Metastore Replication
- HDFS Snapshots & Hive ACID Support
- Visual Pipeline Design
- File Import: SSH & SCP