Hand coding data processing pipelines for Hadoop can be very involved. A processing application needs to handle the data transformation logic, the replication logic, and the retention logic, not to mention the orchestration, scheduling, and retry logic across workflows. Adding further complexity, pipelines often involve datasets that span clusters and sometimes even data centers.
Solving this problem goes beyond providing a simple SDK or a new Java library. Such tools can certainly improve developer efficiency when writing MapReduce code, but we believe that tackling the Hadoop pipeline challenge in a way that promotes reuse and consistency requires a more declarative approach. And any solution must work with the already known and trusted components of Hadoop.
Our goal is to provide a data processing solution, centered on Apache Falcon, that makes it easier to build and automate the execution of complex pipelines. Falcon builds reuse and consistency into its core to enable tracing and data provenance, and while it leverages the existing components of Hadoop, it is also flexible enough to support new ecosystem projects.
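To make the declarative idea concrete, here is a minimal sketch of a Falcon feed entity. The feed name, cluster names, and paths are hypothetical, and the cluster entities `primary-cluster` and `backup-cluster` are assumed to have been defined and submitted to Falcon separately. A single XML definition declares both cross-cluster replication (source and target clusters) and per-cluster retention, concerns that would otherwise be hand coded into each application.

```xml
<!-- rawClicksFeed.xml: a hypothetical hourly feed. The cluster entities
     "primary-cluster" and "backup-cluster" are assumed to be defined
     and submitted to Falcon separately. -->
<feed name="rawClicksFeed" description="raw click logs" xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency>
    <clusters>
        <!-- Source cluster: keep hourly instances for 90 days, then evict. -->
        <cluster name="primary-cluster" type="source">
            <validity start="2013-11-01T00:00Z" end="2014-12-31T00:00Z"/>
            <retention limit="days(90)" action="delete"/>
        </cluster>
        <!-- Target cluster: Falcon replicates each instance here
             and retains it for 36 months. -->
        <cluster name="backup-cluster" type="target">
            <validity start="2013-11-01T00:00Z" end="2014-12-31T00:00Z"/>
            <retention limit="months(36)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/data/clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
    </locations>
    <ACL owner="etl-user" group="etl" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
```

With the cluster entities in place, this definition can be submitted and scheduled from the Falcon CLI with `falcon entity -type feed -submit -file rawClicksFeed.xml` followed by `falcon entity -type feed -schedule -name rawClicksFeed`; Falcon then generates and manages the underlying Oozie jobs for replication and eviction.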
The Apache Falcon incubation project was initiated by the team at InMobi together with engineers from Hortonworks in April 2013. Hortonworks is committed to working with InMobi and the community to make Falcon a deeply integrated component of Hadoop. Hortonworks is making a Falcon Technical Preview available here with the goal of including a fully certified version of Apache Falcon with Hortonworks Data Platform in Q1 of 2014.
The Falcon roadmap to date and looking ahead:

- Incubate Apache Falcon
  - Dataset replication
  - Dataset retention
- Falcon Tech Preview
  - Hive/HCatalog integration
  - Basic dashboard for entity viewing
  - Kerberos security support
- Planned
  - Ambari integration for management
  - Advanced dashboard for pipeline building
  - Dataset lineage
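The orchestration, scheduling, and retry logic called out at the start of this post becomes declarative in the same way. Below is a comparable sketch of a Falcon process entity that runs over the feed defined above; the process name, workflow path, and output feed are again hypothetical, and the transformation logic itself lives in an existing Oozie workflow.

```xml
<!-- cleanse-clicks.xml: a hypothetical hourly process that consumes
     the feed defined above and writes a (separately defined) output feed. -->
<process name="cleanse-clicks" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="primary-cluster">
            <validity start="2013-11-01T00:00Z" end="2014-12-31T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>          <!-- run one instance at a time -->
    <order>FIFO</order>             <!-- process backlogged instances oldest first -->
    <frequency>hours(1)</frequency> <!-- scheduling: run hourly -->
    <inputs>
        <!-- Wait for the matching hourly instance of the input feed. -->
        <input name="rawClicks" feed="rawClicksFeed" start="now(0,0)" end="now(0,0)"/>
    </inputs>
    <outputs>
        <output name="cleansedClicks" feed="cleansedClicksFeed" instance="now(0,0)"/>
    </outputs>
    <!-- The transformation stays in a normal Oozie workflow. -->
    <workflow engine="oozie" path="/apps/clickstream/cleanse-workflow"/>
    <!-- Retry a failed instance every 15 minutes, up to 3 attempts. -->
    <retry policy="periodic" delay="minutes(15)" attempts="3"/>
</process>
```

Because inputs and outputs are named feeds rather than raw paths, Falcon can wire processes together into larger pipelines and, per the roadmap above, eventually trace lineage across them.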