Data Governance & Integration

Simplified Data Processing for Enterprise Hadoop

Hand-coding data processing pipelines for Hadoop can be tedious and time-consuming. A processing application needs to handle the data transformation logic, the replication logic, and the retention logic, not to mention the orchestration, scheduling, and retry logic across workflows. Often, pipeline processing involves datasets that span clusters and sometimes even data centers, which adds further complexity.

The solution to this problem goes beyond providing a simple SDK or a new Java library. Certainly, those items can help to improve developer efficiency when writing MapReduce code. But we believe in tackling the Hadoop pipeline challenge in a way that promotes reuse and consistency. This requires a more declarative approach. And any solution must work with components of Hadoop that are already known and trusted.

Our objective is to provide a data governance solution centered around Apache Falcon that makes it easier to build and automate the execution of complex pipelines. Falcon enforces reuse and consistency at its core to enable tracing and data provenance. And while Falcon leverages the existing components of Hadoop (such as Apache Sqoop and Apache Flume for data integration), it is also flexible enough to support new ecosystem projects in the future.
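To illustrate this declarative approach: in Falcon, a pipeline step is described as a process entity rather than hand-coded orchestration logic. The sketch below is a minimal, hedged example; the entity names, workflow path, and validity dates are illustrative, not part of any shipped configuration.

```xml
<!-- Minimal Falcon process entity (illustrative names, paths, and dates). -->
<!-- Scheduling, ordering, and retry are declared, not hand-coded:         -->
<!-- Falcon translates the declaration into Oozie workflows.               -->
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>          <!-- one instance at a time -->
  <order>FIFO</order>             <!-- run instances in arrival order -->
  <frequency>hours(1)</frequency> <!-- schedule: every hour -->
  <inputs>
    <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
  </outputs>
  <!-- the transformation logic itself lives in an ordinary Pig script -->
  <workflow name="emailCleanse" engine="pig" path="/apps/pig/cleanse.pig"/>
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
</process>
```

Note how the retry policy, ordering, and schedule from the paragraph above each become a single declared element instead of custom code in every application.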

Initiative Goals

Provide data processing “building blocks” for describing data pipelines. Define a dataset and process once, use it again and again.
Automate processing across datasets, clusters and data centers and handle process orchestration and scheduling in a consistent way.
Follow a dataset path through processing pipelines and across clusters. Consistency and reuse promote traceability.
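The "define once, use again and again" goal maps to Falcon's feed entity, which declares a dataset's location, frequency, and retention in one place; every process that consumes the dataset then refers to it by name. A minimal sketch follows, with the cluster name, paths, and dates as illustrative assumptions:

```xml
<!-- Minimal Falcon feed entity describing a dataset (illustrative values). -->
<feed name="rawEmailFeed" description="Raw email dataset"
      xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>  <!-- a new instance arrives hourly -->
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
      <!-- retention is declared once here, not coded into each job -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <!-- HDFS layout, parameterized by instance time -->
    <location type="data" path="/data/email/raw/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="falcon" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Once registered (for example, with the Falcon CLI: `falcon entity -type feed -submit -file rawEmailFeed.xml`), the same feed definition can be referenced by any number of processes, and adding a second cluster entry with `type="target"` is how replication between clusters is typically declared.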

Already Delivered

The team at InMobi and engineers from Hortonworks initiated the Apache Falcon incubation project in April 2013. Since then, Hortonworks has worked with InMobi and the community to make Falcon a deeply integrated component of Hadoop.

Apache Falcon is a fully certified component of HDP that provides centralized monitoring of data pipelines.

The Apache Falcon community has already delivered these features:

  • Support for Kerberos clusters
  • Support for both Ubuntu and Windows platforms
  • Integration with Apache Ambari for installation, management and monitoring

Coming Next

Apache Falcon version 0.5 will capture data pipeline lineage information and provide access to it through the user interface and API. It will also allow users to throttle the bandwidth used by data replication jobs and to monitor data pipelines more closely.

Future releases will include data pipeline audits, giving cluster administrators information about who modified a dataset and when. The community will also extend data pipeline lineage, which will help to analyze how a dataset reached a particular state.

Essential Timeline

Phase 1 (Q4 2013)
• Incubate Apache Falcon
• Dataset Replication
• Dataset Retention
• Falcon Tech Preview

Phase 2 (Delivered in HDP 2.1)
• Basic Pipeline Dashboard
• Kerberos Security Support
• Support for Windows Platform
• Ambari Integration for Management

Phase 3
• Advanced Pipeline Management Dashboard
• Centralized Audit & Lineage
• Dataset Lineage
• Improved User Interface
• Replicate to Cloud: Azure & S3
• Hive/HCat Metastore Replication
• HDFS Snapshots & Hive ACID Support

Phase 4
• Visual Pipeline Design
• File Import: SSH & SCP

