Define and Process Data Pipelines in Hadoop with Apache Falcon

Apache Falcon is a framework for simplifying data governance and pipeline processing


Apache Falcon simplifies the configuration of data motion by providing replication, lifecycle management, lineage, and traceability. This brings consistent data governance across Hadoop components.


In this tutorial we will walk through a scenario where email data lands hourly on a cluster. In our example:

  • This cluster is the primary cluster located in the Oregon data center.
  • Data arrives from all the West Coast production servers. The input data feeds are often late by up to 4 hours.

The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.

To simulate this scenario, we have a pig script that grabs the freely available Enron emails from the internet and feeds them into the pipeline.


Prerequisites

  • A cluster with Apache Hadoop 2 configured
  • A cluster with Apache Falcon configured

The easiest way to meet the above prerequisites is to download the HDP Sandbox.

After downloading the environment, confirm that Apache Falcon is running. Below are the steps to validate this:

  1. If Ambari is not enabled on your Sandbox, enable it first.
  2. Once Ambari is enabled, navigate to the Ambari UI and log in with the username admin and password admin. Then check whether Falcon is running.
  3. If Falcon is not running, start Falcon from Ambari.


Steps for the Scenario

  1. Create cluster specification XML file
  2. Create feed (aka dataset) specification XML file
    • Reference cluster specification
  3. Create the process specification XML file
    • Reference cluster specification – defines where the process runs
    • Reference feed specification – defines the datasets that the process manipulates

We have already created the necessary XML files. In this step we are going to download the specifications and use them to define the data pipeline and submit the entities to Falcon.

Staging the components of the App on HDFS

In this step we will stage the pig script and the necessary folder structure for inbound and outbound feeds on HDFS:

First, download the zip file to your local host machine.


Using your browser, navigate to the Hue – File Browser interface to explore HDFS.

Navigate to the /user/ambari-qa folder.

Now we will upload the zip file we just downloaded.


This should also unzip the file and create a folder structure with a folder called falcon.

Staging the specifications

SSH into the VM:

ssh root@ -p 2222;

The password is hadoop

From the SSH session, first we will change our user to ambari-qa. Type:

su ambari-qa

Go to the user’s home directory:

cd ~

Download the cluster, feed, and process definitions:


Unzip the file:

unzip ./

Change directory to the newly created folder:

cd falconChurnDemo/

Submit the entities to Falcon:

Cluster Specification

There is one cluster specification per cluster.

See below for a sample cluster specification file.
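
As a reference, here is a sketch of what a cluster entity definition looks like. The hostnames, ports, and version numbers are Sandbox-style placeholders and are only illustrative; the oregonCluster.xml you download is the authoritative version.

<?xml version="1.0"?>
<!-- Sketch of a Falcon cluster entity; endpoints and versions are illustrative placeholders -->
<cluster colo="USWestOregon" description="oregonHadoopCluster" name="primaryCluster" xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <!-- read-only endpoint used for replication reads -->
        <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
        <!-- HDFS write endpoint -->
        <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
        <!-- YARN ResourceManager used for job execution -->
        <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
        <!-- Oozie, which Falcon uses as its workflow engine -->
        <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
        <!-- ActiveMQ endpoint for JMS notifications -->
        <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/primaryCluster/working"/>
    </locations>
</cluster>

Note that the staging and working locations referenced here must already exist on HDFS with appropriate permissions before the entity is submitted (see the comments at the end of this page for the exact hdfs dfs commands).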

Back to our scenario, let’s submit the ‘oregon cluster’ entity to Falcon. This signifies the primary Hadoop cluster located in the Oregon data center.

falcon entity -type cluster -submit -file oregonCluster.xml

Then let’s submit the ‘virginia cluster’ entity to Falcon. This signifies the backup Hadoop cluster located in the Virginia data center.

falcon entity -type cluster -submit -file virginiaCluster.xml

If you view the XML files, you will see how the cluster location and purpose are captured in each one.

Feed Specification

A feed (a.k.a dataset) signifies a location of data and its associated replication policy and late arrival cut-off time.

See below for a sample feed (a.k.a dataset) specification file.
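
As an illustration, a feed entity along the lines of rawEmailFeed.xml looks roughly like the sketch below. The group name, dates, retention limit, and HDFS paths are illustrative; treat the file you downloaded as the authoritative version.

<?xml version="1.0"?>
<!-- Sketch of a Falcon feed entity; dates, retention, and paths are illustrative -->
<feed description="Raw customer email feed" name="rawEmailFeed" xmlns="uri:falcon:feed:0.1">
    <groups>churnAnalysisDataPipeline</groups>
    <!-- a new instance of this dataset is expected every hour -->
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <!-- input may arrive up to 4 hours late, matching our scenario -->
    <late-arrival cut-off="hours(4)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
            <!-- instances are deleted once they fall outside the retention window -->
            <retention limit="days(90)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <!-- hourly partitioned landing directory for the raw emails -->
        <location type="data" path="/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/"/>
        <location type="meta" path="/"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>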

Back to our scenario, let’s submit the source of the raw email feed. This feed signifies the raw emails that are being downloaded into the Hadoop cluster. These emails will be used by the email cleansing process.

falcon entity -type feed -submit -file rawEmailFeed.xml

Now let’s define the feed entity which will handle the end of the pipeline and store the cleansed emails. This feed signifies the emails produced by the cleanse email process. It also takes care of replicating the cleansed email dataset to the backup cluster (the virginia cluster).

falcon entity -type feed -submit -file cleansedEmailFeed.xml
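
The replication in cleansedEmailFeed.xml is driven by its clusters section, which lists the primary cluster as the source and the backup cluster as the target. A rough sketch of that section follows; the cluster names, dates, and target path are illustrative, so check your downloaded file for the real values.

<!-- Excerpt: the <clusters> block inside a feed definition such as cleansedEmailFeed.xml (values illustrative) -->
<clusters>
    <!-- where the cleansed emails are produced -->
    <cluster name="primaryCluster" type="source">
        <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
        <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- where Falcon replicates them for backup -->
    <cluster name="backupCluster" type="target">
        <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
        <retention limit="days(90)" action="delete"/>
        <locations>
            <location type="data" path="/falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        </locations>
    </cluster>
</clusters>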


Process Specification

A process defines the configuration for a workflow. A workflow is a directed acyclic graph (DAG) that defines the job for the workflow engine. A process definition specifies the configuration required to run the workflow job: for example, the frequency at which the workflow should run, the clusters on which it should run, its inputs and outputs, how workflow failures should be handled, how late inputs should be handled, and so on.

Here is an example of what a process specification looks like:
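
This sketch follows the lines of the ingest process used in this scenario. The workflow name, path, and dates are illustrative assumptions; the emailIngestProcess.xml you downloaded is the authoritative version.

<?xml version="1.0"?>
<!-- Sketch of a Falcon process entity; workflow path and dates are illustrative -->
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
    <clusters>
        <!-- run on the primary (Oregon) cluster -->
        <cluster name="primaryCluster">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
        </cluster>
    </clusters>
    <!-- one instance at a time, processed in arrival order, once an hour -->
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <!-- each run produces one instance of the rawEmailFeed dataset -->
    <outputs>
        <output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
    </outputs>
    <!-- the Oozie workflow that downloads the raw emails -->
    <workflow name="emailIngestWorkflow" version="2.0.0" engine="oozie" path="/user/ambari-qa/falcon/demo/apps/ingest/fs"/>
    <retry policy="periodic" delay="minutes(15)" attempts="3"/>
</process>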

Back to our scenario, let’s submit the ingest and the cleanse process respectively:

The ingest process is responsible for calling the Oozie workflow that downloads the raw emails from the web into the primary Hadoop cluster, under the location specified in rawEmailFeed.xml. It also takes care of handling late data arrivals.

falcon entity -type process -submit -file emailIngestProcess.xml

The cleanse process is responsible for calling the pig script that cleans the raw emails and produces the clean emails, which are then replicated to the backup Hadoop cluster.

falcon entity -type process -submit -file cleanseEmailProcess.xml
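
For comparison, the cleanse process consumes the raw feed and produces the cleansed feed, and its workflow points at the pig script rather than an Oozie application. The sketch below is only an approximation: the workflow name, the script filename emailCleanse.pig, and the dates are hypothetical placeholders, so refer to cleanseEmailProcess.xml for the real values.

<?xml version="1.0"?>
<!-- Sketch of the cleanse process; script path, names, and dates are hypothetical -->
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <!-- consumes the raw emails ... -->
    <inputs>
        <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
    </inputs>
    <!-- ... and produces the cleansed emails, which the feed entity then replicates -->
    <outputs>
        <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
    </outputs>
    <!-- the pig script that strips sensitive fields such as credit card numbers -->
    <workflow name="emailCleanseWorkflow" version="pig-0.13.0" engine="pig" path="/user/ambari-qa/falcon/demo/apps/pig/emailCleanse.pig"/>
    <retry policy="periodic" delay="minutes(15)" attempts="3"/>
</process>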

Schedule the Falcon entities

All that is left now is to schedule the feeds and processes to get the pipeline going.

Ingest the feed

falcon entity -type feed -schedule -name rawEmailFeed

falcon entity -type process -schedule -name rawEmailIngestProcess

Cleanse the emails

falcon entity -type feed -schedule -name cleansedEmailFeed

falcon entity -type process -schedule -name cleanseEmailProcess


In a few seconds you should notice that Falcon has started ingesting files from the internet and dumping them into new folders on HDFS.


In a couple of minutes you should notice a new folder called processed, under which the files processed through the data pipeline are emitted.


We just created an end-to-end data pipeline to process data. The power of the Apache Falcon framework is its flexibility to work with pretty much any open-source or proprietary data processing product out there.


Comments

September 26, 2014 at 12:43 pm

Very fascinating, quite promising and spectacular!!

Rizwan Mian
December 3, 2014 at 12:36 pm

Works as it says on the tin.

    Kiran Jilla
    July 16, 2015 at 8:58 am

    I need some help with Falcon. I set up Falcon without errors, but I don’t see the processing going as suggested. If you don’t mind, can you share some details on this?

Snehil Suresh Wakchaure
January 17, 2015 at 9:09 pm

Thank you so much! This is a very nice introductory tutorial!

April 16, 2015 at 10:21 pm

The tutorial is missing the following steps for creating the working/staging HDFS directories. Make sure that you manually create these needed directories, or create another “prepare-step” shell script to create them.
hdfs dfs -mkdir -p /apps/falcon/primaryCluster/staging
hdfs dfs -mkdir -p /apps/falcon/primaryCluster/working
hdfs dfs -mkdir -p /apps/falcon/backupCluster/working
hdfs dfs -mkdir -p /apps/falcon/backupCluster/staging

April 19, 2015 at 9:32 pm

After running it successfully, I tried to figure out how to kill the workflow jobs, since they run periodically according to the specifications that Falcon scheduling submitted to Oozie. You can’t kill those jobs using the Oozie web UI; if you do, you will get something like “… HTTP authentication error”. You have to go back to the shell (as ambari-qa) and use the commands below to kill the “scheduled” jobs in Oozie. You can, however, leave these entities up in Falcon for later reuse or to re-run the demo.

falcon instance -type process -name rawEmailIngestProcess -kill -start "2014-02-28T00:00Z" -end "2016-03-31T00:00Z"
falcon instance -type process -name cleanseEmailProcess -kill -start "2014-02-28T00:00Z" -end "2016-03-31T00:00Z"
falcon instance -type feed -name rawEmailFeed -kill -start "2014-02-28T00:00Z" -end "2016-03-31T00:00Z"
falcon instance -type feed -name cleansedEmailFeed -kill -start "2014-02-28T00:00Z" -end "2016-03-31T00:00Z"

Just be careful with the start and end time format, since it is very picky about it. Otherwise, it will throw back an exception with something like “… not a valid UTC string” (in fact, it should say not a valid ISO 8601 string, since UTC is not a format).

Also, you can use the REST API for doing the above, but I didn’t try it yet.

jaipal reddy
June 17, 2015 at 10:15 pm

How do I configure Falcon to send a simple email when a workflow starts and ends?

    Andrew Ahn
    July 28, 2015 at 9:22 am

    The default notification is via JMS messages. Email notification is available for HDFS mirroring and Hive DR entity creation. Generic support for feeds and processes will be addressed in the next maintenance release.

July 7, 2015 at 7:03 pm

What’s the username/password for the Falcon Web UI?

July 27, 2015 at 3:49 pm

Does not work with the HDP 2.3 sandbox.

    August 7, 2015 at 4:56 am

    I just tested this tutorial in the HDP 2.3 sandbox and it works.
    However, you have to do a few additional things in terms of directory creation and permissions. Before you submit the entities, you need to create the following directories and set the following permissions.
    Log in to the sandbox and change to the falcon user:
    su falcon

    Then create the following directories:
    hdfs dfs -mkdir -p /apps/falcon/primaryCluster/staging
    hdfs dfs -mkdir -p /apps/falcon/primaryCluster/working
    hdfs dfs -mkdir -p /apps/falcon/backupCluster/working
    hdfs dfs -mkdir -p /apps/falcon/backupCluster/staging

    Then change the permissions using the following commands:

    hdfs dfs -chmod 777 /apps/falcon/primaryCluster/staging
    hdfs dfs -chmod 755 /apps/falcon/primaryCluster/working
    hdfs dfs -chmod 777 /apps/falcon/backupCluster/staging
    hdfs dfs -chmod 755 /apps/falcon/backupCluster/working

    The rest of the tutorial is perfect.

Andrew Ahn
July 28, 2015 at 9:06 am

Not correct. Please verify the service is running through Ambari; it is off by default in the sandbox to reduce VM load. The steps will be updated to use the new UI, but this current tutorial is still valid.

    Andrew Ahn
    July 28, 2015 at 9:59 am

    To access Ambari from the Sandbox, first confirm the VM is running, then:

    1) Log in to the Ambari UI with user: admin / pass: admin
    2) Select Falcon in the left margin and enable the Falcon server and client service.

August 20, 2015 at 12:09 am

Pretty good data governance tool, but the Falcon Web UI does not reveal much about the pipeline path and lineage. Does anyone know where to look for the graphical tree?
