Introduction

Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters.

It makes it much simpler to onboard new workflows/pipelines, with support for late data handling and retry policies. It allows you to easily define relationships between various data and processing elements and to integrate with a metastore/catalog such as Hive/HCatalog. Finally, it lets you capture lineage information for feeds and processes. In this tutorial we are going to walk through the process of:

  • Defining the feeds and processes
  • Defining and executing a job to mirror data between two clusters
  • Defining and executing a data pipeline to ingest, process and persist data continuously

Prerequisite

Once you have downloaded the Hortonworks Sandbox and started the VM, navigate to the Ambari interface on port 8080 of your Sandbox VM's IP address. Log in with the username admin and the password you set when you first changed it. You should see a screen similar to the one below:
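For example, if your Sandbox VM forwards its ports to your local machine (a common setup for the VirtualBox image), Ambari would be reachable at a URL similar to:

http://127.0.0.1:8080

Substitute the actual IP address of your Sandbox VM if it differs.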


Scenario

In this tutorial, we will walk through a scenario where email data lands hourly on a cluster. In our example:

  • This cluster is the primary cluster located in the Oregon data center.
  • Data arrives from all the West Coast production servers. The input data feeds often arrive up to 4 hours late.

The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.

To simulate this scenario, we have a Pig script that grabs the freely available Enron emails from the internet and feeds them into the pipeline.

Start Falcon

By default, Falcon is not started on the Sandbox. To start it, click on the Falcon icon in the left-hand bar:

Then click on the Service Actions button on the top right:

Then click on Start:

Once Falcon starts, Ambari should clearly indicate that the service has started, as below:

Download and stage the dataset

Now let's stage the dataset using the command line. Although we perform many of the file operations below on the command line, you can also do the same with the HDFS Files View in Ambari.

First, open a shell with your preferred SSH client. For this tutorial, we will SSH into the Hortonworks Sandbox with the command:

ssh root@127.0.0.1 -p 2222

The default password is hadoop

Then log in as the user hdfs:

su - hdfs

Then download the file falcon.zip with the following command:

wget http://hortonassets.s3.amazonaws.com/tutorial/falcon/falcon.zip

and then unzip with the command

unzip falcon.zip

Now let’s give ourselves permission to upload files

hadoop fs -chmod -R 777 /user/ambari-qa

Then let's create a folder falcon under ambari-qa with the command

hadoop fs -mkdir /user/ambari-qa/falcon

Now let’s upload the decompressed folder with the command

hadoop fs -copyFromLocal demo /user/ambari-qa/falcon/
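To confirm the upload worked, you can list the directory; the demo folder and its contents should appear:

hadoop fs -ls /user/ambari-qa/falcon/demo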

Create the cluster entities

Before creating the cluster entities, we need to create the directories on HDFS representing the two clusters that we are going to define, namely primaryCluster and backupCluster.

Use hadoop fs -mkdir commands to create the /apps/falcon/primaryCluster and /apps/falcon/backupCluster directories on HDFS.

hadoop fs -mkdir /apps/falcon/primaryCluster
hadoop fs -mkdir /apps/falcon/backupCluster

Further, create a directory called staging inside each of the directories we created above:

hadoop fs -mkdir /apps/falcon/primaryCluster/staging
hadoop fs -mkdir /apps/falcon/backupCluster/staging

Next we will need to create the working directories for primaryCluster and backupCluster

hadoop fs -mkdir /apps/falcon/primaryCluster/working
hadoop fs -mkdir /apps/falcon/backupCluster/working

Finally you need to set the proper permissions on the staging/working directories:

hadoop fs -chmod 777 /apps/falcon/primaryCluster/staging
hadoop fs -chmod 755 /apps/falcon/primaryCluster/working
hadoop fs -chmod 777 /apps/falcon/backupCluster/staging
hadoop fs -chmod 755 /apps/falcon/backupCluster/working
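You can verify the directory layout and permissions with a recursive listing:

hadoop fs -ls -R /apps/falcon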

Let's open the Falcon Web UI. You can easily launch the Falcon Web UI from Ambari:

You can also navigate to the Falcon Web UI directly in your browser. The Falcon UI listens on port 15000 by default. The default username is ambari-qa.
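On the Sandbox, that typically means a URL similar to:

http://sandbox.hortonworks.com:15000/

(or your Sandbox's IP address with port 15000, if the hostname is not mapped on your machine).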

This UI allows us to create and manage the various entities like Cluster, Feed, Process and Mirror. Each of these entities is represented by an XML file, which you can either upload directly or generate by filling in the various fields.

You can also search for existing entities and then edit, change state, etc.

Let’s first create a couple of cluster entities. To create a cluster entity click on the Cluster button on the top.

A cluster entity defines the default access points for various resources on the cluster as well as default working directories to be used by Falcon jobs.

To define a cluster entity, we must specify a unique name by which we can identify the cluster. In this tutorial, we use:

primaryCluster

Next enter a data center name or location of the cluster and a description for the cluster. The data center name can be used by Falcon to improve performance of jobs that run locally or across data centers.

All entities defined in Falcon can be grouped and located using tags. To clearly identify and locate entities, we assign the tag:

EntityType

With the value:

Cluster

We then need to specify the owner and permissions for the cluster. So we enter:

Owner:  ambari-qa
Group: users
Permissions: 755

Next, we enter the URIs for the various resources Falcon requires to manage data on the clusters. These include the NameNode dfs.http.address, the NameNode IPC address used for filesystem metadata operations, the YARN client IPC address used for executing jobs on YARN, the Oozie address used for running Falcon Feeds and Processes, and the Falcon messaging address. The values we will use are the defaults for the Hortonworks Sandbox; if you run this tutorial on your own test cluster, modify the addresses to match those defined in Ambari:

Readonly hftp://sandbox.hortonworks.com:50070
Write hdfs://sandbox.hortonworks.com:8020
Execute sandbox.hortonworks.com:8050
Workflow http://sandbox.hortonworks.com:11000/oozie/
Messaging tcp://sandbox.hortonworks.com:61616?daemon=true

The versions are not used and will be removed in the next version of the Falcon UI.

You can also override cluster properties for a specific cluster. This can be useful for test or backup clusters which may have different physical configurations. In this tutorial, we’ll just use the properties defined in Ambari.

After the resources are defined, you must define default staging, temporary and working directories for use by Falcon jobs based on the HDFS directories created earlier in the tutorial. These can be overridden by specific jobs, but will be used in the event no directories are defined at the job level. In the current version of the UI, these directories must exist, be owned by falcon, and have the proper permissions.

Staging  /apps/falcon/primaryCluster/staging
Temp /tmp
Working /apps/falcon/primaryCluster/working

Once you have verified that the values are correct, press Next.

Click Save to persist the entity.

You should receive a notification that the operation was successful.
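If you prefer working from the shell, the same cluster entity can also be written as an XML file and submitted with the Falcon CLI. The sketch below is illustrative only: it assumes the falcon client is available on the Sandbox and can reach the Falcon server (you may need to pass -url), and values such as the colo name and interface versions are placeholders; the exact schema elements may vary slightly between Falcon versions.

# Write the cluster definition to a local file
cat > /tmp/primaryCluster.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<cluster colo="USWestOregon" description="Primary cluster in the Oregon data center"
         name="primaryCluster" xmlns="uri:falcon:cluster:0.1">
    <!-- colo and version values are illustrative placeholders -->
    <interfaces>
        <interface type="readonly"  endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
        <interface type="write"     endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
        <interface type="execute"   endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
        <interface type="workflow"  endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/primaryCluster/working"/>
    </locations>
</cluster>
EOF

# Submit the entity to Falcon (add -url http://<falcon-host>:15000 if needed)
falcon entity -type cluster -submit -file /tmp/primaryCluster.xml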

Falcon jobs require a source and target cluster. For some jobs, this may be the same cluster, for others, such as Mirroring and Disaster Recovery, the source and target clusters will be different. Let’s go ahead and create a second cluster by creating a cluster with the name:

backupCluster

Re-enter the same information you used above except for the directory information. For the directories, use the backupCluster directories created earlier in the tutorial:

Staging  /apps/falcon/backupCluster/staging
Temp /tmp
Working /apps/falcon/backupCluster/working

Click Save to persist the backupCluster entity.
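As a quick check, you can list the cluster entities Falcon now knows about (again assuming the falcon CLI is configured to reach the server):

falcon entity -type cluster -list

Both primaryCluster and backupCluster should appear in the output.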

Define the rawEmailFeed entity

To create a feed entity click on the Feed button on the top of the main page on the Falcon Web UI.

Then enter the definition for the feed by giving the feed a unique name and a description. For this tutorial we will use

rawEmailFeed

and

Raw customer email feed.

Let’s also enter a tag, so we can easily locate this Feed later:

externalSystem=USWestEmailServers

Feeds can be further categorised by identifying them with one or more groups. In this demo, we will group all the Feeds together by defining the group:

churnAnalysisDataPipeline

We then set the ownership information for the Feed:

Owner:  ambari-qa
Group:  users
Permissions: 755

Next we specify how often the job should run.

Let's run the job hourly by setting the frequency to 1 hour.
Click Next to enter the path of our data set:

/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
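Falcon expands the ${YEAR}-${MONTH}-${DAY}-${HOUR} pattern per feed instance. For example, the instance for 3:00 PM (UTC) on June 5, 2016 would resolve to a path along the lines of:

/user/ambari-qa/falcon/demo/primary/input/enron/2016-06-05-15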

We will set the stats and meta path to / for now.

Once you have verified that these are the correct values press Next.

On the Clusters page enter today’s date and the current time for the validity start time and enter an hour or two later for the end time. The validity time specifies the period during which the feed will run. For many feeds, validity time will be set to the time the feed is scheduled to go into production and the end time will be set into the future. Because we are running this tutorial on the Sandbox, we want to limit the time the process will run to conserve resources.

Click Next

Save the feed
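If you want to double-check what was stored, the saved feed definition can also be retrieved with the Falcon CLI (assuming the client is configured as above):

falcon entity -type feed -name rawEmailFeed -definition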

Define the rawEmailIngestProcess entity

Now let's define the rawEmailIngestProcess.

To create a process entity click on the Process button on the top of the main page on the Falcon Web UI.

Use the information below to create the process:

Process name: rawEmailIngestProcess
Tag: email
With the value: testemail

This job will run on the primaryCluster.
Again, set the validity to start now and end in an hour or two.

For the properties, set the number of parallel processes to 1; this prevents a new instance from starting prior to the previous one completing.

Specify the order as First-In, First-Out (FIFO) and set the frequency to 1 hour.

For the input and output, enter the rawEmailFeed we created in the previous step and specify now(0,0) for the instance. Then assign the workflow the name:

emailIngestWorkflow

Select Oozie as the execution engine and provide the following path:

/user/ambari-qa/falcon/demo/apps/ingest/fs

Accept the default values and click Next.

On the Clusters page, ensure you modify the validity by setting the end time to the next day, as in the picture below, and then click Next.

Accept the default values and click Next

Let’s Save the process.

Define the cleansedEmailFeed

Again, to create a feed entity click on the Feed button on the top of the main page on the Falcon Web UI.

Use the following information to create the feed:

Name: cleansedEmailFeed
Description: Cleansed customer emails
Tag: cleanse with value cleaned
Group: churnAnalysisDataPipeline
Frequency: 1 hour

We then set the ownership information for the Feed:

Owner:  ambari-qa
Group:  users
Permissions: 755

Set the default storage location to

/user/ambari-qa/falcon/demo/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}

Select the primary cluster for the source and again set the validity start for the current time and end time to an hour or two from now.

Specify the path for the data as:

/user/ambari-qa/falcon/demo/primary/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}

And enter / for the stats and meta data locations

Set the target cluster as backupCluster and again set the validity start for the current time and end time to an hour or two from now

And specify the data path for the target to

/falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}

Set the statistics and meta data locations to /

Accept the default values and click Next

Accept the default values and click Next

On the Clusters page ensure you modify the validity to a time slice which is in the very near future and then click Next

Accept the default values and click Save

Define the cleanseEmailProcess

Now let's define the cleanseEmailProcess.
Again, to create a process entity click on the Process button on the top of the main page on the Falcon Web UI.

Create this process with the following information

Process name: cleanseEmailProcess
Tag: cleanse with the value yes

We then set the ownership information:

Owner:  ambari-qa
Group:  users
Permissions: 755

This job will run on the primaryCluster.

Again, set the validity to start now and end in an hour or two.

For the properties, set the number of parallel processes to 1; this prevents a new instance from starting prior to the previous one completing.

Specify the order as First-In, First-Out (FIFO) and set the frequency to 1 hour.

For the inputs, enter the rawEmailFeed we created in the previous step, specify it as the input, and use now(0,0) for the instance.

Add an output using cleansedEmailFeed and specify now(0,0) for the instance.

Then assign the workflow the name:

emailCleanseWorkflow

Select Pig as the execution engine and provide the following path:

/user/ambari-qa/falcon/demo/apps/pig/id.pig

Accept the default values and click Next

On the Clusters page ensure you modify the validity to a time slice which is in the very near future and then click Next

Select the Input and Output Feeds as shown below and Save

Run the feeds

From the Falcon Web UI home page search for the Feeds we created

Select the rawEmailFeed by clicking on the checkbox

Then click on the Schedule button on the top of the search results

Next run the cleansedEmailFeed in the same way
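If you prefer the command line, the same feeds can be scheduled with the Falcon CLI (assuming the client can reach the Falcon server, as noted earlier):

falcon entity -type feed -schedule -name rawEmailFeed
falcon entity -type feed -schedule -name cleansedEmailFeed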

Run the processes

From the Falcon Web UI home page search for the Process we created

Select the cleanseEmailProcess by clicking on the checkbox

Then click on the Schedule button on the top of the search results

Next run the rawEmailIngestProcess in the same way
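With the same CLI setup, the equivalent commands for the processes would be:

falcon entity -type process -schedule -name cleanseEmailProcess
falcon entity -type process -schedule -name rawEmailIngestProcess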

If you visit the Oozie process page, you can see the processes running.
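You can also check from the shell that Oozie has coordinators running for the scheduled entities, using the Oozie address we configured in the cluster entity (the exact output depends on your Oozie version):

oozie jobs -oozie http://sandbox.hortonworks.com:11000/oozie -jobtype coordinator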

Input and Output of the pipeline

Now that the feeds and processes are running, we can check the data being ingested into the pipeline and the data being produced by it on HDFS.

Here is the data being ingested:

and here is the data coming out of the pipeline:
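Once a few instances have run, you can list the raw input and the cleansed output yourself, using the paths defined in the feeds above:

hadoop fs -ls /user/ambari-qa/falcon/demo/primary/input/enron/
hadoop fs -ls /user/ambari-qa/falcon/demo/primary/processed/enron/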

Summary

In this tutorial we walked through a scenario in which we defined a data pipeline with Apache Falcon to clean raw data, removing sensitive information like credit card numbers, and to make it available to our marketing data science team for customer churn analysis.

