Define and Process Data Pipelines in Hadoop with Apache Falcon
Apache Falcon is a framework for simplifying data governance and pipeline processing
Introduction

Apache Falcon simplifies the configuration of data motion by providing replication, lifecycle management, and lineage and traceability. This provides consistent data governance across Hadoop components.
Scenario

In this tutorial we walk through a scenario where email data lands hourly on a cluster. In our example:

- This cluster is the primary cluster, located in the Oregon data center.
- Data arrives from all the West Coast production servers. The input data feeds are often late by up to 4 hours.
Prerequisites

- A cluster with Apache Hadoop 2 configured
- A cluster with Apache Falcon configured
- If Ambari is not configured on your Sandbox, go to http://127.0.0.1:8000/about/ and enable Ambari.
- Once Ambari is enabled, navigate to Ambari at http://127.0.0.1:8080 and log in with username and password of `admin`, respectively. Then check whether Falcon is running.
- If Falcon is not running, start Falcon from the Ambari services page.
Steps for the Scenario

- Create the cluster specification XML file
- Create the feed (a.k.a. dataset) specification XML file
  - Reference the cluster specification
- Create the process specification XML file
  - Reference the cluster specification – defines where the process runs
  - Reference the feed specification – defines the datasets that the process manipulates
Staging the Components of the App on HDFS

In this step we will stage the Pig script and the necessary folder structure for the inbound and outbound feeds on HDFS. First, download the zip file called `falcon.zip` to your local host machine. Then navigate in your browser to the Hue File Browser interface at http://127.0.0.1:8000/filebrowser/ to explore the HDFS, and go to the `/user/ambari-qa` folder. Now upload the zip file we just downloaded. The upload should also unzip the file and create the folder structure for the feeds.
Staging the Specifications

SSH into the VM:

```shell
ssh firstname.lastname@example.org -p 2222
```

The password is `hadoop`.

From the SSH session, first switch to the `ambari-qa` user:

```shell
su ambari-qa
```

Go to the user's home directory:

```shell
cd ~
```

Download the topology, feed, and process definitions:

```shell
wget http://hortonassets.s3.amazonaws.com/tutorial/falcon/falconDemo.zip
```

Unzip the file:

```shell
unzip ./falconDemo.zip
```

Change directory to the folder created, then submit the entities to the cluster as described in the following sections.
Cluster Specification

There is one cluster specification per cluster. Back to our scenario: let's submit the 'oregon cluster' entity to Falcon. This signifies the primary Hadoop cluster located in the Oregon data center.

```shell
falcon entity -type cluster -submit -file oregonCluster.xml
```

Then let's submit the 'virginia cluster' entity to Falcon. This signifies the backup Hadoop cluster located in the Virginia data center.

```shell
falcon entity -type cluster -submit -file virginiaCluster.xml
```

If you view the XML files, you will see how the cluster location and purpose have been captured.
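For reference, a minimal cluster specification follows the shape below. This is a sketch, not the tutorial's actual file: the entity name, colo, endpoints, and paths are illustrative placeholders.

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch of a Falcon cluster entity; all endpoints and paths are illustrative -->
<cluster colo="oregon" description="Primary cluster" name="primaryCluster"
         xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
    <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
</cluster>
```

The `interfaces` section tells Falcon how to reach HDFS, YARN, Oozie, and the message bus for this cluster; the `locations` section gives Falcon its own staging and working directories on that cluster.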
Feed Specification

A feed (a.k.a. dataset) signifies a location of data and its associated replication policy and late-arrival cut-off time. Back to our scenario: let's submit the source of the raw email feed. This feed signifies the raw emails that are being downloaded into the Hadoop cluster. These emails will be used by the email cleansing process.

```shell
falcon entity -type feed -submit -file rawEmailFeed.xml
```

Now let's define the feed entity that handles the end of the pipeline, storing the cleansed email. This feed signifies the emails produced by the cleanse-email process. It also takes care of replicating the cleansed email dataset to the backup cluster (the Virginia cluster).

```shell
falcon entity -type feed -submit -file cleansedEmailFeed.xml
```
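A feed specification roughly takes the shape below. Again a sketch, not the tutorial's actual `rawEmailFeed.xml`: the dates, paths, and retention limit are illustrative, though the hourly frequency and 4-hour late-arrival cut-off mirror the scenario described above.

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch of a Falcon feed entity; dates, paths, and retention are illustrative -->
<feed description="Raw email feed" name="rawEmailFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <late-arrival cut-off="hours(4)"/>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/user/ambari-qa/falcon/input/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    <location type="stats" path="/none"/>
    <location type="meta" path="/none"/>
  </locations>
  <ACL owner="ambari-qa" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

The `late-arrival` cut-off is what lets Falcon cope with the up-to-4-hour-late input feeds from the West Coast servers, and a second `cluster` entry of `type="target"` is how a feed (such as the cleansed email feed) declares replication to the backup cluster.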
Process

A process defines the configuration for a workflow. A workflow is a directed acyclic graph (DAG) that defines the job for the workflow engine. A process definition specifies the configuration required to run the workflow job: for example, the frequency at which the workflow should run, the clusters on which it should run, its inputs and outputs, how failures should be handled, how late inputs should be handled, and so on. Back to our scenario: let's submit the ingest and the cleanse process, respectively. The ingest process is responsible for calling the Oozie workflow that downloads the raw emails from the web into the primary Hadoop cluster, under the location specified in rawEmailFeed.xml. It also takes care of handling late data arrivals.

```shell
falcon entity -type process -submit -file emailIngestProcess.xml
```

The cleanse process is responsible for calling the Pig script that cleans the raw emails and produces the clean emails, which are then replicated to the backup Hadoop cluster.

```shell
falcon entity -type process -submit -file cleanseEmailProcess.xml
```
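A process specification roughly takes the shape below. This is an illustrative sketch rather than the tutorial's actual `emailIngestProcess.xml`: the validity window, workflow path, and retry policy are placeholder values.

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch of a Falcon process entity; dates, paths, and retry values are illustrative -->
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <outputs>
    <output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
  </outputs>
  <workflow engine="oozie" path="/user/ambari-qa/falcon/apps/ingest"/>
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
  <ACL owner="ambari-qa" group="users" permission="0755"/>
</process>
```

Note how the process ties the pieces together: the `clusters` element references a cluster entity (where it runs), the `outputs`/`inputs` elements reference feed entities (the datasets it manipulates), and the `workflow` element points at the Oozie workflow to execute, with `retry` governing failure handling.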
Schedule the Falcon Entities

All that is left now is to schedule the feeds and processes to set the pipeline in motion.

Ingest the feed:

```shell
falcon entity -type feed -schedule -name rawEmailFeed
falcon entity -type process -schedule -name rawEmailIngestProcess
```

Cleanse the emails:

```shell
falcon entity -type feed -schedule -name cleansedEmailFeed
falcon entity -type process -schedule -name cleanseEmailProcess
```