Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters.
It provides data management services such as retention, replication across clusters, and archival. It makes it much simpler to onboard new workflows and pipelines, with support for late data handling and retry policies. It lets you define relationships between data and processing elements and integrate with a metastore or catalog such as Hive/HCatalog. Finally, it lets you capture lineage information for feeds and processes.
In this tutorial we are going to walk through the process of mirroring datasets between Hadoop clusters.
After creating the cluster entities, go back to Ambari as the admin user. Open the admin menu drop-down.
On the page that follows, click the blue Users button in the bottom box.
Click the Create Local User button at the top of the page. Enter ambari-qa as the user name and set a password for it. Enter the password again for confirmation and save the user.
You can now see the newly added ambari-qa user. Click on it to assign it to a group so that it can access Ambari views.
Type "views" in the Local Group Membership box and select it, then click the tick mark to add the ambari-qa user to that group.
Now log out of Ambari as the admin user and log back in as the ambari-qa user.
Select the Files View to see the default folders.
Navigate to /user/ambari-qa and create a new directory named falcon.
Click on the row of the falcon directory and edit its permissions: grant Write permission to both Group and Others, then save the change.
Now create the directories mirrorSrc and mirrorTgt under /user/ambari-qa/falcon as the source and target of the mirroring job we are about to create.
After creating the cluster entities, go back to the SSH terminal, switch the user to root and then to the user who will own the new directories, and create them:
hadoop fs -mkdir /user/ambari-qa/falcon
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorSrc
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorTgt
Now we need to set permissions to allow access. You must be logged in as the owner of the directory whose permissions you are changing:
hadoop fs -chmod -R 777 /user/ambari-qa/falcon
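The directory setup above can also be sketched as one small, repeatable script. This is only an illustration, assuming the `hadoop` CLI is on the PATH and that you are logged in as the directory owner. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences, not part of Hadoop or Falcon: by default the script only prints the commands it would run, and it executes them when `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: create the mirror directories and open their permissions.
# Assumption: the `hadoop` CLI is on PATH and you own the base directory.
set -eu

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "$*"
  fi
}

setup_mirror_dirs() {
  base="$1"
  # -p creates missing parents and does not fail if a directory already exists
  run hadoop fs -mkdir -p "$base/mirrorSrc" "$base/mirrorTgt"
  # Same permissive mode the tutorial uses; tighten it outside a sandbox
  run hadoop fs -chmod -R 777 "$base"
  # List the result to confirm both directories and their modes
  run hadoop fs -ls "$base"
}

setup_mirror_dirs /user/ambari-qa/falcon
```

Running it as-is previews the three commands. Running it with `DRY_RUN=0` in the environment executes them against HDFS.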
To create the mirroring job, go back to the Falcon UI in your browser and click on the Create drop-down.
Select Mirror from the drop-down menu to open the mirror creation page.
Provide a name of your choice. The name must be unique within the system. We named the mirror job MirrorTest.
Ensure the File System mirror type is selected, then select the appropriate Source and Target clusters and type in the corresponding paths. In our case the source cluster is primaryCluster and the HDFS path on that cluster is /user/ambari-qa/falcon/mirrorSrc.
The target cluster is backupCluster and the HDFS path on that cluster is /user/ambari-qa/falcon/mirrorTgt.
Also set the validity of the job to your current time, so that when you attempt to run the job in a few minutes, it is still within the validity period. Keep the default values in Advanced Options and continue to the summary.
Verify the summary information, then save the job.
Before we can run the job, we need some data to test on HDFS.
Let's give ourselves permission to upload some data using the HDFS View in Ambari.
su - root
su hdfs
hadoop fs -chmod -R 775 /user/ambari-qa
Open Ambari in your browser at port 8080.
Then launch the HDFS View from the top right-hand corner.
Stay logged in as ambari-qa and, from the view in the Ambari console, navigate to the source directory /user/ambari-qa/falcon/mirrorSrc.
Click the Upload button and upload any file you want to use.
Once uploaded, the file should appear in the directory.
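If you prefer the terminal to the HDFS View, the upload can be sketched with `hadoop fs -put`. This sketch assumes the upload target is the mirror source directory created earlier, /user/ambari-qa/falcon/mirrorSrc, and that the `hadoop` CLI is on the PATH. The sample file name and the `run` helper with its `DRY_RUN` variable are hypothetical: by default the script only prints the commands, and `DRY_RUN=0` executes them.

```shell
#!/bin/sh
# Sketch: upload a small test file to the mirror source directory.
# Assumption: the `hadoop` CLI is on PATH and the source directory exists.
set -eu

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "$*"
  fi
}

upload_test_file() {
  local_file="$1"
  src_dir="$2"
  # -put copies a local file into HDFS; -f overwrites an existing copy
  run hadoop fs -put -f "$local_file" "$src_dir/"
  # List the directory so the uploaded file can be confirmed
  run hadoop fs -ls "$src_dir"
}

# Create a throwaway local file (hypothetical name) to use as test data
sample="${TMPDIR:-/tmp}/falcon-mirror-sample.txt"
echo "falcon mirror test" > "$sample"

upload_test_file "$sample" /user/ambari-qa/falcon/mirrorSrc
```

Any file works as test data. A small text file simply makes it easy to recognise in the target directory later.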
Now navigate to the Falcon UI and search for the job we created. The name of the mirror job we created was MirrorTest.
Select the MirrorTest job by clicking its checkbox and then run it.
The state of the job should change to show that it is running.
After a few minutes, use the HDFS View in the Ambari console to check the
/user/ambari-qa/falcon/mirrorTgt directory and you should see that your data is mirrored.
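The same check can be sketched from the command line by comparing the file names in the two directories. This is a rough illustration under several assumptions: both paths are reachable from the same `hadoop` client (as in a single sandbox), file names contain no spaces, and the helper functions `listing_names` and `compare_mirror` are hypothetical names introduced here. It compares names only, not file contents.

```shell
#!/bin/sh
# Sketch: compare file names in the mirror source and target directories.
# Assumption: both paths are reachable from the same `hadoop` client.
set -eu

# Read `hadoop fs -ls` output on stdin and print sorted base names.
# Entry lines have 8 fields; the "Found N items" header has 3 and is skipped.
listing_names() {
  awk 'NF >= 8 { n = split($NF, parts, "/"); print parts[n] }' | sort
}

compare_mirror() {
  src_names="$(hadoop fs -ls "$1" | listing_names)"
  tgt_names="$(hadoop fs -ls "$2" | listing_names)"
  if [ "$src_names" = "$tgt_names" ]; then
    echo "mirrored: target has the same file names as source"
  else
    echo "not mirrored yet: file names differ"
  fi
}

# Only run the comparison where the hadoop CLI is actually installed
if command -v hadoop >/dev/null 2>&1; then
  compare_mirror /user/ambari-qa/falcon/mirrorSrc /user/ambari-qa/falcon/mirrorTgt
fi
```

For a stricter check, `hadoop fs -checksum` on a file in each directory can be compared by hand, since matching names alone do not prove the contents are identical.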
In this tutorial we walked through the process of mirroring the datasets between two cluster entities.