Two weeks ago Hortonworks presented the third in a series of eight Discover HDP 2.2 webinars: Discover HDP 2.2: Apache Falcon for Hadoop Data Governance. Andrew Ahn, Venkatesh Seetharam, and Justin Sears hosted this third webinar in the series.
After Justin Sears set the stage for the webinar by explaining the drivers behind Modern Data Architecture (MDA), Andrew Ahn and Venkatesh Seetharam introduced and discussed how to use Apache Falcon for central management of data lifecycle, business continuity and disaster recovery, and audit and compliance requirements. They also covered Apache Falcon innovations now included in HDP 2.2.
Here is the complete recording of the webinar.
Here are the presentation slides on Slideshare.
We’re grateful to the many participants who joined the HDP 2.2 webinar and asked excellent questions. This is the complete list of questions with their corresponding answers:
|How does a data pipeline relate to a DAG?||You may express a DAG as a pipeline or a pipeline as a DAG, depending on the use case. Because Falcon supports pipelines on multiple engines, you can execute your DAG using Oozie.|
|How does Falcon handle late data management? Does it let the process wait for late data? How does it detect there’s more data coming?||Falcon checks for data arrival with configurable polling (linear or exponential), and the time-based cutoff is also configurable. The size of a data set is recorded when a process is launched and then monitored for changes.|
|What does a MapReduce developer have to do in order to use Falcon features such as lineage? Must the developer write special instrumentation code and/or calls to the Falcon API?||Falcon cannot provide lineage for MapReduce jobs that do not use the feed entity. However, if the input and output feeds (datasets) are defined properly, meaning that all the datasets in the MapReduce job (process entity) are accounted for as feed entities, then lineage will work.|
|Does the UI display existing Falcon jobs?||There is a simple read-only dashboard that displays jobs. Currently, you can view jobs from Oozie and Ambari; an improved Falcon UI will have a discrete screen for this (preview in January 2015).|
|Can Hive/Pig be integrated with Falcon? If so, how? And for what type of use cases?||Falcon supports the Apache Hive, Pig, and Oozie engines; please refer to the Apache Falcon docs. Falcon can also call any third-party script, which is defined in the process entity XML. Please see our tutorial for an example.|
|What replication techniques does Falcon support? For example, synchronous, asynchronous, delta changes, snapshots etc? How do you manage the recovery?||Falcon replication is asynchronous with delta changes. Recovery is done by running a process and swapping the source and target.|
|Can Falcon jobs be submitted through Apache Knox or only by direct access to Hadoop?||Direct access is currently supported. Access via Knox is coming in a future release.|
|Can I use Falcon as a disaster recovery solution for other databases?||At this time, Falcon support is limited to HCatalog replication in 2.2 in non-secure mode. Hive (and metadata) replication in secure mode will be supported in early 2015.|
|Does mirroring work across data centers?||Yes.|
|What type of data quality functionality does Falcon support?||The data copied by Falcon is validated for quality via checksums. The rest is outside product scope.|
|What’s the difference between HDFS mirroring and replication factor? If my data is already replicated 3X (or however much I determine), what value does mirroring provide?||HDFS mirroring is across clusters, networks, and data centers, while replication factor is within a single cluster. For example, to ship some data from a regional data center to a central data center for further examination or transformation, you can use Falcon HDFS mirroring between the two data centers.|
|How does Falcon integrate with Flume (interceptors)?||Falcon doesn’t directly integrate with Flume at this time. A workaround is to land the data in HDFS and then use Falcon to tag it and create an audit event (no I/O for this operation).|
|Does HDFS mirroring handle incremental backups? Does it use the snapshot feature in HDFS to do it?||Falcon performs incremental backups but does not use HDFS snapshots. Snapshots are used more for backup and recovery within a given cluster than for mirroring; because of the scale of data involved in shipping snapshots across clusters, they are not presently supported in Falcon.|
|Are there plans to introduce data retention support for HBase?||We have complete Hive integration support, but not for HBase. However, we plan to support it in the future.|
|Does Falcon support or interact with HCatalog?||We have native integration with HCatalog and Hive, so you could model your feeds and datasets into Falcon as Hive tables. Falcon employs the HCatalog API to access the Hive metastore.|
|Is there any distance constraint to HDFS mirroring?||There may be some latency, no different from sending a packet across a distant network.|
|Is there an API for extracting metadata?||Falcon has a metadata query API that allows you to export the entire lineage of a dataset in various formats, including JSON.|
|Can you specify a complex rule to retain data for X number of days but keep N Hive partitions?||We have not seen a requirement where you need both, but today you can specify a simple rule to retain a type of data for minutes, hours, days, months, etc.|
|Does Falcon impose any metadata limits to tagging during data ingestion?||No, there’s no limit. You can add as many tags as you wish.|
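Several of the answers above (retention rules, late data handling, and cross-cluster replication) come together in Falcon's feed entity XML. The sketch below shows roughly what such a feed might look like; the entity names, paths, dates, and limits are illustrative, not taken from the webinar:

```xml
<!-- Hedged sketch of a Falcon feed entity: names, paths, and dates are hypothetical. -->
<feed name="rawEmailFeed" description="Raw email data landed hourly" xmlns="uri:falcon:feed:0.1">
  <!-- Free-form tags; Falcon imposes no limit on how many you add -->
  <tags>owner=landing,classification=raw</tags>
  <frequency>hours(1)</frequency>
  <!-- Time-based cutoff for late-arriving data, as described above -->
  <late-arrival cut-off="hours(4)"/>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2014-12-01T00:00Z" end="2016-12-01T00:00Z"/>
      <!-- Simple retention rule: keep 90 days on the source cluster -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- A target cluster turns this feed into an asynchronous replication (mirroring) pipeline -->
    <cluster name="backupCluster" type="target">
      <validity start="2014-12-01T00:00Z" end="2016-12-01T00:00Z"/>
      <retention limit="months(12)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/email/raw/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl-user" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

A feed like this would typically be registered and started with the Falcon CLI, e.g. `falcon entity -type feed -submit -file rawEmailFeed.xml` followed by `falcon entity -type feed -schedule -name rawEmailFeed`.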
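The process entity is where lineage and late-data handling attach to your own logic: as long as every input and output is declared as a feed, Falcon can track lineage without instrumentation code in the job itself. A rough sketch, again with hypothetical names, feeds, and paths:

```xml
<!-- Hedged sketch of a Falcon process entity: names, feeds, and paths are hypothetical. -->
<process name="emailCleanseProcess" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2014-12-01T00:00Z" end="2016-12-01T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <!-- Declaring inputs/outputs as feeds is what makes lineage work -->
  <inputs>
    <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
  </outputs>
  <!-- The engine can be oozie, pig, or hive; here a Pig script does the work -->
  <workflow name="emailCleanseWorkflow" version="pig-0.13.0" engine="pig"
            path="/apps/pig/cleanse.pig"/>
  <retry policy="exp-backoff" delay="minutes(3)" attempts="3"/>
  <!-- Exponential-backoff handling when late data arrives past the feed's cut-off -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/pig/handle-late.xml"/>
  </late-process>
</process>
```

The `late-process` element pairs with the feed's `late-arrival` cut-off: Falcon polls for late data and, if any appears within the cutoff window, reruns the designated workflow for the affected input.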