Chapter 1. Data Governance with Apache Falcon

Apache Falcon provides a framework for automating data governance: you define data pipelines and can change them dynamically through the Falcon interface. Falcon removes the need to hard-code complex dataset and pipeline logic, and offers:

  • Data Replication: Falcon can replicate HDFS and Hive datasets, retry failed processes, and handle late-arriving data.

  • Data Lifecycle Management: Falcon schedules eviction based on data retention policies you set.

  • Dataset Traceability: Falcon exposes coarse-grained dependencies between clusters, datasets, and processes.
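
To make the retention idea above concrete, here is a minimal Python sketch of how a retention limit can be turned into an eviction list. It is not Falcon's implementation or API: the function name instances_to_evict, the 90-day limit, and the sample timestamps are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def instances_to_evict(instance_times, retention, now):
    """Return, oldest first, the dataset instances older than the retention window.

    Hypothetical helper for illustration only; Falcon applies retention
    through its own entity definitions, not through this function.
    """
    cutoff = now - retention
    return sorted(t for t in instance_times if t < cutoff)

# Assumed sample data: dataset instances created 1, 45, 91, and 120 days ago.
now = datetime(2016, 3, 31)
instances = [now - timedelta(days=d) for d in (1, 45, 91, 120)]

# Assumed policy: keep 90 days of data, evict anything older.
expired = instances_to_evict(instances, timedelta(days=90), now)
for t in expired:
    print("evict", t.date())
```

With a 90-day limit, only the two instances older than the cutoff are selected; the scheduling of the eviction itself is left to the governance layer.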

Falcon can be installed and managed by Apache Ambari, and jobs can be traced through the native Falcon UI. Falcon can process data using:

  • Oozie jobs

  • Pig scripts

  • Hive scripts

These jobs can then send alerts back to Falcon, giving you the latest status of your data pipeline activities.

To learn more about Falcon, choose any of the following topics: