Chapter 7. Creating Falcon Entity Definitions

Before you can use Falcon to process, manage, or replicate data or configure for disaster recovery, you must create directories in HDFS for Falcon to use, then create the necessary Falcon entity definitions. You can create the entity definitions from the web UI or from the CLI.

There are three types of entities.

Cluster

Defines the cluster, including its interfaces, where data and processes are stored.

Cluster entities are required for all jobs.

Feed

Defines the datasets to be cleaned and processed.

Feed entities are required for replication and retention.

Process

Defines how the process (such as a Pig or Hive job) works with the dataset on top of a cluster. A process consumes feeds, invokes processing logic (including late data handling), and produces further feeds. It also defines the configuration of the Oozie workflow and defines when and how often the workflow should run.

Process entities are required for all jobs. However, you can run a process without a feed entity if data replication or retention is not required.

Replication Between HDP Versions

Before you define the resources to be used in replication jobs, ensure that the versions of HDP you are running are compatible for replication.

Table 7.1. Supported HDP Versions for Replication

HDP Release	Can Replicate To	Cannot Replicate To
2.0	2.0, 2.1	2.2 or later releases
2.1	2.1	2.2 or later releases
2.2 or later	2.2 and any subsequent release	2.1 or earlier releases
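
The compatibility table can be expressed as a small check before you plan a replication job. The following is a minimal sketch, not part of Falcon: the function names `hdp_rank` and `can_replicate` are hypothetical, versions are assumed to be given as `major.minor`, and "2.2 and any subsequent release" is read as "any target at 2.2 or later". Targets that the table does not list are reported as unsupported.

```shell
# Convert "major.minor[.patch]" to a comparable integer (2.2 -> 202).
hdp_rank() {
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  echo $(( major * 100 + minor ))
}

# Print "yes" if Table 7.1 lists replication from HDP $1 to HDP $2 as supported,
# "no" if it does not, and "unknown" for source releases the table does not cover.
can_replicate() {
  src=$(hdp_rank "$1")
  dst=$(hdp_rank "$2")
  if [ "$src" -ge 202 ]; then
    # 2.2 or later: targets must also be 2.2 or later.
    [ "$dst" -ge 202 ] && echo yes || echo no
  elif [ "$src" -eq 201 ]; then
    # 2.1: only 2.1 is listed as a target.
    [ "$dst" -eq 201 ] && echo yes || echo no
  elif [ "$src" -eq 200 ]; then
    # 2.0: 2.0 and 2.1 are listed as targets.
    { [ "$dst" -eq 200 ] || [ "$dst" -eq 201 ]; } && echo yes || echo no
  else
    echo unknown
  fi
}

can_replicate 2.0 2.1   # prints yes
can_replicate 2.1 2.2   # prints no
```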

Running Falcon in a Secure Environment

Before you access Falcon in a secure environment, your cluster must be properly configured according to the instructions in the HDP Security guide.

After completing an Ambari secure installation with Falcon, Kerberos is enabled. You can access the Falcon web UI from the Ambari Views menu without any additional configuration.

When creating Falcon entities, most of the required Kerberos information is automatically populated in the web UI.

If secure HDP is installed from the CLI, you need to ensure that Falcon was also configured for Kerberos.

If you need to access the standalone Falcon web UI, you might need to add the VM name and domain to your browser, following the browser instructions for Kerberos single sign-on (SSO) configuration.

Creating HDFS Directories for Falcon

You must create the following directories in HDFS prior to submitting a cluster entity to Falcon:

/staging: Falcon stores artifacts of processes and feeds, such as the feed and process definitions and job logs, in /staging. When you schedule an entity, the workflow and library you specify in the entity are copied to the staging path.
/working: Falcon copies the .jar files needed to run processes and feeds to the working directory.
/temp: Falcon uses the Temp directory to do intermediate processing of entities in HDFS.

Steps

In the Ambari UI, ensure that the Falcon and Oozie services are running.
Confirm that the Falcon and Oozie services have green check marks adjacent to them on the Ambari dashboard:

Figure 7.1. Ambari Dashboard Falcon and Oozie Service Indicators
Log in to the HDFS server as the falcon user.
```
sudo su falcon
```
Create the directory structure on HDFS for the staging, temporary, and working folders in which the cluster entity stores the dataset.
These folders must be owned by the falcon user.
```
hadoop fs -mkdir -p /apps/falcon/primary_Cluster
hadoop fs -mkdir -p /apps/falcon/primary_Cluster/staging
hadoop fs -mkdir -p /apps/falcon/primary_Cluster/working
hadoop fs -mkdir -p /apps/falcon/tmp
```
Tip
You can also create the directories from the Ambari UI File View. You must be logged in as user falcon.

Set permissions on the cluster staging directory to 777 (read/write/execute for owner/group/others) to allow multiple user access.

hadoop fs -chmod -R 777 <your_staging_directory_path>

Only Oozie job definitions are written to the staging directory, so setting permissions to 777 does not create any vulnerability. Keep permissions on the working directory set to 755 (rwx/r-x/r-x).

	Tip
	You can also change permissions on directories from the Ambari UI Administration menu. You must be logged in as user falcon with administrator access.
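
The directory and permission steps above can be collected into one script. This is a sketch under stated assumptions rather than a required procedure: `CLUSTER_DIR` is a hypothetical example path, the helper names `run` and `setup_falcon_dirs` are illustrative, and the script is assumed to run as the falcon user so that the new directories are owned by falcon. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Hypothetical parent directory for one cluster entity.
CLUSTER_DIR=${CLUSTER_DIR:-/apps/falcon/primaryCluster}

# Print the command when DRY_RUN=1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

setup_falcon_dirs() {
  # Create the staging and working directories, including missing parents.
  run hadoop fs -mkdir -p "$CLUSTER_DIR/staging" "$CLUSTER_DIR/working"
  # Staging is opened to all users; working stays at 755.
  run hadoop fs -chmod -R 777 "$CLUSTER_DIR/staging"
  run hadoop fs -chmod -R 755 "$CLUSTER_DIR/working"
  # List the directories themselves to confirm owner and mode.
  run hadoop fs -ls -d "$CLUSTER_DIR/staging" "$CLUSTER_DIR/working"
}

setup_falcon_dirs
```

Review the printed commands first, then rerun with `DRY_RUN=0` on a host where the `hadoop` client is configured.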

Launch the Falcon web UI from Ambari:

On the Services tab, select Falcon in the services list.

At the top of the Falcon service page, click Quick Links, and then click Falcon Web UI.

	Important
	To access the Falcon UI on a secured cluster, use `kinit` before accessing the Kerberos-protected web server and configure your browsers for SPNEGO access. For more information, see Configure Browser for SPNEGO.

Defining Entities Using the Falcon Web UI

The Apache Falcon web UI simplifies the process of defining and deploying entities in a data pipeline. Using the web UI ensures that the XML definition file that you use to deploy the data pipeline to the Falcon server is well-formed. You can access the Falcon web UI from Apache Ambari.

See the Ambari Automated Install Guide or Non-Ambari Cluster Installation Guide for installation instructions for the Falcon component.

See the Apache Ambari Views guide for installation instructions for the Falcon View in Ambari.

Prerequisites

Ensure that the following components have been installed on your cluster:
- Hortonworks Data Platform (HDP)
- Apache Falcon
- Apache Oozie
Ensure that the required HDFS directories have been created with the appropriate permissions. See the section called “Creating HDFS Directories for Falcon”.

Steps

To create a data pipeline, you must define the entities you are using, by performing the following actions:

Creating a Cluster Entity Definition Using the Web UI

You must specify a cluster entity before defining other elements in your data pipeline. The cluster entity defines where the data and the processes for your data pipeline are stored. For more information, see the Apache cluster entity XSD.

Steps

At the top of the Falcon web UI page, click Create > Cluster.

On the New Cluster page, specify the values for the following properties:

Table 7.2. Cluster Entity General Properties

Property	Description
Cluster Name	Name of the cluster resource entity. Not necessarily the actual cluster name. Cannot be changed after the entity is created. The naming criteria are as follows: must be unique to the data center; must start with a letter; is case sensitive; can contain a maximum of 40 characters; can include numbers; can use a dash (-) but no other special characters; cannot contain spaces; must match the name of the directory containing the staging, tmp, and working subdirectories.
Data Center or Colo Name and Description	The data center to which the cluster belongs. Used to identify the correct cluster for the entity.
Tags	Metadata tagging to classify the cluster.
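
Some of the Cluster Name criteria can be checked with a pattern before you type the name into the UI. The sketch below covers only the syntactic rules (starts with a letter, contains letters, numbers, and dashes, at most 40 characters, no spaces). The function name `is_valid_cluster_name` is hypothetical, and uniqueness within the data center and the match against the HDFS directory name still have to be verified separately.

```shell
# Succeeds if the name starts with a letter, uses only letters, digits,
# and dashes, and is 1 to 40 characters long.
is_valid_cluster_name() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z][A-Za-z0-9-]{0,39}$'
}

is_valid_cluster_name "primary-cluster1" && echo "primary-cluster1: valid"
is_valid_cluster_name "1cluster" || echo "1cluster: invalid"
```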

Table 7.3. Cluster Entity Interface Properties

Property	Description
File System Read Endpoint Address	A read-only interface that specifies the endpoint and protocol to connect to a data source. Required for DistCp (distributed copy). This would be used for replication, such as importing from a data source into HDFS. Located in Ambari at `HDFS>Configs>Advanced>NameNode>NameNode host`. The URI might be hftp://<hostname>:50070 or hdfs://<hostname>:8020.
File System Default Address	A write interface that specifies the endpoint and protocol to write to the datasource. Falcon uses this interface to export data from HDFS to the data source. Located in Ambari at `HDFS>Config>Advanced>Advanced core-site> fs.defaultFS`.
YARN Resource Manager Address	Required to write jobs to MapReduce. Located in Ambari at `Yarn>Config>Advanced>Resource Manager>ResourceManager host`.
Workflow Address	Required to submit workflow jobs. Located in Ambari at `Oozie>Config>Advanced>Oozie Server>Oozie Server host`.
Message Broker Address	Required to send alerts. Located in Ambari at `Falcon>Config>Advanced>broker.url`.
Metadata Catalog Registry	Optional. Use to register or deregister partitions in the Hive Metastore and to fetch events about partition availability. Located in Ambari at `Hive>Config>Advanced>General>hive.metastore.uris`. In a secure environment, you must also add a property name and value for `hive.metastore.kerberos.principal` found under `Advanced hive-site` in Ambari. Example: key=hive.metastore.kerberos.principal, value=hive/_HOST@EXAMPLE.COM
Spark	Optional. Select if you want to run a Spark job. You can deploy Spark on a YARN cluster or client, locally, or on a custom site. You only need to add an endpoint for the Custom option. Important: The `yarn-client` execution mode for the Oozie Spark action is no longer supported. Oozie and Falcon continue to support `yarn-cluster` mode.

Table 7.4. Cluster Entity Properties & Location Properties

Property	Description
Properties	Specify a name and value for each property. In a secure environment, the Name and Value fields are automatically populated with content from the `Custom falcon-startup.properties` section in Ambari. Example: key name=`dfs.namenode.kerberos.principal`, value=`nn/_HOST@EXAMPLE.COM`
Location	Specify the HDFS locations for the staging, temp, and working directories. The directory that contains the staging and working subdirectories must have the same name as the entity's Cluster Name. For more information, see Creating HDFS Directories for Falcon.

Table 7.5. Cluster Entity Advanced Properties

Property	Description
Access Control List	Specify the owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).
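
This chapter writes permissions both as octal modes (755, 777) and as owner/group/others triplets. As an illustration of how the two notations relate, the following sketch converts a mode to its triplet form. The function names `digit_to_rwx` and `mode_to_rwx` are hypothetical helpers, not Falcon or Hadoop commands.

```shell
# Convert one octal digit (0-7) to its rwx triplet.
digit_to_rwx() {
  case "$1" in
    0) echo '---' ;; 1) echo '--x' ;; 2) echo '-w-' ;; 3) echo '-wx' ;;
    4) echo 'r--' ;; 5) echo 'r-x' ;; 6) echo 'rw-' ;; 7) echo 'rwx' ;;
    *) return 1 ;;
  esac
}

# Convert a mode such as 755 or 0755 to owner/group/others form.
mode_to_rwx() {
  mode=$1
  # Accept a leading zero, as in permission="0755".
  case "$mode" in
    0???) mode=${mode#0} ;;
  esac
  owner=$(digit_to_rwx "${mode%??}") || return 1
  rest=${mode#?}
  group=$(digit_to_rwx "${rest%?}") || return 1
  others=$(digit_to_rwx "${mode#??}") || return 1
  echo "$owner/$group/$others"
}

mode_to_rwx 755   # prints rwx/r-x/r-x
mode_to_rwx 777   # prints rwx/rwx/rwx
```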

Click Next to view a summary of your cluster entity definition.
(Optional) Click Preview XML to review or edit the entity definition in XML.
After verifying the cluster entity definition, click Save.
The entity is automatically submitted for verification, but it is not scheduled to run.
Verify that you successfully created the cluster entity by clicking the Clusters icon at the top right of the window.
If the entity name appears in the cluster list, the entity was created.
Schedule the entity.
1. In the cluster list, click the checkbox next to an entity name with status of Submitted.
2. Click Schedule.
  After a few seconds a success message displays.

	Tip
	You can modify an existing cluster entity from the clusters list by clicking Edit. However, the Falcon service must be in safemode before you can edit the cluster entity.

Creating a Feed Entity Definition Using the Web UI

The feed entity defines the datasets that are cleaned and processed in your data pipeline. For more information, see the Apache feed entity XSD.

Feed entities have policies attached to them that must be explicitly scheduled by Falcon. Falcon takes the retention, replication, feed frequency, and delays and creates Oozie Coordinator jobs to automate all of these actions for you. To process data, you must define two feed entities: one for data input and one for data output.

Steps

At the top of the Falcon web UI page, click Create > Feed.

On the New Feed page, specify the following values:

Table 7.6. General Feed Properties

Property	Description
Feed Name and Description	The dataset feed name must be unique on the cluster. This name is referenced by processes as input or output feed. Cannot be changed after the entity is created.
Tags	Metadata tagging to classify the dataset. For example, you can set the key to "externalTarget" and the corresponding value to "Marketing", tagging this feed for marketing.
Feed Groups	A group is a logical association of feeds. Feeds can belong to multiple groups. A group is said to be available if all the feeds belonging to a group are available. The frequency value must be the same for all feeds in a group.
Type	Options are Hive, HDFS, RDBMS Import, and RDBMS Export. You can also choose whether to enable replication for the feed type.

Table 7.7. Hive Source and Target Feed Properties

Property	Description
Cluster	The target cluster entity.
Table URI	Metadata catalog specification for the Hive table, in the metadata format `catalog:$database_name:$table#(partition-key=partition-value);+`. Example: catalog:logs-db:clicks#ds=${YEAR}-${MONTH}-${DAY}
Start and End Times	Date is entered in mm/dd/yyyy format. Time is set to a 12-hour scale. After the feed job starts, it continues at the given Frequency setting until reaching the end time.
Retention	Period to retain instances and number of instances to retain.
Frequency	How often the feed is generated. Valid frequency types are minutes, hours, days, and months.


Table 7.8. HDFS Source and Target Feed Properties

Property	Description
Cluster	The target cluster entity.
Data Path	Path of the HDFS data to be exported. Example: `/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}`
Statistics Path	(Optional) Path to directory in which to store statistics.
Start and End Times	Date is entered in mm/dd/yyyy format. Time is set to a 12-hour scale. After the feed job starts, it continues at the given Frequency setting until reaching the end time.
Retention	Period to retain instances and number of instances to retain.
Frequency	How often the feed is generated. Valid frequency types are minutes, hours, days, and months.
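
The Data Path example uses date variables that stand for the time of each feed instance. To see which directory a given instance would map to, you can expand the template by hand. The following sketch only illustrates the substitution; the function name `expand_feed_path` is hypothetical, it handles only the four variables shown in the example, and it assumes purely numeric date values.

```shell
# Expand the date variables in a feed path for one instance time.
# Usage: expand_feed_path TEMPLATE YEAR MONTH DAY HOUR
expand_feed_path() {
  printf '%s\n' "$1" | sed \
    -e 's/[$]{YEAR}/'"$2"'/g' \
    -e 's/[$]{MONTH}/'"$3"'/g' \
    -e 's/[$]{DAY}/'"$4"'/g' \
    -e 's/[$]{HOUR}/'"$5"'/g'
}

# Single quotes keep the shell from expanding ${YEAR} and the other variables itself.
expand_feed_path '/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}' 2016 03 01 05
# prints /weblogs/2016-03-01-05
```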


Table 7.9. RDBMS Import Source and Target Feed Properties

Property	Description
Datasource	Connection information for a remote data source, such as an RDBMS.
Table	Name of the RDBMS table to be imported or exported.
Extract Type	Options are Full or Incremental
Merge Type	Options are Snapshot or Append
Columns	Select to have all columns replicated, or enter the specific columns to include or exclude.
Location	Options are HDFS or Hive
Cluster	The target cluster entity.
Data Path	(HDFS only) Path of the HDFS data to be exported. Example: `/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}`
Statistics Path	(HDFS only--optional) Path to directory in which to store statistics.
Table URI	(Hive only) Metadata catalog specification for the Hive table, in the metadata format `catalog:$database_name:$table#(partition-key=partition-value);+`. Example: catalog:logs-db:clicks#ds=${YEAR}-${MONTH}-${DAY}
Start and End Times	Date is entered in mm/dd/yyyy format. Time is set to a 12-hour scale. After the feed job starts, it continues at the given Frequency setting until reaching the end time.
Retention	Period to retain instances and number of instances to retain.
Frequency	How often the feed is generated. Valid frequency types are minutes, hours, days, and months.

Table 7.10. RDBMS Export Source and Target Feed Properties

Property	Description
Location	Options are HDFS or Hive
Cluster	The target cluster entity.
Data Path	(HDFS only) Path of the HDFS data to be exported. Example: `/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}`
Statistics Path	(HDFS only--optional) Path to directory in which to store statistics.
Table URI	(Hive only) Metadata catalog specification for the Hive table, in the metadata format `catalog:$database_name:$table#(partition-key=partition-value);+`. Example: catalog:logs-db:clicks#ds=${YEAR}-${MONTH}-${DAY}
Start and End Times	Date is entered in mm/dd/yyyy format. Time is set to a 12-hour scale. After the feed job starts, it continues at the given Frequency setting until reaching the end time.
Retention	Period to retain instances and number of instances to retain.
Datasource	Connection information for a remote data source, such as an RDBMS.
Table	Name of the RDBMS table to be imported or exported.
Load Method	Options are Update Only or Allow Insert. Indicates whether rows can be inserted into the RDBMS or only updates to existing rows are allowed.
Columns	Select to have all columns replicated, or enter the specific columns to include or exclude.
Frequency	How often the feed is generated. Valid frequency types are minutes, hours, days, and months.


Table 7.11. Advanced Feed Properties

Property	Description
Queue Name	Hadoop job queue.
Job Priority	Hadoop job priority.
Late Arrival Cutoff	The timeframe within which a dataset feed can arrive late and still be processed. The feed instance is rerun within the time specified to capture any late arriving data.
Availability Flag	Specifies the name of a file that, when present in a feed's data directory, indicates that the feed is available.
Max Map Jobs	The maximum number of maps used during replication. This setting impacts performance and throttling.
Max Bandwidth	The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling.
Parallel	The concurrent replication instances that can run at any given time. This setting impacts performance and throttling.
Access Control List	Specify the HDFS access permissions. The owner must be the owner of the Falcon `staging` and `working` directories.
Feed Schema	Specify the schema location and provider. This is required for HDFS. If none, enter `/none`.

Click Next to view a summary of your feed entity definition.
(Optional) Click Preview XML to review or edit the entity definition in XML.
After verifying the feed entity definition, click Save.
The entity is automatically submitted for verification, but it is not scheduled to run.
Verify that you successfully created the feed entity.
1. Type the feed entity name in the Falcon web UI Search field and press Enter.
2. If the feed entity name appears in the search results, it was successfully created.
  For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
Schedule the entity.
1. In the search results, click the checkbox next to an entity name with status of Submitted.
2. Click Schedule.
  After a few seconds a success message displays.

Creating a Process Entity Definition Using the Web UI

The process entity consumes the feeds, invokes processing logic, and can produce additional feeds. For more information, see the Apache process entity XSD.

You can use one script to work on many different Falcon-defined processes. This helps reduce code complexity and makes it easier to manage multiple workflows. Although you can call complicated Oozie workflows with one Falcon job, we recommend that you split up the complex Oozie workflows into modular steps. This helps Falcon better manage retention of any intermediary datasets. It also allows for reuse of Processes and Feeds.

When submitting a workflow.xml file as part of a process entity definition, ensure the following:

The workflow.xml path is absolute.
The file is in a location that is readable by the user submitting the entity and by service user "falcon".

Steps

At the top of the Falcon web UI page, click Create > Process.

On the New Process page, specify the values for the following properties:

Table 7.12. General Process Properties

Property	Description
Process Name	Name of the process entity. Must be unique for the cluster. Cannot be changed after the entity is created.
Tags	Optional business labels, such as "Finance." There is no input validation on this field, so there can be duplicates.

Table 7.13. Process Detail and Engine Properties

Property	Description
Engine	Specify which workflow engine to use. Choices are Spark, Oozie, Pig, and Hive. Some properties for the entity might change, depending on the engine selected.
Workflow Name	The name of the workflow to be used. The naming criteria are as follows: must be unique to the data center; must start with a letter; is case sensitive; can contain 2 to 100 characters; can include numbers; can use a dash (-) but no other special characters; cannot contain spaces.
Workflow Path	The path to the workflow engine on HDFS. The workflow path must be absolute, and the workflow file must be in a location that is readable by the user submitting the entity and by user "falcon". For example, if you are using a Pig script to define the workflow, you can set the path to `/apps/clickstream/clean-script.pig`. The workflow definition on HDFS contains the actual job that should run and it should conform to the workflow specification of the engine specified. The libraries required by the workflow should be in the `/lib` folder inside the workflow path.
Cluster	Select an existing cluster entity to specify where the workflow runs.
Inputs	The input data for the workflow. Each input maps to a feed. The path and frequency of input data is picked up from the feed entity definition. Feeds with Hive table storage can be used as inputs to a process. The workflow job starts only if all the inputs are available.
Outputs	The output data that is generated by the workflow. Each output is mapped to a feed and the output path is picked up from the feed definition.
Name	Name of the Spark application. The naming criteria are as follows: must start with a letter; can contain 2 to 40 characters; can include numbers; can use a dash (-) but no other special characters; cannot contain spaces.
Application (Spark only)	Specifies the `.jar` or Python file to be executed by Spark. Enter a valid HDFS path, including the file name.
Main Class (Spark only)	The main class for the `.jar` file specified. If the entry does not conform to a valid Java class name, the process cannot be submitted and a message is displayed: Please enter your application's main class name.
Runs On (Spark only)	Determines whether the application runs through YARN or locally on the Spark Master daemon. If the process is submitted with a master unreachable, the process is flagged as non-executable.
Mode (Spark only)	Only available if Spark is running on YARN. If run in cluster mode, it runs remotely on an ApplicationMaster. In client mode, the driver process runs locally. Important: The `yarn-client` execution mode for the Oozie Spark action is no longer supported. Oozie and Falcon continue to support `yarn-cluster` mode.
Spark Options (Spark only)	Enter any Spark options you want to implement with this process.
Spark Arguments (Spark only)	Enter the arguments for the Spark options.
Run Duration Start and End	The times at which the process begins and finishes. Combined with the frequency value, these times determine the window in which a Falcon job can execute. The workflow job starts executing after the scheduled time and when all the inputs are available. The workflow ends before the specified end time, so there is no workflow instance at the end time. Also known as validity time.
Frequency	How often the process is generated. Valid frequency types are minutes, hours, days, and months.
Timezone	The timezone is associated with the duration start and end times. Default timezone is UTC.

Table 7.14. Advanced Process Properties

Property	Description
Retry Policy Type	Defines how workflow failures should be handled. Options are Periodic, Exponential Backoff, and None.
Delay Up to	The time period after which a retry attempt is made. For example, an Attempts value of 3 and a Delay value of 10 minutes would cause workflow retries to occur 10 minutes, 20 minutes, and 30 minutes after the start time of the workflow.
Attempts	How many times the retry policy should be implemented before the job fails.
Max Parallel Instances	How many instances of the workflow can run concurrently. Must be greater than 0. For example, a value of 2 allows two instances of the workflow to run at the same time; the next instance starts only after one of the running instances completes.
Order	The order in which the ready instances are picked up. Options are First-in-first-out (FIFO), Last-in-first-out (LIFO), and Last_Only.
Properties	Enter the name and value for any properties you want to assign to this process.
Access Control List	Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x). The owner must be the owner of the Falcon `staging` and `working` directories.
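
The retry timing in the Delay Up to example can be reproduced with simple arithmetic. The sketch below follows only that periodic example (each retry is one more delay interval after the workflow start) and does not model the other policy types. The function name `periodic_retry_times` is hypothetical.

```shell
# Print the minutes after workflow start at which each periodic retry occurs.
# Usage: periodic_retry_times DELAY_MINUTES ATTEMPTS
periodic_retry_times() {
  i=1
  while [ "$i" -le "$2" ]; do
    echo $(( $1 * i ))
    i=$(( i + 1 ))
  done
}

periodic_retry_times 10 3   # prints 10, 20, and 30 on separate lines
```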

Click Next to view a summary of your process entity definition.
(Optional) Click Preview XML to review or edit the entity definition in XML.
After verifying the entity definition, click Save.
The entity is automatically submitted for verification, but it is not scheduled to run.
Verify that you successfully created the process entity.
1. Type the entity name in the Falcon web UI Search field and press Enter.
2. If the entity name appears in the search results, it was successfully created.
  For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
Schedule the entity.
1. In the search results, click the checkbox next to an entity name with status of Submitted.
2. Click Schedule.
  After a few seconds a success message displays.

Scheduling or Pausing an Entity Using the Web UI

When you create a Falcon entity in the web UI, the entity is automatically submitted to the cluster servers you specified when creating the entity. When submitted, the entity is added to the configuration store in Falcon and several validations are run against the entity definition. However, the entity must be manually scheduled before a job can run.

When you schedule an entity, the workflow and library you specified are copied to the staging path you created, and the scheduler references the workflow and library from that staging path.

	Note
	You must have completed the section called “Creating HDFS Directories for Falcon” prior to submitting the Falcon entities.

Use the Search field to locate the entity you want to schedule.
In the search results list, click the checkbox next to the names of the feeds or processes with status Submitted that you want to schedule and then click Schedule.
After a few seconds, you should receive a success message for each entity.
To suspend a scheduled entity, click the checkbox next to the names of the feeds or processes with status Running and then click Pause.
You should receive a success message for each entity.

Defining Entities Using the CLI

You can manually create your Falcon entity definition files and workflow files, then submit and execute them from the CLI.

Prerequisites

Before you can begin working with Falcon, ensure that you have the following components installed on your cluster:

Hortonworks Data Platform (HDP)
Apache Falcon
See Non-Ambari Cluster Installation Guide or Ambari Automated Install Guide for Falcon installation instructions.
Apache Oozie client and server

Steps

To create a data pipeline, you must perform the following actions:

Creating a Cluster Entity Definition Using the CLI

Create the cluster specification XML file, also known as a cluster entity. There are several items to define in a cluster entity.

In the example cluster entity provided below, the following properties are defined:

Colo: Name of the Data Center
Name: Name of the cluster entity
<interface>: Specify the interface type

	Important
	Ensure that the required HDFS directories have been created. Permissions on the cluster staging directory must be set to 777 (read/write/execute for owner/group/others). Only Oozie job definitions are written to the staging directory, so setting permissions to 777 does not create any vulnerability. Keep the working directory permissions set to 755 (rwx/r-x/r-x).

<?xml version="1.0"?>
<!--
   Cluster Example
 -->
<cluster colo="$MyDataCenter" description="description" name="$MyDataCenter">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.4.2" /> <!-- Required for distcp for replications. -->
    <interface type="write" endpoint="hdfs://nn:8020" version="2.4.2" /> <!-- Needed for writing to HDFS-->
    <interface type="execute" endpoint="rm:8050" version="2.4.2" /> <!-- Needed to submit jobs to MapReduce. -->
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" /> <!-- Required. Submits Oozie jobs.-->
    <interface type="registry" endpoint="thrift://hms:9083" version="0.13.0" /> <!-- Registers/deregisters partitions in the Hive Metastore and gets events on partition availability. -->
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" /> <!--Needed for alerts-->
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging" /> <!--HDFS directories used by the Falcon server-->
    <location name="temp" path="/tmp" />
    <location name="working" path="/apps/falcon/prod-cluster/working" />
  </locations>
</cluster>

	Note
	Additional properties must be set if you are configuring for a secure cluster. For more information, see "Configuring for Secure Clusters" in the Non-Ambari Cluster Installation guide.

For more information, see the Apache cluster entity XSD.

Creating a Feed Entity Definition Using the CLI

Create a dataset specification XML file, or feed entity:

Reference the cluster entity to determine which clusters the feed uses.
<frequency>: Specify the frequency of the feed.
<retention limit>: Choose a retention policy for the data to remain on the cluster.
<location>: Provide the HDFS path to the files.
Optional. Specify an Email Notification. The email notification tag must be placed before the <ACL> tag.
<ACL owner>: Specify the HDFS access permissions.
Optional. Specify a Late Data Handling cut-off.

<?xml version="1.0"?>
<!--
   Feed Example
 -->
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency> <!--Feed run frequency-->
  <late-arrival cut-off="hours(6)"/> <!-- Late arrival cut-off -->
  <groups>churnAnalysisFeeds</groups> <!--Feed group, feeds can belong to multiple groups -->
  <tags>externalSource=$MyEDW,externalTarget=Marketing</tags> <!-- Metadata tagging -->
  <clusters> <!-- Target clusters for retention and replication. -->
    <cluster name="$MyDataCenter" type="source">
      <validity start="$date" end="$date"/>
      <retention limit="days($n)" action="delete"/> <!--Currently delete is the only action available -->
    </cluster>
    <cluster name="$MyDataCenter-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations> <!-- Global location across clusters - HDFS paths or Hive tables -->
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  </locations>
  <notification type="email" to="falcon@xyz.com"/>
  <ACL owner="hdfs" group="users" permission="0755"/>  <!-- Required for HDFS. -->
  <schema location="/none" provider="none"/> <!-- Required for HDFS. -->
</feed>

For more information, see the Apache feed entity XSD.

Creating a Process Entity Definition Using the CLI

When submitting a workflow.xml file as part of a process entity definition, ensure the following:

The workflow.xml path is absolute.
The file is in a location that is readable by the user submitting the entity and by service user "falcon".
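
The first requirement, and part of the second, can be checked from the command line before you reference the workflow in a process entity. The sketch below is an assumption-laden helper, not a Falcon feature: the function names are hypothetical, and the optional HDFS check (enabled with `CHECK_HDFS=1`) only confirms that the path exists for the current user. Run it once as the submitting user and once as the falcon user to cover both accounts.

```shell
# Succeeds if the path is absolute (starts with "/").
is_absolute_path() {
  case "$1" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}

# Check a workflow path; set CHECK_HDFS=1 to also test that it exists in HDFS.
check_workflow_path() {
  if ! is_absolute_path "$1"; then
    echo "not absolute: $1"
    return 1
  fi
  if [ "${CHECK_HDFS:-0}" = "1" ] && ! hadoop fs -test -e "$1"; then
    echo "not found in HDFS: $1"
    return 1
  fi
  echo "ok: $1"
}

check_workflow_path /apps/clickstream/clean-script.pig
```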

Create the process specification XML file:

<cluster name>: Reference the cluster entity to define where the process runs.
<feed>: Reference the feed entity to define the datasets that the process uses.
Optional. Specify Late Data Handling policies or a Retry Policy.
Optional. Specify an Email Notification.

<?xml version="1.0"?>
<!--
   Process Example
 -->
<process name="process-test" xmlns="uri:falcon:process:0.1">
    <clusters>
      <cluster name="$MyDataCenter">
        <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
      </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order> <!--You can also use LIFO and LASTONLY but FIFO is recommended in most cases-->
    <frequency>days(1)</frequency>
    <inputs>
        <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
    </inputs>
    <outputs>
        <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
    </outputs>
    <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
    <retry policy="periodic" delay="minutes(10)" attempts="3"/>
    <late-process policy="exp-backoff" delay="hours(1)">
        <late-input input="input" workflow-path="/apps/clickstream/late" />
    </late-process>
    <notification type="email" to="falcon@xyz.com, falcon_2@xyz.com"/>
</process>

	Note
	LIFO and LASTONLY are also supported values for <order>.

For more information, see the Apache process entity XSD.

Submitting and Scheduling an Entity Using the CLI

After creating the Falcon entities, you must submit and schedule the entities with the cluster servers you specified when creating the entities.

When you submit an entity, the entity is added to the configuration store in Falcon and several validations are run against the entity definition.

When you schedule an entity, the workflow and library you specified are copied to the staging path you created, and the scheduler references the workflow and library from that staging path.

	Note
	You must have completed the section called “Creating HDFS Directories for Falcon” prior to submitting the Falcon entities.

You should receive a success message after submitting and scheduling each entity.

Submit your entities to Falcon, ensuring you specify the correct entity type for each file.
1. Submit your cluster entity.
  For example, to submit $sampleClusterFile.xml:
```
falcon entity -type cluster -submit -file $sampleClusterFile.xml
```
2. Submit your dataset or feed entity.
  For example, to submit $sampleFeedFile.xml:
```
falcon entity -type feed -submit -file $sampleFeedFile.xml
```
3. Submit your process entity.
  For example, to submit $sampleProcessFile.xml:
```
falcon entity -type process -submit -file $sampleProcessFile.xml
```
Schedule your feed and process entities.
1. Schedule your feed.
  For example, to schedule $feedName:
```
falcon entity -type feed -schedule -name $feedName
```
2. Schedule your process.
  For example, to schedule $processName:
```
falcon entity -type process -schedule -name $processName
```
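
The submit and schedule steps above can be chained so that scheduling happens only after every submission succeeds. The following is a minimal sketch, assuming the `falcon` client is on the PATH. The helper names `run` and `deploy_pipeline` and the file names are placeholders; `testFeed` and `process-test` are the entity names from the examples in this chapter. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Print the command when DRY_RUN=1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: deploy_pipeline CLUSTER_XML FEED_XML PROCESS_XML FEED_NAME PROCESS_NAME
deploy_pipeline() {
  # Submit in order: the cluster first, then the feed, then the process.
  run falcon entity -type cluster -submit -file "$1" || return 1
  run falcon entity -type feed -submit -file "$2" || return 1
  run falcon entity -type process -submit -file "$3" || return 1
  # Schedule only after every submission has succeeded.
  run falcon entity -type feed -schedule -name "$4" || return 1
  run falcon entity -type process -schedule -name "$5" || return 1
  # Report the resulting state of the process entity.
  run falcon entity -type process -status -name "$5"
}

deploy_pipeline cluster.xml feed.xml process.xml testFeed process-test
```

Stopping at the first failure keeps a rejected entity definition from leaving the pipeline partly scheduled.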

Your data pipeline is now deployed with the basic information necessary to run Oozie jobs, Spark jobs, Pig scripts, and Hive queries. You can now explore other sections such as Late Data Handling or Retry Policy.
