Data Movement and Integration

Chapter 14. Using HDP for Workflow and Scheduling With Oozie

Hortonworks Data Platform deploys Apache Oozie for your Hadoop cluster.

Oozie is a server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop jobs, such as MapReduce, Pig, Hive, Sqoop, HDFS operations, and sub-workflows. Oozie supports coordinator jobs, which are sequences of workflow jobs that are created at a given frequency and start when all of the required input data is available.

A command-line client and a browser interface allow you to manage and administer Oozie jobs locally or remotely.

After installing an HDP 2.x cluster by using Ambari 1.5.x, access the Oozie web UI at the following URL:

http://{your.oozie.server.hostname}:11000/oozie
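The same checks are available from the command-line client. As a sketch (the hostname below is a placeholder for your environment), the client reads the server endpoint from the `OOZIE_URL` environment variable, so you can set it once and omit the `-oozie` option on every command:

```shell
# Compose the Oozie endpoint from the server hostname (placeholder value).
OOZIE_HOST="your.oozie.server.hostname"
OOZIE_URL="http://${OOZIE_HOST}:11000/oozie"
export OOZIE_URL
echo "OOZIE_URL=${OOZIE_URL}"

# With the Oozie client installed, these commands then use $OOZIE_URL:
#   oozie admin -status      # prints "System mode: NORMAL" when healthy
#   oozie jobs -len 10       # lists the ten most recent workflow jobs
```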

[Important]Important

The yarn-client execution mode for the Oozie Spark action is no longer supported. Oozie and Falcon continue to support yarn-cluster mode.

Setting the Oozie Client Environment

The Oozie client requires JDK 1.6 or higher and must be available on all systems where the Oozie command-line client is run. Java must be on the PATH, or $JAVA_HOME must point to a JDK/JRE version 1.6 or later.

This is a behavior change for the Oozie client from previous releases.
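The prerequisite above can be checked with a small script. This is an illustrative sketch, not part of the product: the function succeeds if `java` is on the PATH or if `$JAVA_HOME` points at a directory containing `bin/java`.

```shell
# Verify the Oozie client's Java prerequisites: succeed if java is on the
# PATH, or if JAVA_HOME points at a directory containing an executable
# bin/java; otherwise print an error and fail.
check_java_env() {
  if command -v java >/dev/null 2>&1; then
    echo "java found on PATH"
    return 0
  elif [ -n "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    echo "using JAVA_HOME=$JAVA_HOME"
    return 0
  else
    echo "ERROR: no java on PATH and JAVA_HOME is not set correctly" >&2
    return 1
  fi
}
```

Run `check_java_env || exit 1` at the top of any script that invokes the Oozie client.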

Additional Oozie Resources

For additional Oozie documentation, use the following resources:

ActiveMQ With Oozie and Falcon

If Apache Oozie and Apache Falcon communicate by using an ActiveMQ server that runs on a different host, you must configure the ActiveMQ URL in both components.

If either of the following circumstances applies to your environment, perform the indicated action.

  • If Falcon starts ActiveMQ server by default, but Oozie is running on a different host: Set the ActiveMQ server URL in Oozie.

  • If Falcon and Oozie are communicating with a standalone ActiveMQ server: Set the ActiveMQ server URL in both Oozie and Falcon.

To configure the ActiveMQ URL, add the following properties through Ambari, and then restart the Oozie and Falcon services:

  1. In Ambari, navigate to Services > Oozie > Configs.

  2. Add the following key/value pair as a property in the Custom oozie-site section.

    Key=

    oozie.jms.producer.connection.properties

    Value=

    java.naming.factory.initial#org.apache.activemq.jndi.ActiveMQInitialContextFactory;java.naming.provider.url#tcp://{ActiveMQ-server-host}:61616;connectionFactoryNames#ConnectionFactory
  3. Navigate to Services > Falcon > Configs.

  4. Add the following value for broker.url in the Falcon startup.properties section.

    *.broker.url=tcp://{ActiveMQ-server-host}:61616 
  5. Click Service Actions > Restart All to restart the Falcon service.
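The JMS property value in step 2 is long and easy to mistype. As a sketch (the ActiveMQ hostname below is a placeholder), you can compose and inspect both values with the shell before pasting them into Ambari:

```shell
# Compose the oozie.jms.producer.connection.properties value for a given
# ActiveMQ host; the three sub-properties are separated by semicolons.
AMQ_HOST="activemq.example.com"   # placeholder: your ActiveMQ server host
JMS_VALUE="java.naming.factory.initial#org.apache.activemq.jndi.ActiveMQInitialContextFactory"
JMS_VALUE="${JMS_VALUE};java.naming.provider.url#tcp://${AMQ_HOST}:61616"
JMS_VALUE="${JMS_VALUE};connectionFactoryNames#ConnectionFactory"
echo "$JMS_VALUE"

# The matching Falcon setting (step 4) uses the same host and port:
echo "*.broker.url=tcp://${AMQ_HOST}:61616"
```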

Troubleshooting:

When upgrading Falcon in HDP 2.5 or later, you might encounter the following error when starting the ActiveMQ server:

ERROR - [main:] ~ Failed to start ActiveMQ JMS Message Broker. Reason: 
java.lang.NegativeArraySizeException (BrokerService:528)

If you encounter this error, delete the ActiveMQ history and then restart Falcon, as shown in the following steps. If you want to retain the history, back it up before deleting it.

# Remove the ActiveMQ history (back it up first if you need to keep it).
cd <ACTIVEMQ_DATA_DIR>
rm -rf ./localhost

# Restart Falcon as the Falcon user. Use absolute paths here: `su -l`
# starts a login shell in that user's home directory, so relative paths
# such as ./bin/falcon-stop would not resolve.
su -l <FALCON_USER> -c '/usr/hdp/current/falcon-server/bin/falcon-stop'
su -l <FALCON_USER> -c '/usr/hdp/current/falcon-server/bin/falcon-start'

Configuring Pig Scripts to Use HCatalog in Oozie Workflows

To access HCatalog with a Pig action in an Oozie workflow, you need to modify configuration information to point to the Hive metastore URIs.

There are two methods for providing this configuration information. Which method to use depends on how many of your Pig actions access HCatalog.

Configuring Individual Pig Actions to Access HCatalog

If only a few individual Pig actions access HCatalog, do the following:

  1. Identify the URI (host and port) for the Thrift metastore server.

    1. In Ambari, click Hive > Configs > Advanced.

    2. Make note of the URI in the hive.metastore.uris field in the General section.

      This information is also stored in the hive-site.xml file.

  2. Add the following two properties to the <configuration> element of each Pig action.

    [Note]Note

    Replace [host:port(default:9083)] in the example below with the host and port for the Thrift metastore server.

    <configuration>
        <property>
            <name>hive.metastore.uris</name>
            <value>thrift://[host:port(default:9083)]</value>
            <description>A comma separated list of metastore uris the client can use to contact the
            metastore server.</description>
        </property>
        <property>
            <name>oozie.action.sharelib.for.pig</name>
            <value>pig,hive,hcatalog</value>
            <description>A comma separated list of libraries to be used by the Pig action.</description>
        </property>
    </configuration>
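For orientation, the <configuration> fragment above sits inside a Pig action in the workflow.xml file. The following is a hypothetical sketch rather than a drop-in file: the action name, the script name, and the ${jobTracker} and ${nameNode} parameters are placeholders that you would define for your own environment.

```xml
<action name="pig-hcat">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>hive.metastore.uris</name>
                <value>thrift://[host:port(default:9083)]</value>
            </property>
            <property>
                <name>oozie.action.sharelib.for.pig</name>
                <value>pig,hive,hcatalog</value>
            </property>
        </configuration>
        <script>myscript.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
</action>
```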
    

Configuring All Pig Actions to Access HCatalog

If all of your Pig actions access HCatalog, do the following:

  1. Add the following line to the job.properties file in your working directory:

    # A comma separated list of libraries to be used by the Pig action.
    oozie.action.sharelib.for.pig=pig,hive,hcatalog
    
  2. Identify the URI (host and port) for the Thrift metastore server.

    1. In Ambari, click Hive > Configs > Advanced.

    2. Make note of the URI in the hive.metastore.uris field in the General section.

      This information is also stored in the hive-site.xml file.

  3. Add the following property to the <configuration> element of each Pig action.

    [Note]Note

    Replace [host:port(default:9083)] in the example below with the host and port for the Thrift metastore server.

    <configuration>
        <property>
            <name>hive.metastore.uris</name>
            <value>thrift://[host:port(default:9083)]</value>
            <description>A comma separated list of metastore uris the client can use to contact the
            metastore server.</description>
        </property>
    </configuration>
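To tie the pieces together for the "all Pig actions" case, here is a hypothetical job.properties sketch. Every host name and path below is a placeholder for your cluster; only the sharelib line comes from the steps above.

```
# Placeholder cluster endpoints -- substitute your own hosts.
nameNode=hdfs://namenode.example.com:8020
jobTracker=resourcemanager.example.com:8050

# Location of the workflow application in HDFS (placeholder path).
oozie.wf.application.path=${nameNode}/user/${user.name}/pig-hcat-app

# Use the Oozie sharelib, and pull in the Pig, Hive, and HCatalog
# libraries for every Pig action in the workflow.
oozie.use.system.libpath=true
oozie.action.sharelib.for.pig=pig,hive,hcatalog
```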