2. Configure Tez

 2.1. Tez Configuration

Perform the following steps to configure Tez for your Hadoop cluster:

  1. Create a tez-site.xml configuration file and place it in the /etc/tez/conf configuration directory.

    A sample tez-site.xml file is included in the configuration_files/tez folder in the HDP companion files.

  2. Create the $TEZ_CONF_DIR environment variable and set it to the location of the tez-site.xml file.

    export TEZ_CONF_DIR=/etc/tez/conf
  3. Create the $TEZ_JARS environment variable and set it to the location of the Tez jars and their dependencies.

    export TEZ_JARS=/usr/lib/tez/*:/usr/lib/tez/lib/*
    Note: Be sure to include the asterisks (*) in the above command.

  4. In the tez-site.xml file, configure the tez.lib.uris property with the HDFS paths that contain the Tez jar files.

    ...
    <property>
      <name>tez.lib.uris</name>
      <value>${fs.default.name}/apps/tez/,${fs.default.name}/apps/tez/lib/</value>
    </property>
    ...

  5. Add $TEZ_CONF_DIR and $TEZ_JARS to the $HADOOP_CLASSPATH environment variable.

    export HADOOP_CLASSPATH=$TEZ_CONF_DIR:$TEZ_JARS:$HADOOP_CLASSPATH

    Where:

    • $TEZ_CONF_DIR is the location of tez-site.xml.

    • $TEZ_JARS is the location of Tez jars and their dependencies.
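The environment setup in steps 2, 3, and 5 can be consolidated into a single snippet, for example in hadoop-env.sh or a login shell profile. The paths shown assume the default HDP layout described above; adjust them if your installation differs:

```shell
# Tez environment setup (paths assume the default HDP layout)
export TEZ_CONF_DIR=/etc/tez/conf
# The asterisks must remain literal; shell variable assignment does not
# expand globs, so the classpath wildcards are preserved for Hadoop.
export TEZ_JARS='/usr/lib/tez/*:/usr/lib/tez/lib/*'
# Prepend the Tez configuration directory and jars to the Hadoop classpath
export HADOOP_CLASSPATH=$TEZ_CONF_DIR:$TEZ_JARS:$HADOOP_CLASSPATH
```

Because the Tez entries are prepended, they take precedence over anything already on $HADOOP_CLASSPATH.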

 2.2. Tez Configuration Parameters

 

Table 10.1. Tez Configuration Parameters

tez.lib.uris
    Location of the Tez jars and their dependencies. Tez applications download required jar files from this location, so it should be publicly accessible.
    Default: N/A

tez.am.log.level
    Root logging level passed to the Tez Application Master.
    Default: INFO

tez.staging-dir
    The staging directory used by Tez when application developers submit DAGs (Directed Acyclic Graphs). Tez creates all temporary files for the DAG job in this directory.
    Default: /tmp/${user.name}/staging

tez.am.resource.memory.mb
    The amount of memory in MB that YARN allocates to the Tez Application Master. The size increases with the size of the DAG.
    Default: 1536

tez.am.java.opts
    Java options for the Tez Application Master process. The -Xmx value should be less than the value of tez.am.resource.memory.mb, typically 512 MB less, to account for non-JVM memory in the process.
    Default: -server -Xmx1024m -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+UseParallelGC

tez.am.shuffle-vertex-manager.min-src-fraction
    In a shuffle operation over a scatter-gather edge connection, Tez may start data consumer tasks before all of the data producer tasks complete, in order to overlap the shuffle I/O. This parameter specifies the fraction of producer tasks that should complete before the consumer tasks are scheduled. The fraction is expressed as a decimal, so the default value of 0.2 represents 20%.
    Default: 0.2

tez.am.shuffle-vertex-manager.max-src-fraction
    In a shuffle operation over a scatter-gather edge connection, Tez may start data consumer tasks before all of the data producer tasks complete, in order to overlap the shuffle I/O. This parameter specifies the fraction of producer tasks that should complete before all consumer tasks are scheduled. The number of consumer tasks ready for scheduling scales linearly between min-fraction and max-fraction. The fraction is expressed as a decimal, so the default value of 0.4 represents 40%.
    Default: 0.4

tez.am.am-rm.heartbeat.interval-ms.max
    Determines, in milliseconds, how frequently the Tez Application Master asks the YARN Resource Manager for resources. A low value can overload the Resource Manager.
    Default: 250

tez.am.grouping.split-waves
    Specifies the number of waves, or the percentage of queue container capacity, used to process a data set, where a value of 1 represents 100% of container capacity. The Tez Application Master considers this parameter value, the available cluster resources, and the resources required by the application to calculate parallelism, or the number of tasks to run. Processing queries with additional containers leads to lower latency. However, resource contention may occur if multiple users run large queries simultaneously.
    Default: 1.4 (Tez); 1.7 (Hive)

tez.am.grouping.min-size
    Specifies the lower bound, in bytes, of the size of the primary input to each task when the Tez Application Master determines the parallelism of primary input reading tasks. This property prevents input tasks from being too small, which in turn prevents the parallelism for those tasks from being too large.
    Default: 16777216 (16 MB)

tez.am.grouping.max-size
    Specifies the upper bound, in bytes, of the size of the primary input to each task when the Tez Application Master determines the parallelism of primary input reading tasks. This property prevents input tasks from being too large, which in turn prevents their parallelism from being too small.
    Default: 1073741824 (1 GB)

tez.am.container.reuse.enabled
    A container is the unit of resource allocation in YARN. This parameter determines whether Tez reuses the same container to run multiple tasks. Enabling it improves performance by avoiding the memory overhead of reallocating container resources for every task. However, disable it if the tasks contain memory leaks or use static variables.
    Default: true

tez.am.container.reuse.rack-fallback.enabled
    Specifies whether to reuse containers for rack-local tasks. Ignored unless tez.am.container.reuse.enabled is enabled.
    Default: true

tez.am.container.reuse.non-local-fallback.enabled
    Specifies whether to reuse containers for non-local tasks. Ignored unless tez.am.container.reuse.enabled is enabled.
    Default: true

tez.am.container.session.delay-allocation-millis
    Determines when a Tez session releases its containers while not actively servicing a query. Specify a value of -1 to never release an idle container in a session. Note that containers may still be released if they do not meet resource or locality needs. Ignored unless tez.am.container.reuse.enabled is enabled.
    Default: 10000 (10 seconds)

tez.am.container.reuse.locality.delay-allocation-millis
    The amount of time, in milliseconds, to wait before assigning a container to the next level of locality. The three levels of locality, in ascending order, are NODE, RACK, and NON_LOCAL.
    Default: 250

tez.task.get-task.sleep.interval-ms.max
    Determines the maximum amount of time, in milliseconds, that a container agent waits before asking the Tez Application Master for another task. Tez runs an agent on a container in order to launch tasks remotely. A low value may overload the Application Master.
    Default: 200

tez.session.client.timeout.secs
    Specifies the amount of time, in seconds, to wait for the Application Master to start when trying to submit a DAG from the client in session mode.
    Default: 180

tez.session.am.dag.submit.timeout.secs
    Specifies the amount of time, in seconds, that the Tez Application Master waits for a DAG to be submitted before shutting down. This value is used when the Tez Application Master runs in session mode, which allows multiple DAGs to be submitted for execution. The idle time between DAG submissions should not exceed this value.
    Default: 300

tez.runtime.intermediate-output.should-compress
    Specifies whether Tez should compress intermediate output.
    Default: false

tez.runtime.intermediate-output.compress.codec
    Specifies the codec to use when compressing intermediate output. Ignored unless tez.runtime.intermediate-output.should-compress is enabled.
    Default: org.apache.hadoop.io.compress.SnappyCodec

tez.runtime.intermediate-input.is-compressed
    Specifies whether intermediate input is compressed.
    Default: false

tez.runtime.intermediate-input.compress.codec
    Specifies the codec to use when reading compressed intermediate input. Ignored unless tez.runtime.intermediate-input.is-compressed is enabled.
    Default: org.apache.hadoop.io.compress.SnappyCodec

tez.yarn.ats.enabled
    Specifies whether Tez should start the TimelineClient to send information to the YARN Timeline Server.
    Default: false
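As an example of combining related parameters from the table, the following tez-site.xml fragment enables Snappy compression for intermediate data. Note that the output-side and input-side settings must agree on the codec; the codec values shown are the defaults listed above:

```xml
<property>
  <name>tez.runtime.intermediate-output.should-compress</name>
  <value>true</value>
</property>
<property>
  <name>tez.runtime.intermediate-output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>tez.runtime.intermediate-input.is-compressed</name>
  <value>true</value>
</property>
<property>
  <name>tez.runtime.intermediate-input.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```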

 2.3. Configuring Tez with the Capacity Scheduler

You can use the tez.queue.name property to specify which queue is used for Tez jobs. The Capacity Scheduler is currently the default scheduler in HDP, but this property is not limited to the Capacity Scheduler; it applies to any YARN queue.

If no queues have been configured, the default queue will be used, which means that 100% of the cluster capacity will be used when running Tez jobs. If queues have been configured, a queue name must be configured for each YARN application.

Setting tez.queue.name in tez-site.xml would apply to Tez applications that use that configuration file. To assign separate queues for each application, you would need separate tez-site.xml files, or you could have the application pass this configuration to Tez while submitting the Tez DAG.

For example, in Hive you would use the tez.queue.name property in hive-site.xml to specify the queue to be used for Hive-on-Tez jobs. To assign Hive-on-Tez jobs to the "engineering" queue, you would add the following property to hive-site.xml:

<property>
    <name>tez.queue.name</name>
    <value>engineering</value>
</property>

Setting this configuration property in hive-site.xml will affect all Hive queries that read that configuration file.

To assign Hive-on-Tez jobs to use the "engineering" queue in a Hive query, you would use the following command in the Hive shell or in a Hive script:

set tez.queue.name=engineering;
