Managing Data Operating System

Configure GPU Scheduling and Isolation

On an Ambari cluster, you can configure GPU scheduling and isolation. On a non-Ambari cluster, you must configure certain properties in the capacity-scheduler.xml, resource-types.xml, and yarn-site.xml files. Currently only Nvidia GPUs are supported in YARN.

  • The Nvidia drivers must be installed on each YARN NodeManager host.

Enable GPU scheduling and isolation on an Ambari cluster

  1. Select YARN > CONFIGS on the Ambari dashboard.
  2. Click GPU Scheduling and Isolation under GPU.
  3. In the Absolute path of nvidia-smi on NodeManagers field, enter the absolute path to the nvidia-smi GPU discovery executable. For example, /usr/local/bin/nvidia-smi
  4. Click Save, and then restart all the cluster components that require a restart.
If the NodeManager fails to start, and you see the following error:
INFO  gpu.GpuDiscoverer (GpuDiscoverer.java:initialize(240)) - Trying to discover GPU information ...
WARN  gpu.GpuDiscoverer (GpuDiscoverer.java:initialize(247)) - Failed to discover GPU information from system, exception message:ExitCodeException exitCode=12:  continue...

Export LD_LIBRARY_PATH in the yarn-env.sh file using the following command:

export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$LD_LIBRARY_PATH

Enable GPU scheduling and isolation on a non-Ambari cluster

DominantResourceCalculator must be configured before you enable GPU scheduling and isolation. Configure the following property in the /etc/hadoop/conf/capacity-scheduler.xml file:

Property: yarn.scheduler.capacity.resource-calculator

Value: org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
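Written out in the same form as the other examples in this section, the capacity-scheduler.xml entry might look like the following. This is a minimal sketch that shows only this one property; an existing capacity-scheduler.xml file will normally contain other properties that should be left in place.

```xml
<configuration>
  <!-- Use DominantResourceCalculator so that resources other than memory
       (such as GPUs) are taken into account when scheduling -->
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
  </property>
</configuration>
```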

  1. Enable GPU scheduling in the /etc/hadoop/conf/resource-types.xml file on the ResourceManager and NodeManager hosts:

    Property: yarn.resource-types

    Value: yarn.io/gpu

    Example:

    <configuration>
      <property>
         <name>yarn.resource-types</name>
         <value>yarn.io/gpu</value>
      </property>
    </configuration>
  2. Enable GPU isolation in the /etc/hadoop/conf/yarn-site.xml file on the NodeManager host:

    Property: yarn.nodemanager.resource-plugins

    Value: yarn.io/gpu

    Example:

    <configuration>
      <property>
         <name>yarn.nodemanager.resource-plugins</name>
         <value>yarn.io/gpu</value>
      </property>
    </configuration>
  3. Set the following advanced properties in the /etc/hadoop/conf/yarn-site.xml file on the NodeManager host:
    • To allow GPU devices:

      Property: yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices

      Value: auto

      Note
      The auto setting enables YARN to automatically detect and manage GPU devices. For other options, see YARN-7223.
    • To allow the YARN NodeManager to locate the discovery executable:

      Property: yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables

      Value: <absolute_path_to_nvidia-smi_binary>
      Note
      Supports only nvidia-smi.

      Example: /usr/local/bin/nvidia-smi

  4. Set the following property in the /etc/hadoop/conf/yarn-site.xml file on the NodeManager host to automatically mount cgroup sub-devices:
    • Property: yarn.nodemanager.linux-container-executor.cgroups.mount

      Value: true

  5. Set the following configuration in the /etc/hadoop/conf/container-executor.cfg file to run GPU applications in a non-Docker environment:
    • In the GPU section, set:

      Property: module.enabled=true

    • In the cgroups section, set:

      Property: root=/sys/fs/cgroup

      Note
      This should be the same as yarn.nodemanager.linux-container-executor.cgroups.mount-path in the yarn-site.xml file.

      Property: yarn-hierarchy=yarn

      Note
      This should be the same as yarn.nodemanager.linux-container-executor.cgroups.hierarchy in the yarn-site.xml file.
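
Taken together, the yarn-site.xml settings from steps 2 through 4 might look like the following on a NodeManager host. This is a sketch rather than a complete file. The nvidia-smi path is the example value used in step 3. The last two properties are an assumption: they are included only to show values that would agree with the root and yarn-hierarchy settings in step 5, so adjust them to match your own container-executor.cfg file.

```xml
<configuration>
  <!-- Step 2: enable the GPU resource plugin on the NodeManager -->
  <property>
    <name>yarn.nodemanager.resource-plugins</name>
    <value>yarn.io/gpu</value>
  </property>

  <!-- Step 3: let YARN detect and manage GPU devices automatically -->
  <property>
    <name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name>
    <value>auto</value>
  </property>

  <!-- Step 3: absolute path to nvidia-smi (example path; adjust per host) -->
  <property>
    <name>yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables</name>
    <value>/usr/local/bin/nvidia-smi</value>
  </property>

  <!-- Step 4: automatically mount cgroup sub-devices -->
  <property>
    <name>yarn.nodemanager.linux-container-executor.cgroups.mount</name>
    <value>true</value>
  </property>

  <!-- Assumed values, chosen to agree with root=/sys/fs/cgroup and
       yarn-hierarchy=yarn in container-executor.cfg (step 5) -->
  <property>
    <name>yarn.nodemanager.linux-container-executor.cgroups.mount-path</name>
    <value>/sys/fs/cgroup</value>
  </property>
  <property>
    <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
    <value>yarn</value>
  </property>
</configuration>
```

Restart the NodeManager after changing these files so that the new settings take effect.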