Hadoop MapReduce Next Generation - Fair Scheduler

Purpose

This document describes the FairScheduler, a pluggable scheduler for Hadoop which provides a way to share large clusters. NOTE: The Fair Scheduler implementation is currently under development and should be considered experimental.

Introduction

Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of scheduling multiple resource types. By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU, using the notion of Dominant Resource Fairness developed by Ghodsi et al. When there is a single app running, that app uses the entire cluster. When other apps are submitted, resources that free up are assigned to the new apps, so that each app eventually gets roughly the same amount of resources. Unlike the default Hadoop scheduler, which forms a queue of apps, this lets short apps finish in reasonable time while not starving long-lived apps. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with app priorities - the priorities are used as weights to determine the fraction of total resources that each app should get.

The scheduler organizes apps further into "queues", and shares resources fairly between these queues. By default, all users share a single queue, called "default". If an app specifically lists a queue in a container resource request, the request is submitted to that queue. It is also possible to assign queues based on the user name included with the request through configuration. Within each queue, a scheduling policy is used to share resources between the running apps. The default is memory-based fair sharing, but FIFO and multi-resource with Dominant Resource Fairness can also be configured. Queues can be arranged in a hierarchy to divide resources and configured with weights to share the cluster in specific proportions.
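
As an illustration, a MapReduce application can name its target queue through the mapreduce.job.queuename job property; the queue name below is only an example, and other application types pass the queue name in their application submission context:

<property>
  <name>mapreduce.job.queuename</name>
  <!-- example queue name; the "root." prefix is optional -->
  <value>sample_queue</value>
</property>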

In addition to providing fair sharing, the Fair Scheduler allows assigning guaranteed minimum shares to queues, which is useful for ensuring that certain users, groups or production applications always get sufficient resources. When a queue contains apps, it gets at least its minimum share, but when the queue does not need its full guaranteed share, the excess is split between other running apps. This lets the scheduler guarantee capacity for queues while utilizing resources efficiently when these queues don't contain applications.

The Fair Scheduler lets all apps run by default, but it is also possible to limit the number of running apps per user and per queue through the config file. This can be useful when a user must submit hundreds of apps at once, or in general to improve performance if running too many apps at once would cause too much intermediate data to be created or too much context-switching. Limiting the apps does not cause any subsequently submitted apps to fail, only to wait in the scheduler's queue until some of the user's earlier apps finish.

Hierarchical queues with pluggable policies

The fair scheduler supports hierarchical queues. All queues descend from a queue named "root". Available resources are distributed among the children of the root queue in the typical fair scheduling fashion. Then, the children distribute the resources assigned to them to their children in the same fashion. Applications may only be scheduled on leaf queues. Queues can be specified as children of other queues by placing them as sub-elements of their parents in the fair scheduler configuration file.
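
For instance, a minimal sketch of a two-level hierarchy in the allocation file, using hypothetical queue names (the full file format is described under "Allocation file format" below):

<allocations>
  <!-- "queue1" is a leaf queue directly under root -->
  <queue name="queue1"/>
  <queue name="parent1">
    <!-- "queue2" is a leaf under "parent1"; apps can only be scheduled on leaves -->
    <queue name="queue2"/>
  </queue>
</allocations>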

A queue's name starts with the names of its parents, with periods as separators. So a queue named "queue1" directly under the root queue would be referred to as "root.queue1", and a queue named "queue2" under a queue named "parent1" would be referred to as "root.parent1.queue2". When referring to queues, the root part of the name is optional, so queue1 could be referred to as just "queue1", and queue2 could be referred to as just "parent1.queue2".

Additionally, the fair scheduler allows setting a different custom policy for each queue to allow sharing the queue's resources in whatever way the user wants. A custom policy can be built by extending org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy. FifoPolicy, FairSharePolicy (default), and DominantResourceFairnessPolicy are built-in and can be readily used, as shown in the sketch below.
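
For example, an allocation file fragment could select a built-in policy for one queue and a custom class for another; the class name here is hypothetical:

<queue name="batch_queue">
  <!-- built-in FIFO policy -->
  <schedulingPolicy>fifo</schedulingPolicy>
</queue>
<queue name="custom_queue">
  <!-- hypothetical custom policy; the class must extend SchedulingPolicy -->
  <schedulingPolicy>com.example.MyCustomPolicy</schedulingPolicy>
</queue>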

Certain add-ons which existed in the original (MR1) Fair Scheduler are not yet supported. Among them is the use of custom policies governing priority "boosting" over certain apps.

Installation

To use the Fair Scheduler, first assign the appropriate scheduler class in yarn-site.xml:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

Configuration

Customizing the Fair Scheduler typically involves altering two files. First, scheduler-wide options can be set by adding configuration properties in the yarn-site.xml file in your existing configuration directory. Second, in most cases users will want to create a manifest file listing which queues exist and their respective weights and capacities. The location of this file is flexible - but it must be declared in yarn-site.xml via the yarn.scheduler.fair.allocation.file property listed below. A sketch of these yarn-site.xml properties follows the list.

  • yarn.scheduler.fair.allocation.file
    • Path to allocation file. An allocation file is an XML manifest describing queues and their properties, in addition to certain policy defaults. This file must be in XML format as described in the next section.
  • yarn.scheduler.fair.user-as-default-queue
    • Whether to use the username associated with the allocation as the default queue name, in the event that a queue name is not specified. If this is set to "false" or unset, all jobs have a shared default queue, called "default". Defaults to true.
  • yarn.scheduler.fair.preemption
    • Whether to use preemption. Note that preemption is experimental in the current version. Defaults to false.
  • yarn.scheduler.fair.sizebasedweight
    • Whether to assign shares to individual apps based on their size, rather than providing an equal share to all apps regardless of size. When set to true, apps are weighted by the natural logarithm of one plus the app's total requested memory, divided by the natural logarithm of 2 - that is, log2(1 + requested memory), so an app requesting 8192 mb would get a weight of roughly 13. Defaults to false.
  • yarn.scheduler.fair.assignmultiple
    • Whether to allow multiple container assignments in one heartbeat. Defaults to false.
  • yarn.scheduler.fair.max.assign
    • If assignmultiple is true, the maximum number of containers that can be assigned in one heartbeat. Defaults to -1, which sets no limit.
  • yarn.scheduler.fair.locality.threshold.node
    • For applications that request containers on particular nodes, the number of scheduling opportunities since the last container assignment to wait before accepting a placement on another node. Expressed as a float between 0 and 1, which, as a fraction of the cluster size, is the number of scheduling opportunities to pass up. The default value of -1.0 means don't pass up any scheduling opportunities.
  • yarn.scheduler.fair.locality.threshold.rack
    • For applications that request containers on particular racks, the number of scheduling opportunities since the last container assignment to wait before accepting a placement on another rack. Expressed as a float between 0 and 1, which, as a fraction of the cluster size, is the number of scheduling opportunities to pass up. The default value of -1.0 means don't pass up any scheduling opportunities.
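
As an illustration, a yarn-site.xml fragment declaring the allocation file and enabling batched container assignment might look like the following; the path and values are examples, not required settings:

<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <!-- example path -->
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
<property>
  <name>yarn.scheduler.fair.assignmultiple</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.fair.max.assign</name>
  <!-- example cap on containers assigned per heartbeat -->
  <value>3</value>
</property>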

Allocation file format

The allocation file must be in XML format. The format contains four types of elements:

  • Queue elements, which represent queues. Each may contain the following properties:
    • minResources: minimum resources the queue is entitled to, in the form "X mb, Y vcores". If a queue's minimum share is not satisfied, it will be offered available resources before any other queue under the same parent. Under the single-resource fairness policy, a queue is considered unsatisfied if its memory usage is below its minimum memory share. Under dominant resource fairness, a queue is considered unsatisfied if its usage for its dominant resource with respect to the cluster capacity is below its minimum share for that resource. If multiple queues are unsatisfied in this situation, resources go to the queue with the smallest ratio between relevant resource usage and minimum. Note that it is possible that a queue that is below its minimum may not immediately get up to its minimum when it submits an application, because already-running jobs may be using those resources.
    • maxResources: maximum resources a queue is allowed, in the form "X mb, Y vcores". A queue will never be assigned a container that would put its aggregate usage over this limit.
    • maxRunningApps: limit the number of apps from the queue to run at once
    • weight: to share the cluster non-proportionally with other queues. Weights default to 1, and a queue with weight 2 should receive approximately twice as many resources as a queue with the default weight.
    • schedulingPolicy: to set the scheduling policy of any queue. The allowed values are "fifo"/"fair"/"drf" or any class that extends org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy. Defaults to "fair". If "fifo", apps with earlier submit times are given preference for containers, but apps submitted later may run concurrently if there is leftover space on the cluster after satisfying the earlier app's requests.
    • aclSubmitApps: a list of users that can submit apps to the queue. A (default) value of "*" means that any users can submit apps. A queue inherits the ACL of its parent, so if a queue2 descends from queue1, and user1 is in queue1's ACL, and user2 is in queue2's ACL, then both users may submit to queue2.
    • minSharePreemptionTimeout: number of seconds the queue is under its minimum share before it will try to preempt containers to take resources from other queues.
  • User elements, which represent settings governing the behavior of individual users. They can contain a single property: maxRunningApps, a limit on the number of running apps for a particular user.
  • A userMaxAppsDefault element, which sets the default running app limit for any users whose limit is not otherwise specified.
  • A fairSharePreemptionTimeout element, which sets the number of seconds a queue is under its fair share before it will try to preempt containers to take resources from other queues. A fragment showing the preemption and ACL elements appears after the example file below.

    An example allocation file is given here:

    <?xml version="1.0"?>
    <allocations>
      <queue name="sample_queue">
        <minResources>10000 mb</minResources>
        <maxResources>90000 mb</maxResources>
        <maxRunningApps>50</maxRunningApps>
        <weight>2.0</weight>
        <schedulingPolicy>fair</schedulingPolicy>
        <queue name="sample_sub_queue">
          <minResources>5000 mb</minResources>
        </queue>
      </queue>
      <user name="sample_user">
        <maxRunningApps>30</maxRunningApps>
      </user>
      <userMaxAppsDefault>5</userMaxAppsDefault>
    </allocations>

    Note that for backwards compatibility with the original FairScheduler, "queue" elements can instead be named as "pool" elements.
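
    To sketch how the remaining elements fit together, the fragment below adds a submission ACL and preemption timeouts to a queue. The user names and timeout values are illustrative only, and preemption must also be enabled in yarn-site.xml (yarn.scheduler.fair.preemption) for the timeouts to have any effect:

    <allocations>
      <queue name="secure_queue">
        <!-- example users permitted to submit apps to this queue -->
        <aclSubmitApps>alice,bob</aclSubmitApps>
        <!-- seconds under minimum share before preempting from other queues -->
        <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
      </queue>
      <!-- seconds under fair share before preempting from other queues -->
      <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
    </allocations>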