Configuring Fault Tolerance

Configuring ResourceManager High Availability

You can configure ResourceManager High Availability to avoid windows of cluster downtime.

This guide provides instructions for setting up the ResourceManager (RM) High Availability (HA) feature in a YARN cluster. The Active and Standby ResourceManagers embed the ZooKeeper-based ActiveStandbyElector, which determines which ResourceManager should be active.

Note

This guide assumes that an existing HDP cluster has been manually installed and deployed. It provides instructions on how to manually enable ResourceManager HA on top of the existing cluster.

Without HA, the ResourceManager is a single point of failure (SPOF) in a YARN cluster. Each cluster has a single ResourceManager, and if that machine or process becomes unavailable, the entire cluster is unavailable until the ResourceManager is either restarted or started on a separate machine. This situation reduces the total availability of the cluster in two major ways:

  • In the case of an unplanned event such as a machine crash, the cluster will be unavailable until an operator restarts the ResourceManager.

  • Planned maintenance events such as software or hardware upgrades on the ResourceManager machine result in windows of cluster downtime.

The ResourceManager HA feature addresses these problems. It enables you to run redundant ResourceManagers in the same cluster in an Active/Passive configuration with a hot standby. This allows either a fast failover to the standby ResourceManager after a machine crash, or a graceful administrator-initiated failover during planned maintenance.
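
As a rough illustration of what such an Active/Standby pair looks like in configuration, the following yarn-site.xml fragment is a minimal sketch, not a complete setup. The property names are those documented for Apache Hadoop 2.x YARN and should be checked against your HDP version. The IDs rm1 and rm2, the cluster ID, and all host names are hypothetical placeholders.

```xml
<!-- Sketch only: rm1, rm2, yarn-cluster, and every host name are hypothetical placeholders. -->

<!-- Turn on ResourceManager HA and name the two ResourceManagers. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2.example.com</value>
</property>

<!-- Identifies this cluster so that both ResourceManagers take part in the same election. -->
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>

<!-- ZooKeeper quorum used by the embedded ActiveStandbyElector. -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>

<!-- Let the embedded elector choose the active ResourceManager automatically. -->
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
  <value>true</value>
</property>
```

These properties belong inside the configuration element of yarn-site.xml, and the same values would normally be present on both ResourceManager hosts.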