Chapter 4. ResourceManager High Availability

This chapter describes how to set up the ResourceManager (RM) High Availability (HA) feature in a YARN cluster. The Active and Standby ResourceManagers embed the ZooKeeper-based ActiveStandbyElector, which determines which ResourceManager should be active.

Note

This document assumes that an existing HDP cluster has been manually installed and deployed. It provides instructions on how to manually enable ResourceManager HA on top of the existing cluster.

Without HA, the ResourceManager is a single point of failure (SPOF) in a YARN cluster. Each cluster has a single ResourceManager, and if that machine or process becomes unavailable, the entire cluster is unavailable until the ResourceManager is either restarted or started on a separate node. This affects the overall availability of the cluster in two ways:

  • Unplanned events, such as a node failure, cause the cluster to be unavailable until an operator restarts the ResourceManager.

  • Planned maintenance events, such as software or hardware upgrades on the ResourceManager node, cause periods of cluster downtime.

The ResourceManager HA feature addresses these problems by letting you run redundant ResourceManagers in the same cluster in an Active/Passive configuration with a hot standby. This allows either a fast failover to the standby ResourceManager when a node fails, or a graceful administrator-initiated failover during planned maintenance.
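To make the Active/Passive arrangement concrete, the following is a minimal sketch of the kind of yarn-site.xml settings that typically enable it. The property names are standard Apache Hadoop YARN properties. The ResourceManager IDs (rm1, rm2), the hostnames, the cluster ID, and the ZooKeeper quorum are placeholders assumed for illustration, and the exact set of properties required may differ between HDP versions.

```xml
<!-- yarn-site.xml fragment: illustrative sketch only.
     All IDs, hostnames, and the cluster ID below are hypothetical placeholders. -->

<!-- Turn on ResourceManager HA. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>

<!-- Logical IDs for the redundant ResourceManagers. -->
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>

<!-- Host for each logical ResourceManager ID. -->
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2.example.com</value>
</property>

<!-- Identifies this cluster so that the elector does not act on another cluster's state. -->
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>example-yarn-cluster</value>
</property>

<!-- ZooKeeper quorum used by the embedded ActiveStandbyElector. -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>

<!-- Let the elector fail over to the standby automatically. -->
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

With settings along these lines in place, running yarn rmadmin -getServiceState rm1 should report whether the ResourceManager with the (assumed) ID rm1 is currently active or standby.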