HDFS high availability alerts
Descriptions, potential causes, and possible remedies for alerts related to HDFS high availability.
|Alert||Alert Type||Description||Potential Causes||Possible Remedies|
|JournalNode Web UI||WEB||This host-level alert is triggered if the individual JournalNode process cannot be confirmed to be up and listening on the network within the configured critical threshold, given in seconds.||
The JournalNode process is down or not responding.
The JournalNode is not down but is not listening on the correct network port/address.
|Check if the JournalNode process is running.|
|NameNode High Availability Health||SCRIPT||This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running.||The Active NameNode, the Standby NameNode, or both NameNode processes are down.||
On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode host/process using Ambari Web.
On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the correct network port.
|Percent JournalNodes Available||AGGREGATE||This service-level alert is triggered if the percentage of down JournalNodes in the cluster is greater than the configured threshold (33% warning, 50% critical). It aggregates the results of JournalNode process checks.||
JournalNodes are down.
JournalNodes are not down but are not listening on the correct network port/address.
|Check for dead JournalNodes in Ambari Web.|
|ZooKeeper Failover Controller Process||PORT||This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the network.||The ZKFC process is down or not responding.||Check if the ZKFC process is running.|
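
The checks described in the table can be approximated by hand when diagnosing an alert. The following is a minimal sketch, not Ambari's own implementation: the function names `is_listening` and `percent_down_state` are hypothetical, the host and port must be supplied from your own cluster configuration, and the strict "greater than" comparison is an assumption based on the wording of the Percent JournalNodes Available description.

```python
import socket


def is_listening(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Roughly mirrors a port-style check: the process counts as up only if
    something accepts connections on the expected address.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def percent_down_state(down, total, warn_pct=33.0, crit_pct=50.0):
    """Classify the share of down JournalNodes against warning/critical thresholds.

    Defaults follow the 33% / 50% values quoted in the table. The comparison
    is strict ("greater than"), which is an assumption.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of JournalNodes")
    pct = 100.0 * down / total
    if pct > crit_pct:
        return "CRITICAL"
    if pct > warn_pct:
        return "WARNING"
    return "OK"
```

For example, with three JournalNodes, one down node is about 33.3% and would be classified as a warning under these assumptions, while two down nodes would be classified as critical.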