5.3.4. Percent DataNodes down alert

This alert is triggered if the percentage of DataNodes in the cluster that are down exceeds the configured critical threshold. It uses the check_aggregate plugin to aggregate the results of the DataNode process down alert checks.
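
The aggregation described above can be illustrated with a minimal sketch. This is not the check_aggregate plugin itself; the function names, the input format (a mapping from host name to a "down" flag), and the threshold value are all assumptions made for illustration.

```python
def percent_down(check_results):
    """Return the percentage of hosts whose DataNode process check failed.

    check_results maps a host name to True when that host's check reported
    the DataNode as down (a hypothetical input format for this sketch).
    """
    if not check_results:
        return 0.0
    down = sum(1 for is_down in check_results.values() if is_down)
    return 100.0 * down / len(check_results)


def alert_state(check_results, critical_pct):
    """Return "CRITICAL" when the down percentage exceeds critical_pct."""
    return "CRITICAL" if percent_down(check_results) > critical_pct else "OK"


# Hypothetical per-host results: one of four DataNodes is down.
example_results = {"dn1": True, "dn2": False, "dn3": False, "dn4": False}
# percent_down(example_results) -> 25.0
# alert_state(example_results, critical_pct=10.0) -> "CRITICAL"
```

The comparison is strict (greater than), matching the description above: the alert fires only when the percentage exceeds the critical threshold, not when it equals it.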

 5.3.4.1. Potential causes
  • The DataNodes are down.

  • The DataNodes are not down but are not listening on the correct network port or address.

  • The Nagios server cannot connect to one or more DataNodes.

 5.3.4.2. Possible remedies
  • Check for dead DataNodes in the Services list.

  • Check the DataNode logs (/var/log/hadoop/hdfs) for errors, then restart the affected DataNode hosts or processes.

  • Run the netstat -tuplpn command to check whether the DataNode process is bound to the correct network port.

  • Use ping to check the network connection between the Nagios server and the DataNodes.
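
Ping only confirms that a host answers at the network level; it does not show whether the DataNode port accepts connections. As a complement to the remedies above, the following minimal sketch attempts a TCP connection from the Nagios server to each DataNode. The host names and the port are assumptions: 50010 is a common default for the DataNode data transfer port in Hadoop 1.x and 2.x, but use the value configured for dfs.datanode.address in your cluster.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_datanodes(hosts, port, timeout=3.0):
    """Return the hosts from the given list that do not accept a connection."""
    return [h for h in hosts if not can_connect(h, port, timeout)]


# Example call with hypothetical host names and an assumed port:
# unreachable_datanodes(["dn1.example.com", "dn2.example.com"], 50010)
```

Run the check from the Nagios server so that it follows the same network path as the alert. A host that answers ping but appears in the returned list points to a port binding or firewall problem rather than a host that is down.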

