Ganglia Service Graphs

HDP uses Ganglia to provide service and server utilization graphs at several levels: Dashboard (statistics for the entire cluster), MapReduce, HDFS, and HBase.

Cluster HDFS I/O last hour

This graph shows the number of bytes read and written to HDFS per second, aggregated across all the data nodes. The graph provides a view of the overall load on the HDFS cluster, in terms of data movement activity at any given point in time.

Map/Reduce slot utilization last hour

This graph shows the number of map/reduce slots occupied and reserved, along with the total number of map/reduce slots available in the cluster at any given time. It provides a view into the overall utilization of cluster hardware against the Map/Reduce job workload. The graph also shows the number of slots reserved by the capacity scheduler at any given time. Typically, jobs with a high memory requirement need multiple slots on a node. The capacity scheduler reserves free slots if sufficient slots are not currently available on the node for a given job task. (By default, HDP uses the capacity scheduler.)
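As a rough illustration of how the three values in this graph relate, the following Python sketch expresses occupied, reserved, and free slots as fractions of total capacity. The function name and the sample numbers are hypothetical, and the sketch assumes that reserved slots are counted separately from occupied slots.

```python
def slot_utilization(occupied: int, reserved: int, total: int) -> dict:
    """Return occupied, reserved, and free slots as fractions of total capacity."""
    if total <= 0:
        raise ValueError("total slot capacity must be positive")
    # Assumption: reserved slots are not already included in the occupied count.
    free = max(total - occupied - reserved, 0)
    return {
        "occupied": occupied / total,
        "reserved": reserved / total,
        "free": free / total,
    }


# Hypothetical sample: 100 slots in the cluster, 60 occupied, 10 reserved.
sample = slot_utilization(occupied=60, reserved=10, total=100)
print(sample)
```

A persistently small free fraction, together with a growing reserved fraction, would be consistent with high-memory jobs waiting for slots on specific nodes.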

Cluster Map/Reduce statistics

This graph shows the average number of waiting map and reduce tasks for currently running jobs at any given point in time. Analogous to CPU load average statistics, this graph shows the load on the Map/Reduce cluster. If the load stays high, the administrator may need to provision more slot capacity to meet job latency and throughput requirements.

Best practices - Monitoring

Comparing this graph against the Slot utilization graphs may be useful. For example, a higher load on the Map/Reduce cluster may be due to a temporary reduction in the available slot capacity due to down task trackers. Comparing this graph against the Cluster HDFS I/O graph may also be useful. For example, a higher load on the Map/Reduce cluster may be due to existing running tasks putting more load on HDFS, which would be evident through increased bytes read/written in a similar time frame.
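The comparison described above can be sketched in Python as a simple heuristic over two samples taken from the three graphs. This is only an illustration: the dictionary keys, the function name, and the sample values are hypothetical, and real graphs would need to be compared over a time window rather than two points.

```python
def explain_load_increase(before: dict, after: dict) -> list:
    """Return possible explanations for a rise in waiting tasks between two samples."""
    hints = []
    if after["waiting_tasks"] <= before["waiting_tasks"]:
        return hints  # load did not increase, nothing to explain
    if after["total_slots"] < before["total_slots"]:
        hints.append("slot capacity dropped (check for down task trackers)")
    if after["hdfs_bytes_per_sec"] > before["hdfs_bytes_per_sec"]:
        hints.append("HDFS I/O increased (running tasks may be loading HDFS)")
    return hints


# Hypothetical samples read off the three graphs at two points in time.
earlier = {"waiting_tasks": 20, "total_slots": 200, "hdfs_bytes_per_sec": 5_000_000}
later = {"waiting_tasks": 80, "total_slots": 160, "hdfs_bytes_per_sec": 5_000_000}
print(explain_load_increase(earlier, later))
```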

Dashboard aggregated server utilization graphs

The following service graphs display system metrics aggregated across all the cluster slave nodes.

Best practices - Monitoring

Typically, master nodes have a higher capacity of system resources and should be monitored individually, which is why they are not included in the following graph aggregates (Nagios provides alerts for system resource usage on master nodes). These graphs are helpful in observing the overall system utilization, as well as any changes in the cluster workload characteristics in terms of their system resource usage.

Cluster load last hour

This graph shows the 1-minute load average aggregated across all the cluster nodes, the total capacity of all nodes (a dip in the graph indicates that nodes are down), the total CPU core capacity (the number of cluster nodes multiplied by the number of CPU cores per node), and the total number of running processes aggregated across all the nodes.
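The relationship between the aggregated load average and the total CPU core capacity can be illustrated with a short Python sketch. The function name and the sample figures are hypothetical, and the sketch assumes every node has the same number of cores.

```python
def load_per_core(load_one: float, nodes: int, cores_per_node: int) -> float:
    """Return the aggregated 1-minute load average divided by total CPU cores."""
    # Total CPU core capacity: number of nodes multiplied by cores per node.
    total_cores = nodes * cores_per_node
    if total_cores <= 0:
        raise ValueError("cluster must have at least one CPU core")
    return load_one / total_cores


# Hypothetical cluster: 10 nodes with 8 cores each, aggregated load of 40.
ratio = load_per_core(load_one=40.0, nodes=10, cores_per_node=8)
print(ratio)  # 0.5
```

A ratio that stays well above 1.0 would suggest more runnable processes than cores, while a drop in node count lowers the capacity line in the graph.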

Cluster memory last hour

This graph shows the total and used memory, and the available swap, buffered, cached, and shared memory aggregated across the slave nodes.

Cluster CPU last hour

This graph shows the aggregated CPU stats, including the percent usage of system, user, I/O wait, and idle CPU.

Cluster Network last hour

This graph shows the aggregated network traffic (bytes in and out) across the slave nodes.

MapReduce service graphs

The following MapReduce service graphs are available on the HDP Monitoring Dashboard:

Cluster jobs submitted last hour

This graph shows the average number of jobs submitted at any given point in time.

Cluster jobs running last hour

This graph shows the number of jobs running at any given point in time.

Cluster jobs completed last hour

This graph shows the number of jobs completed at any given point in time. The graph provides a view of the cluster job throughput. For example, under a constant workload, a lower job throughput could indicate a slow network or increased disk I/O latencies and, as a result, increased overall job latencies.

Cluster jobs failed last hour

This graph shows the average number of failed jobs at any given point in time. An increase in the failure rate may indicate a cluster-wide problem.

Cluster jobs heartbeats last hour

This graph shows the number of heartbeats per second received by the JobTracker from TaskTrackers.

Best practices - Monitoring

If no heartbeats are received by the JobTracker, the JobTracker may be down or may have lost connectivity to the slave nodes. A lower heartbeat rate can also indicate a network issue isolating slave nodes from the JobTracker.
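The interpretation above can be sketched as a small Python classifier. The function name, the baseline rate, and the 50% threshold for a "low" rate are all hypothetical choices for illustration, not values defined by HDP or Ganglia.

```python
def heartbeat_status(rate: float, baseline: float, low_fraction: float = 0.5) -> str:
    """Classify a heartbeat rate against an expected baseline rate."""
    if rate <= 0:
        # No heartbeats: the master may be down or cut off from the slave nodes.
        return "none"
    if rate < baseline * low_fraction:
        # Fewer heartbeats than expected: some slave nodes may be isolated.
        return "low"
    return "ok"


# Hypothetical baseline of 30 heartbeats per second.
for observed in (0.0, 10.0, 28.0):
    print(observed, heartbeat_status(observed, baseline=30.0))
```

The same check would apply to the NameNode heartbeat graph, with data nodes in place of TaskTrackers.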

Cluster average RPC wait time last hour

This graph shows the average wait time for JobTracker remote procedure calls (RPCs). An increase in the average wait time for RPCs indicates a slowdown in the JobTracker RPC processing, which could be due to the JVM temporarily performing garbage collection tasks.

Cluster RPC queue time number of operations last hour

This graph shows the average number of RPCs per second made to the JobTracker at any given point in time.

Cluster JVM garbage collection statistics last hour

This graph shows the time in milliseconds that the JobTracker spends on garbage collection. An increase in the time used for garbage collection can lead to a slowdown of the JobTracker RPC processing.

HDFS service graphs

The following HDFS service graphs are available:

HDFS capacity remaining last month

This graph shows the total HDFS capacity remaining at any given point in time. This is a slow-changing counter and is shown over a period of months.

Best practices - Monitoring

This graph is useful in understanding how quickly HDFS space is filling up, enabling the administrator to provision capacity in a timely manner.
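One way to turn this graph into a planning number is a linear projection of when the remaining capacity reaches zero. The Python sketch below is a minimal illustration that uses only the first and last samples; the function name and sample data are hypothetical, and real usage is rarely perfectly linear.

```python
def days_until_full(samples):
    """Estimate days until HDFS is full from (day, remaining_bytes) samples.

    Samples must be ordered oldest first. Returns None when the remaining
    capacity is not shrinking.
    """
    (first_day, first_remaining) = samples[0]
    (last_day, last_remaining) = samples[-1]
    if last_day <= first_day:
        raise ValueError("samples must span a positive time range")
    # Average bytes consumed per day over the sampled period.
    rate = (first_remaining - last_remaining) / (last_day - first_day)
    if rate <= 0:
        return None
    return last_remaining / rate


# Hypothetical readings: 100 TB remaining on day 0, 80 TB remaining on day 10.
TB = 10**12
print(days_until_full([(0, 100 * TB), (10, 80 * TB)]))  # 40.0
```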

HDFS under-replicated blocks

This graph shows the number of under-replicated blocks at any given point in time.

Best practices - Monitoring

A sudden loss of a data node or disk, or an increase in the replication factor for HDFS files, may result in under-replicated blocks. Pending replication blocks indicate that the corresponding files are under-replicated and are susceptible to corruption. The HDFS blocks pending replication graph shows the rate at which the system is able to replicate all the blocks.

HDFS blocks pending replication

This graph shows the number of HDFS block replication requests pending at any given point in time. A NameNode typically schedules replication for many blocks at a time. The graph provides a view of the current data replication load on the HDFS cluster.

Best practices - Monitoring

Having the system in this state for a long time may indicate a problem with the network, a lack of space in the remaining data nodes, or an overall high read/write request load on the data nodes.
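Detecting that the system has stayed in this state "for a long time" can be sketched in Python as a check for a run of consecutive samples above a threshold. The function name, the threshold, and the run length are hypothetical and would need tuning for a real cluster.

```python
def sustained_pending(samples, threshold, min_consecutive):
    """Return True if pending-replication counts exceed the threshold
    for at least min_consecutive samples in a row."""
    run = 0
    for value in samples:
        # Extend the current run, or reset it when the count falls back.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical per-minute samples of blocks pending replication.
readings = [0, 500, 620, 700, 0]
print(sustained_pending(readings, threshold=400, min_consecutive=3))
```

A short spike after a NameNode schedules a batch of replications would not trigger this check, while a long plateau would.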

NameNode operation counts

This is an overlay graph showing the average number of file create, delete and list operations per second at any given point in time.

NameNode heartbeats

This graph shows the number of heartbeats per second received by the NameNode from data nodes.

Best practices - Monitoring

If no heartbeats are received by the NameNode, either the NameNode is down or the NameNode has lost connectivity to the slave nodes. A lower heartbeat rate can also indicate a network issue isolating slave nodes from the NameNode.

NameNode JVM garbage collection

This graph shows the time in milliseconds that the NameNode spends on garbage collection. A NameNode keeps all of its file metadata in memory and may often need to perform garbage collection, for example, when a large number of files are deleted from the file system.

Best practices - Monitoring

An increase in the time used for garbage collection can lead to a slowdown of the NameNode RPC processing.

NameNode heap memory used

This graph shows the heap memory used by the NameNode at any given point in time. Heap space usage is critical for the NameNode, as it keeps all its file system metadata in memory.

Best practices - Monitoring

A NameNode approaching maximum heap usage indicates that HDFS is reaching its inode (number of files) capacity. An increase in heap usage will also cause a NameNode to aggressively perform garbage collection, resulting in a slowdown of client request (RPC) processing. HDFS provides tools such as Hadoop Archive (HAR) to archive a large number of files into a small number of large files, without affecting the original file view.

NameNode RPC average waiting time

This graph shows the average wait time for client requests (RPCs) in the NameNode queue.

Best practices - Monitoring

An increase in the average wait time for RPCs indicates a slowdown in the NameNode RPC processing, which could be due to the JVM temporarily performing garbage collection tasks.

HBase service graphs

The following HBase service graphs are available:

HBase Master cluster requests last hour

This graph shows the total number of requests per second to HBase at any given point in time.

HBase RegionServers read requests

This graph shows the total number of read requests aggregated across all region servers (slaves).

HBase RegionServers write requests

This graph shows the total number of write requests aggregated across all region servers (slaves).

HBase RegionServers regions served

This graph shows the number of regions served, aggregated across all the region servers (slaves). This is a slow-changing metric and is shown over a period of one month.

HBase RegionServers average read latency

This graph shows the HDFS read latency for the RegionServers, aggregated across all the region servers (slaves).

HBase RegionServers average write latency

This graph shows the HDFS write latency for the RegionServers, aggregated across all the region servers (slaves).

HBase RegionServers average JVM garbage collection

This graph shows the time in milliseconds, aggregated across all the region servers (slaves), used for garbage collection.

HBase RegionServers RPC average waiting time

This graph shows the average RPC wait time, aggregated across all the region servers (slaves).