Verify that a hung NameNode triggers a failover

This test verifies that the VM does not fail immediately after the NameNode process hangs. The monitor initially treats the unresponsive period as a garbage collection-related pause and allows a (configurable) grace period before it terminates the hung NameNode process.

  1. Start the NameNode VM and run the HAM application configured to work with this NameNode.

  2. In HAM, start non-blocking operations.

  3. SSH to the NameNode VM and identify the NameNode process.

    jps -v | grep namenode

  4. Suspend the NameNode process.

    kill -19 namenode-process-id-here

  5. Ensure that you see the following expected results:

    • In HAM, the NameNode status area must indicate hung status. The non-blocking operations already in progress will now appear to be blocked (the hung NameNode prevents them from completing).

    • In the vSphere Management UI, vSphere should terminate the NameNode VM and start a new instance after a delay of approximately 2-3 minutes.

    • In HAM, the NameNode status area should indicate offline status. The non-blocking operations should now report failure.

    • Once the NameNode service restarts, its status must be displayed both in the vSphere UI and in the HAM status indicator.

    • The operations started in HAM will now start succeeding.
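
Steps 3 and 4 can also be scripted. The following is a minimal sketch, not part of the product: the function names are hypothetical, and it assumes that `jps` is on the PATH of the user that owns the NameNode process and that `kill -19` corresponds to SIGSTOP on the VM (true on most Linux platforms). The resume helper is not part of the documented procedure; it is included only as a way to abandon the test before the monitor acts.

```shell
#!/bin/sh
# Hypothetical helper functions for steps 3 and 4 of this test.

# Print the PID of the first JVM whose `jps -v` line contains "namenode".
# Same filter as step 3; check the output by hand if several JVMs match.
find_namenode_pid() {
    jps -v | grep namenode | awk '{ print $1; exit }'
}

# Suspend a process by PID using SIGSTOP (what `kill -19` sends on most
# Linux platforms). The process stays alive but stops responding.
suspend_process() {
    kill -STOP "$1"
}

# Resume a suspended process with SIGCONT (assumption: only needed if you
# want to abort the test before the monitor terminates the process).
resume_process() {
    kill -CONT "$1"
}

# Print the process state as reported by ps; a value starting with "T"
# means the process is stopped.
process_state() {
    ps -o stat= -p "$1" | tr -d ' '
}
```

For example, `pid=$(find_namenode_pid)` followed by `suspend_process "$pid"` reproduces steps 3 and 4, and `process_state "$pid"` can be used to confirm that the process is stopped before you watch HAM and the vSphere UI for the expected results.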

This test can be repeated when HAM performs blocking operations.

In this case, the active filesystem operation (the operation in progress when the NameNode was suspended) fails once the NameNode is restarted, and is reported as a failure. The failure occurs because the open socket connection breaks; such connections are not preserved during a failover.
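
Because the broken connection is not restored, a client that needs the operation to complete has to issue it again after the failover. The following is a minimal sketch of such a retry wrapper. The function name and its parameters are hypothetical, and this is not how HAM itself behaves; it only illustrates reissuing a command until the restarted NameNode answers.

```shell
#!/bin/sh
# Hypothetical retry wrapper: run a command up to <attempts> times,
# sleeping <delay> seconds between failed attempts.
# Usage: retry_command <attempts> <delay-seconds> <command> [args...]
retry_command() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Run the command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        # Wait before the next attempt, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    # Every attempt failed.
    return 1
}
```

For example, `retry_command 10 30 hadoop fs -ls /` would reissue a directory listing every 30 seconds for up to ten attempts. The attempt count and delay are assumptions chosen to cover the approximate 2-3 minute restart delay described above; adjust them for your environment.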