4.8.3. Validate the failover behavior

Use the following tests to verify failover behavior. (These tests can also be used to verify that the availability monitoring can be suspended for administrative tasks.)

 Verify that NameNode failure triggers a failover

  1. Start the NameNode VM and run the HAM application configured to work with this NameNode.

  2. In HAM, start blocking LS operations.

  3. SSH to the NameNode VM and terminate the NameNode process.

     service hadoop-namenode stop

    Alternatively, identify the NameNode process ID (jps -v) and issue a kill -9 command; a scripted version of this step appears after the expected results below.

  4. Ensure that you see the following expected results:

    • In HAM, the NameNode status area (at the top of the application) should display the offline status for the NameNode. The main area should also stop displaying new text (this indicates that file system operations are blocked).

    • In the vSphere Management UI, vSphere should terminate the NameNode VM within 60-90 seconds and then start a new instance.

    • Once the NameNode service restarts, its status must be displayed in both the vSphere UI and in the status indicator of HAM.

    • The blocking operations started in HAM must now continue. The failover should not affect the client, apart from the pause while the failover completes.

    • SSH to the NameNode VM again and verify that the host name, IP address, and SSH host key have not changed.
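
For reference, step 3 can be scripted. The following is a minimal sketch, assuming a Linux VM with the jps tool on the PATH and sufficient privileges to kill the process:

    # Identify the NameNode JVM; jps -v lists running JVMs, and the
    # NameNode's main class appears as "NameNode" in its output.
    NN_PID=$(jps -v | grep -i namenode | awk '{print $1}')
    # Hard-kill the process to simulate an unclean crash.
    kill -9 "$NN_PID"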

 Verify that a hung NameNode triggers a failover

This test verifies that the VM is not failed over immediately when the NameNode process hangs. The monitor initially treats the unresponsive period as a garbage collection pause and allows a (configurable) grace period before it terminates the hung NameNode process.

  1. Start the NameNode VM and run the HAM application configured to work with this NameNode.

  2. In HAM, start non-blocking operations.

  3. SSH to the NameNode VM and identify the NameNode process.

    jps -v | grep -i namenode
  4. Suspend the NameNode process (a suspend/resume sketch appears after this test).

    kill -19 <namenode-process-id>
  5. Ensure that you see the following expected results:

    • In HAM, the NameNode status area must indicate a hung status. The non-blocking operations that were initiated will now appear to be blocked (the hung NameNode prevents them from completing).

    • In the vSphere Management UI, vSphere should terminate the NameNode VM and start a new instance after a delay of approximately 2-3 minutes.

    • In HAM, the NameNode status area should indicate offline status. The non-blocking operations should now report failure.

    • Once the NameNode service restarts, its status must be displayed in both the vSphere UI and in the status indicator of HAM.

    • The operations started in HAM will now start succeeding.

This test can be repeated while HAM performs blocking operations.

In this case, the file system operation that was active when the NameNode was suspended fails when the NameNode is restarted, and it is reported as a failure. This happens because the open socket connection breaks, and these connections are not preserved during a failover.
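
For reference, the suspend step can be scripted. The following is a minimal sketch, assuming a Linux VM (signal 19 is SIGSTOP on Linux; kill -STOP is the portable spelling, and kill -CONT resumes the process if you need to abort the test):

    # Suspend the NameNode JVM to simulate a hang; SIGSTOP cannot be
    # caught or ignored, so the process stops responding to probes.
    NN_PID=$(jps -v | grep -i namenode | awk '{print $1}')
    kill -STOP "$NN_PID"
    # To abort the test before the monitor reacts, resume the process:
    # kill -CONT "$NN_PID"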

 Verify that ESXi server failure triggers a failover

This test verifies that the HA solution detects failures of the physical hardware and triggers a failover.

  1. Start the NameNode VM in an ESXi server that is not running any other VMs.

  2. Run the HAM application configured to work against this NameNode.

  3. In HAM, start blocking LS operations (a command-line stand-in for this load is sketched after this test).

  4. Initiate a power down of the ESXi server.

  5. Ensure that you see the following expected results:

    • In HAM, the main area should stop displaying new text (this indicates that file system operations are blocked).

    • In the vSphere Management UI, once the loss of the ESXi server is detected, the NameNode VM is re-instantiated on one of the remaining ESXi servers.

    • Once the NameNode service restarts, its status must be displayed in both the vSphere UI and in the status indicator of HAM.

    • The blocked LS operation started in HAM should now continue without failures.
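
HAM drives the client load in this test. If HAM is unavailable, a crude command-line stand-in for the blocking LS operation is sketched below; it assumes a Hadoop client configured for this cluster on a machine outside the failed ESXi server:

    # Repeatedly list the HDFS root; each iteration prints a timestamp,
    # so the pause during the failover is visible in the output.
    while true; do
        date
        hadoop fs -ls /
        sleep 5
    done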

 Verify that no failover is triggered on planned shutdown of the monitor service

This test verifies that no failover is triggered when the monitor service is shut down gracefully. The NameNode can then be manipulated as part of planned management operations.

  1. Start the NameNode VM and SSH to the NameNode VM.

  2. Terminate the monitor process.

    service hmonitor-namenode-monitor stop
    Note:

    A kill -9 command for the monitor is not a graceful shutdown and will trigger a failover.

  3. Terminate the NameNode process.

    service hadoop-namenode stop
  4. Ensure that you see the following expected results:

    • In the vSphere Management UI, the NameNode VM should be live.

    • vSphere should not initiate a failover.

    • In HAM (if it is running against this NameNode), the status area should indicate offline status and the non-blocking operations should report failure.

    • The SSH connection must not be broken and the VM should be live.

  5. Restart the monitor process.

    service hmonitor-namenode-monitor start
  6. Restart the NameNode process.

    service hadoop-namenode start
  7. Verify that the NameNode health is monitored again and that failures trigger a failover as before.
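
The planned-maintenance sequence above can be summarized in a single sketch (run on the NameNode VM; it assumes only the service scripts already shown in this section):

    # Stop the monitor first so that stopping the NameNode is treated
    # as planned maintenance rather than a failure.
    service hmonitor-namenode-monitor stop
    service hadoop-namenode stop
    # ... perform administrative tasks here ...
    # Restart the monitor first; its bootstrap period tolerates the
    # NameNode still being down (see the next test).
    service hmonitor-namenode-monitor start
    service hadoop-namenode start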

 Verify that the monitor provides a bootstrap period before reporting that the NameNode is not live

This test verifies that the monitoring process includes a bootstrap period.

The bootstrap period ensures that the monitor does not immediately report a failure and trigger a restart. Instead, the monitor grants the service a bootstrap period during which probes are initially allowed to fail. This bootstrap period is configurable (see Tuning the bootstrap timeout).

  1. Start the NameNode VM and SSH to the NameNode VM.

  2. Terminate the monitor process.

    service hmonitor-namenode-monitor stop
  3. Terminate the NameNode process.

    service hadoop-namenode stop
  4. Restart the monitor process.

    service hmonitor-namenode-monitor start
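
A sketch of the complete sequence, with the expected behavior noted as comments (the length of the bootstrap window depends on your monitor configuration; see Tuning the bootstrap timeout):

    # With both services stopped, start the monitor alone.
    service hmonitor-namenode-monitor stop
    service hadoop-namenode stop
    service hmonitor-namenode-monitor start
    # Expected: probes fail during the bootstrap period, but the monitor
    # does not report a failure or trigger a VM restart. Starting the
    # NameNode within that window returns monitoring to normal:
    service hadoop-namenode start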

 Verify that no failover is triggered when the NameNode enters safe mode

This test verifies that the VM is not restarted if the NameNode enters safe mode. This allows administrative operations to be performed on a file system in safe mode without having to disable the HA services.

  1. Start the NameNode VM and SSH to the NameNode VM.

  2. Enter safe mode.

    hadoop dfsadmin -safemode enter
  3. Ensure that you see the following expected results:

    • In the vSphere UI, the NameNode VM should be live.

    • The SSH session should remain open and the VM should be live.

  4. Terminate the NameNode process.

    service hadoop-namenode stop
  5. vSphere should identify the NameNode failure and restart the VM.

Note:

This test shows that even in safe mode, a failover is triggered if the NameNode process is terminated. To avoid an automatic NameNode restart after performing safe mode operations, use service hmonitor-namenode-monitor restart to restart the monitor service; the fresh bootstrap period then tolerates the NameNode downtime (see the bootstrap test above).
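
A minimal sketch of a safe mode maintenance session that does not trigger a failover (run on the NameNode VM; it combines the commands above with the monitor restart described in the note):

    # Enter safe mode; the monitor tolerates this and the VM stays live.
    hadoop dfsadmin -safemode enter
    hadoop dfsadmin -safemode get      # should report "Safe mode is ON"
    # ... perform file system administration here ...
    hadoop dfsadmin -safemode leave
    # If the NameNode process itself must be stopped, restart the monitor
    # first so that its fresh bootstrap period tolerates the downtime:
    service hmonitor-namenode-monitor restart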

