1. Collect Troubleshooting Information

Use the following commands to collect specific information from a Windows-based cluster. This data helps to isolate specific deployment issues.

  1. Collect OS information: This data helps to determine if HDP is deployed on a supported operating system (OS).

    Execute the following command in PowerShell as an Administrator:

    (Get-WmiObject -class Win32_OperatingSystem).Caption

    This command returns the name of the OS on your host machine. For example:

    Microsoft Windows Server 2012 Standard

    Execute the following command to determine the OS version of your host machine:

    [System.Environment]::OSVersion.Version

  2. Determine installed software: This data can be used to troubleshoot performance issues or unexpected behavior on a specific node in your cluster. An example of unexpected behavior is a MapReduce job that runs longer than expected.

    To see the list of installed software on a particular host machine, go to Control Panel -> All Control Panel Items -> Programs and Features.

  3. Detect running processes: This data can be used to troubleshoot performance issues or unexpected behavior on a specific node in your cluster.

    You can either press CTRL + SHIFT + ESC on the affected host machine to open Task Manager, or execute the following command in PowerShell as an Administrator:

    tasklist

  4. Detect running Java processes: Use this command to verify the Hadoop processes running on a specific machine.

    As $HADOOP_USER, execute the following command on the affected host machine:

    su $HADOOP_USER
    jps

    You should see output similar to the following:

    988 Jps
    2816 -- process information unavailable
    2648 -- process information unavailable
    1768 -- process information unavailable

    Note that jps does not report a name for any of the Hadoop processes. To identify them, map the process IDs (PIDs) from the output of this command to the .wrapper file within the C:\hdp\hadoop-1.1.0-SNAPSHOT\bin directory.

    Note

    If the Java bin directory is not set within your PATH, provide the complete path to the Java executable.
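
Collecting the unnamed PIDs from the jps output can be scripted before you look them up. The following Python sketch is one possible way to do it; the function name `unnamed_pids` is hypothetical, and the sample text is the jps output shown in this step.

```python
def unnamed_pids(jps_output):
    """Return the PIDs that jps lists without a usable process name."""
    pids = []
    for line in jps_output.splitlines():
        parts = line.split(None, 1)  # PID first, then the rest of the line
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        pid, name = parts
        if "process information unavailable" in name:
            pids.append(int(pid))
    return pids


# Sample input: the jps output shown in this step.
sample = """\
988 Jps
2816 -- process information unavailable
2648 -- process information unavailable
1768 -- process information unavailable
"""

print(unnamed_pids(sample))  # [2816, 2648, 1768]
```

Each PID in the returned list still has to be matched by hand against the .wrapper file, as described above.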

  5. Detect Java heap allocation and usage: Use the following command to list Java heap information for a specific Java process. This data can be used to verify the heap settings and to determine whether a particular Java process is approaching its heap limit.

    Execute the following command on the affected host machine:

    jmap -heap $pid_of_Hadoop_process
    

    For example, you should see output similar to the following:

    C:\hdp\hadoop-1.1.0-SNAPSHOT>jmap -heap 2816
    Attaching to process ID 2816, please wait...
    Debugger attached successfully.
    Server compiler detected.
    JVM version is 20.6-b01
    
    using thread-local object allocation.
    Mark Sweep Compact GC
    
    Heap Configuration:
       MinHeapFreeRatio = 40
       MaxHeapFreeRatio = 70
       MaxHeapSize      = 4294967296 (4096.0MB)
       NewSize          = 1310720 (1.25MB)
       MaxNewSize       = 17592186044415 MB
       OldSize          = 5439488 (5.1875MB)
       NewRatio         = 2
       SurvivorRatio    = 8
       PermSize         = 21757952 (20.75MB)
       MaxPermSize      = 85983232 (82.0MB)
    
    Heap Usage:
    New Generation (Eden + 1 Survivor Space):
       capacity = 10158080 (9.6875MB)
       used     = 4490248 (4.282234191894531MB)
       free     = 5667832 (5.405265808105469MB)
       44.203707787298384% used
    Eden Space:
       capacity = 9043968 (8.625MB)
       used     = 4486304 (4.278472900390625MB)
       free     = 4557664 (4.346527099609375MB)
       49.60548290307971% used
    From Space:
       capacity = 1114112 (1.0625MB)
       used     = 3944 (0.00376129150390625MB)
       free     = 1110168 (1.0587387084960938MB)
       0.35400390625% used
    To Space:
       capacity = 1114112 (1.0625MB)
       used     = 0 (0.0MB)
       free     = 1114112 (1.0625MB)
       0.0% used
    tenured generation:
       capacity = 55971840 (53.37890625MB)
       used     = 36822760 (35.116920471191406MB)
       free     = 19149080 (18.261985778808594MB)
       65.7880105424442% used
    Perm Generation:
       capacity = 21757952 (20.75MB)
       used     = 20909696 (19.9410400390625MB)
       free     = 848256 (0.8089599609375MB)
       96.10139777861446% used
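
To spot a heap region that is close to full without reading the whole report, the "Heap Usage" part of the jmap output can be parsed. The following Python sketch is one way to do this under stated assumptions: the function names `heap_usage` and `over_threshold` are hypothetical, the 90 percent limit is an arbitrary illustrative value, and the sample is an excerpt of the output shown above.

```python
import re


def heap_usage(jmap_output):
    """Map each section of the 'Heap Usage' report to its percent used."""
    usage = {}
    section = None
    in_usage = False
    for raw in jmap_output.splitlines():
        line = raw.strip()
        if line == "Heap Usage:":
            in_usage = True  # ignore everything before this header
            continue
        if not in_usage or not line:
            continue
        if line.endswith(":"):
            section = line[:-1]  # for example "Perm Generation"
            continue
        match = re.fullmatch(r"([0-9.]+)% used", line)
        if match and section:
            usage[section] = float(match.group(1))
    return usage


def over_threshold(usage, limit=90.0):
    """Return the sections above `limit` percent (an assumed threshold)."""
    return sorted(name for name, pct in usage.items() if pct > limit)


# Sample input: an excerpt of the jmap output shown above.
sample = """\
Heap Usage:
tenured generation:
   capacity = 55971840 (53.37890625MB)
   used     = 36822760 (35.116920471191406MB)
   free     = 19149080 (18.261985778808594MB)
   65.7880105424442% used
Perm Generation:
   capacity = 21757952 (20.75MB)
   used     = 20909696 (19.9410400390625MB)
   free     = 848256 (0.8089599609375MB)
   96.10139777861446% used
"""

print(over_threshold(heap_usage(sample)))  # ['Perm Generation']
```

In the sample output, only the Perm Generation (about 96% used) exceeds the assumed limit.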
    

  6. Show open files: Use Process Explorer to determine which processes hold a lock on a specific file. See Windows Sysinternals - Process Explorer for information on using Process Explorer.

    For example, you can use Process Explorer to troubleshoot file lock issues that prevent a particular process from starting.

  7. Verify well-formed XML:

    Ensure that the Hadoop configuration files (for example, hdfs-site.xml) are well formed.

    You can use Notepad++ or a third-party tool such as Oxygen or XML Spy to validate the configuration files. For example, in Notepad++:

    1. Open the XML file to be validated in Notepad++ and select XML Tools -> Check XML Syntax.

    2. Resolve validation errors, if any.
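
If no editor is available on the host, the same well-formedness check can be approximated with a short script. The following Python sketch uses the standard `xml.etree.ElementTree` parser; the function name `check_well_formed` is hypothetical and the two XML strings are made-up samples, not taken from an actual cluster configuration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text is well-formed, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)  # includes the line and column of the problem
    return None


# Illustrative samples: the second one is missing a closing </property> tag.
good = ("<configuration><property><name>dfs.replication</name>"
        "<value>3</value></property></configuration>")
bad = "<configuration><property><name>dfs.replication</name></configuration>"

print(check_well_formed(good))  # None
print(check_well_formed(bad) is not None)  # True
```

This only verifies that the XML is well formed; it does not check that property names or values are valid for Hadoop. To check a file on disk, read its contents and pass them to the function, or call `ET.parse` with the file path.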

  8. Detect AutoStart Programs: This information helps to isolate errors for a specific host machine.

    For example, a port conflict between an auto-started process and an HDP process might prevent one of the HDP components from launching.

    Ideally, the cluster administrator should have the information on auto-start programs at hand. Use the following command to launch the System Configuration utility on the affected host machine:

    C:\Windows\System32\msconfig.exe

    Click the Startup tab. Ensure that no startup items are enabled on the affected host machine.

  9. Collect list of all mounts on the machine: This information determines the drives that are actually mounted or available on the host machine for use. To troubleshoot disk capacity issues, use this command to determine if the system is violating any storage limitations.

    Execute the following command in PowerShell:

    Get-Volume

    You should see output similar to the following:

    DriveLetter       FileSystemLabel  FileSystem       DriveType        HealthStatus        SizeRemaining             Size
    -----------       ---------------  ----------       ---------        ------------        -------------             ----
                      System Reserved  NTFS             Fixed            Healthy                  108.7 MB           350 MB
    C                                  NTFS             Fixed            Healthy                  10.74 GB         19.97 GB
    D                 HRM_SSS_X64FR... UDF              CD-ROM           Healthy                       0 B          3.44 GB
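
A capacity check like the one described in this step can also be scripted. The following Python sketch uses `shutil.disk_usage` from the standard library to report how much of a volume is still free; the function names and the 10 percent default limit are assumptions chosen for illustration, not limits defined by HDP.

```python
import shutil


def free_percent(path):
    """Return the percentage of free space on the volume that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.free / usage.total


def is_low_on_space(path, min_free_percent=10.0):
    """True if free space is below `min_free_percent` (an assumed limit)."""
    return free_percent(path) < min_free_percent


# "." checks the volume of the current directory; on a Windows host you
# could pass a drive root such as "C:\\" instead.
print(round(free_percent("."), 1))
print(is_low_on_space("."))
```

The printed values depend on the machine on which the script runs.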
    

  10. Operating system messages: Use Event Viewer to review messages logged by the system or by an application.

    Event Viewer can show whether a machine was rebooted or shut down at a particular time. Use the logs to isolate issues for HDP services that were non-operational during a specific period.

    Go to Control Panel -> All Control Panel Items -> Administrative Tools and click the Event Viewer icon.

  11. Hardware/system information: Use this information to isolate hardware issues on the affected host machine.

    Go to Control Panel -> All Control Panel Items -> Administrative Tools and click the System Information icon.

  12. Network information: Use the following commands to troubleshoot network issues.

    • ipconfig: This command provides the IP address, validates if the network interfaces are available, and also validates if an IP address is bound to the interfaces. To troubleshoot communication issues between the host machines in your cluster, execute the following command on the affected host machine:

      ipconfig

      You should see output similar to the following:

      Windows IP Configuration
      
      Ethernet adapter Ethernet 2:
      
         Connection-specific DNS Suffix  . :
         Link-local IPv6 Address . . . . . : fe80::d153:501e:5df0:f0b9%14
         IPv4 Address. . . . . . . . . . . : 192.168.56.103
         Subnet Mask . . . . . . . . . . . : 255.255.255.0
         Default Gateway . . . . . . . . . : 192.168.56.100
      
      Ethernet adapter Ethernet:
      
         Connection-specific DNS Suffix  . : test.tesst.com
         IPv4 Address. . . . . . . . . . . : 10.0.2.15
         Subnet Mask . . . . . . . . . . . : 255.255.255.0
         Default Gateway . . . . . . . . . : 10.0.2.2
      

    • netstat -ano: This command lists the ports in use on the system, together with the ID of the process that owns each one. Use this command to troubleshoot launch issues with HDP master processes. Execute the following command on the host machine to identify potential port conflicts:

      netstat -ano

      You should see output similar to the following:

        TCP    0.0.0.0:49154          0.0.0.0:0              LISTENING       752
        TCP    [::]:49154             [::]:0                 LISTENING       752
        UDP    0.0.0.0:500            *:*                                    752
        UDP    0.0.0.0:3544           *:*                                    752
        UDP    0.0.0.0:4500           *:*                                    752
        UDP    10.0.2.15:50461        *:*                                    752
        UDP    [::]:500               *:*                                    752
        UDP    [::]:4500              *:*                                    752
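
When the netstat output is long, finding the owner of a single port by eye is error-prone. The following Python sketch filters the output for TCP listeners on one port and returns the owning PIDs; the function name `listeners_on_port` is hypothetical, and the sample is an excerpt of the output shown above.

```python
def listeners_on_port(netstat_output, port):
    """Return sorted PIDs that netstat -ano shows as LISTENING on `port`."""
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have five fields: protocol, local, foreign, state, PID.
        if len(fields) != 5 or fields[0] != "TCP" or fields[3] != "LISTENING":
            continue
        # The port follows the last colon for both IPv4 and [::] addresses.
        local_port = fields[1].rsplit(":", 1)[-1]
        if local_port == str(port):
            pids.add(int(fields[4]))
    return sorted(pids)


# Sample input: an excerpt of the netstat -ano output shown above.
sample = """\
  TCP    0.0.0.0:49154          0.0.0.0:0              LISTENING       752
  TCP    [::]:49154             [::]:0                 LISTENING       752
  UDP    0.0.0.0:500            *:*                                    752
"""

print(listeners_on_port(sample, 49154))  # [752]
```

A returned PID can then be matched against the process list from the tasklist command to find out which program holds the port.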
      

    • Verify whether the firewall is enabled on the host machine: Go to Control Panel -> All Control Panel Items -> Windows Firewall to view the current firewall state.