2. Add DataNodes or TaskTrackers

Use the following instructions to manually add DataNode or TaskTracker hosts:

  1. On each of the newly added slave nodes, add the HDP repository to yum:

    wget -nv http://public-repo-1.hortonworks.com/HDP-1.2.0/repos/centos6/hdp.repo -O /etc/yum.repos.d/hdp.repo
    yum clean all
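
    To confirm the repository was registered, you can list the enabled repositories on the node; a minimal check (the exact repo id is an assumption, taken from the downloaded hdp.repo file):

      # The HDP repo id may differ; check /etc/yum.repos.d/hdp.repo if nothing matches
      yum repolist enabled | grep -i hdp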
  2. On each of the newly added slave nodes, install HDFS and MapReduce.

    • On RHEL and CentOS:

      yum install hadoop hadoop-libhdfs hadoop-native
      yum install hadoop-pipes hadoop-sbin openssl
    • On SLES:

      zypper install hadoop hadoop-libhdfs hadoop-native
      zypper install hadoop-pipes hadoop-sbin openssl
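
    On either platform, a quick sanity check (a sketch, not part of the original procedure) is to confirm that the packages installed and that the Hadoop client runs:

      # Verify the Hadoop RPMs are present and the client binary responds
      rpm -qa | grep '^hadoop'
      hadoop version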

  3. On each of the newly added slave nodes, install the Snappy compression/decompression library:

    1. Check if Snappy is already installed:

      rpm -qa | grep snappy
    2. Install Snappy on the new nodes:

      • For RHEL/CentOS:

        yum install snappy snappy-devel 
      • For SLES:

        zypper install snappy snappy-devel
        ln -sf /usr/lib64/libsnappy.so /usr/lib/hadoop/lib/native/Linux-amd64-64/.
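
    To verify the library is in place, you can check the system library path and Hadoop's native library directory; a hedged sketch (the 64-bit paths below match the symlink command above):

      # Confirm libsnappy is visible to the system and to Hadoop's native library path
      ls -l /usr/lib64/libsnappy.so* /usr/lib/hadoop/lib/native/Linux-amd64-64/libsnappy.so* 2>/dev/null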
  4. Optional - Install the LZO compression library.

    • On RHEL and CentOS:

      yum install lzo-devel hadoop-lzo-native
    • On SLES:

      zypper install lzo-devel hadoop-lzo-native
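
    As an optional check (a sketch only), list the files installed by the native LZO package to confirm the shared library landed on the node:

      # Show the shared objects delivered by the hadoop-lzo-native package
      rpm -ql hadoop-lzo-native | grep '\.so'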

  5. Copy the Hadoop configurations to the newly added slave nodes and set appropriate permissions.

    • Option I: Copy Hadoop config files from an existing slave node.

      1. On an existing slave node, make a copy of the current configurations:

        tar zcvf hadoop_conf.tgz /etc/hadoop/conf              
      2. Copy this file to each of the new nodes, then on each node remove the old configuration directory and extract the copied archive:

        rm -rf /etc/hadoop/conf
        cd /
        tar zxvf $location_of_copied_conf_tar_file/hadoop_conf.tgz
        chmod -R 755 /etc/hadoop/conf
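
      If you have several new nodes, a small loop can push and unpack the archive on each of them. This is a sketch only: it assumes passwordless SSH as root and a hypothetical new_slaves.txt file with one hostname per line.

        # Hypothetical helper: distribute and unpack the config archive on each new slave
        for host in $(cat new_slaves.txt); do
          scp hadoop_conf.tgz root@$host:/tmp/
          ssh root@$host 'rm -rf /etc/hadoop/conf && tar zxvf /tmp/hadoop_conf.tgz -C / && chmod -R 755 /etc/hadoop/conf'
        done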

    • Option II: Manually add Hadoop configuration files.

      1. Download the core Hadoop configuration files from here and extract the files under the configuration_files -> core_hadoop directory to a temporary location.

      2. In the temporary directory, locate the following files and modify the properties based on your environment. Search for TODO in the files for the properties to replace.

        Table 6.1. core-site.xml

        Property | Example | Description
        fs.default.name | hdfs://{namenode.full.hostname}:8020 | Enter your NameNode hostname.
        fs.checkpoint.dir | /grid/hadoop/hdfs/snn | Comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR.

        Table 6.2. hdfs-site.xml

        Property | Example | Description
        dfs.name.dir | /grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn | Comma-separated list of paths. Use the list of directories from $DFS_NAME_DIR.
        dfs.data.dir | /grid/hadoop/hdfs/dn,/grid1/hadoop/hdfs/dn | Comma-separated list of paths. Use the list of directories from $DFS_DATA_DIR.
        dfs.http.address | {namenode.full.hostname}:50070 | Enter your NameNode hostname for HTTP access.
        dfs.secondary.http.address | {secondary.namenode.full.hostname}:50090 | Enter your SecondaryNameNode hostname.
        dfs.https.address | {namenode.full.hostname}:50470 | Enter your NameNode hostname for HTTPS access.

        Table 6.3. mapred-site.xml

        Property | Example | Description
        mapred.job.tracker | {jobtracker.full.hostname}:50300 | Enter your JobTracker hostname.
        mapred.job.tracker.http.address | {jobtracker.full.hostname}:50030 | Enter your JobTracker hostname.
        mapred.local.dir | /grid/hadoop/mapred,/grid1/hadoop/mapred | Comma-separated list of paths. Use the list of directories from $MAPREDUCE_LOCAL_DIR.
        mapreduce.tasktracker.group | hadoop | Enter your group. Use the value of $HADOOP_GROUP.
        mapreduce.history.server.http.address | {jobtracker.full.hostname}:51111 | Enter your JobTracker hostname.

        Table 6.4. taskcontroller.cfg

        Property | Example | Description
        mapred.local.dir | /grid/hadoop/mapred,/grid1/hadoop/mapred | Comma-separated list of paths. Use the list of directories from $MAPREDUCE_LOCAL_DIR.
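
      After editing, it can help to confirm that no placeholder values remain; a minimal sketch, run from the temporary directory that holds the edited files:

        # Any remaining TODO markers indicate properties that still need site-specific values
        grep -n "TODO" core-site.xml hdfs-site.xml mapred-site.xml taskcontroller.cfg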

      3. Create the config directory on all hosts in your cluster, copy in all the configuration files, and set permissions.

        rm -r $HADOOP_CONF_DIR
        mkdir -p $HADOOP_CONF_DIR
        
        <copy all the config files to $HADOOP_CONF_DIR>
        chmod a+x $HADOOP_CONF_DIR/
        chown -R $HDFS_USER:$HADOOP_GROUP $HADOOP_CONF_DIR/../
        chmod -R 755 $HADOOP_CONF_DIR/../
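
      One way to perform the copy placeholder above is to push the edited files from the temporary directory with scp; this is an illustration only, assuming a hypothetical /tmp/hdp_conf staging directory and new_slaves.txt hosts file:

        # Hypothetical example: push the edited config files to each new slave node
        for host in $(cat new_slaves.txt); do
          scp /tmp/hdp_conf/* root@$host:$HADOOP_CONF_DIR/
        done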

  6. On each of the newly added slave nodes, start HDFS:

    su - hdfs
    /usr/lib/hadoop/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode

  7. On each of the newly added slave nodes, start MapReduce:

    su - mapred
    /usr/lib/hadoop/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start tasktracker
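
    Once both daemons have been started (steps 6 and 7), you can confirm they are running on the slave by checking the process list; a minimal sketch:

      # Both the DataNode and TaskTracker JVMs should appear in the process list
      ps -ef | egrep 'DataNode|TaskTracker' | grep -v grep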

  8. Add new slave nodes.

    • To add a new NameNode slave (DataNode):

      1. On the NameNode host machine, edit the /etc/hadoop/conf/dfs.include file and add the list of slave nodes' hostnames (separated by a newline character).

        Important: Ensure that you create a new dfs.include file if the NameNode host machine does not have an existing copy of this file.

      2. On the NameNode host machine, execute the following command:

        su - hdfs -c "hadoop dfsadmin -refreshNodes"
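
        As an illustration (the hostnames are hypothetical), dfs.include simply lists one slave hostname per line; after the refresh, the cluster report can confirm that the new DataNodes registered:

          # Example dfs.include contents (hypothetical hostnames):
          #   new-slave-01.example.com
          #   new-slave-02.example.com
          # Confirm the new DataNodes appear in the NameNode's report
          su - hdfs -c "hadoop dfsadmin -report"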

    • To add a new JobTracker slave (TaskTracker):

      1. On the JobTracker host machine, edit the /etc/hadoop/conf/mapred.include file and add the list of slave nodes' hostnames (separated by a newline character).

        Important: Ensure that you create a new mapred.include file if the JobTracker host machine does not have an existing copy of this file.

      2. On the JobTracker host machine, execute the following command:

        su - mapred -c "hadoop mradmin -refreshNodes"
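
        Similarly, you can confirm that the new TaskTrackers registered with the JobTracker after the refresh; a minimal sketch (the command lists the trackers the JobTracker currently considers active):

          # List the TaskTrackers currently active on the JobTracker
          su - mapred -c "hadoop job -list-active-trackers"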

  9. Optional - Enable monitoring on the newly added slave nodes using the instructions provided here.

  10. Optional - Enable cluster alerting on the newly added slave nodes using the instructions provided here.