1. Typical Hadoop Cluster

Hadoop and HBase clusters have two types of machines: masters (the HDFS NameNode, the MapReduce JobTracker, and the HBase Master) and slaves (the HDFS DataNodes, the MapReduce TaskTrackers, and the HBase RegionServers). The DataNodes, TaskTrackers, and HBase RegionServers are co-located on the same machines for optimal data locality. In addition, HBase requires a separate coordination service, ZooKeeper, to manage the HBase cluster.
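As a concrete sketch of how the slave daemons find their masters, each role is pointed at its master through a small set of configuration properties. The snippet below uses Hadoop 1.x and HBase property names (fs.default.name, mapred.job.tracker, hbase.zookeeper.quorum); the hostnames and ports are placeholders for illustration only:

    <!-- core-site.xml: DataNodes (and all HDFS clients) locate the NameNode -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>

    <!-- mapred-site.xml: TaskTrackers locate the JobTracker -->
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker.example.com:8021</value>
    </property>

    <!-- hbase-site.xml: RegionServers and clients locate the ZooKeeper ensemble -->
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
    </property>

Because every slave carries the same client-side configuration, adding a node is a matter of installing the slave daemons and distributing these files to it.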

Hortonworks recommends separating master and slave nodes for the following reasons:

  • Task workloads on the slave nodes should be isolated from the masters.

  • Slave nodes are frequently decommissioned for maintenance.

For evaluation purposes, you can also deploy Hadoop as a single-node installation, in which all the master and slave processes reside on the same machine. Setting up a small two-node cluster is also straightforward: one node acts as both NameNode and JobTracker, and the other node acts as DataNode and TaskTracker. Clusters of three or more machines typically use a dedicated NameNode/JobTracker, and all the other nodes act as slave nodes.

Typically, a medium-to-large Hadoop cluster consists of a two- or three-level architecture built with rack-mounted servers. Each rack of servers is interconnected using a 1 Gigabit Ethernet (GbE) switch. Each rack-level switch is connected to a cluster-level switch, which is typically a larger port-density 10 GbE switch. These cluster-level switches may also interconnect with other cluster-level switches or even uplink to another level of switching infrastructure.
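Because bandwidth within a rack is higher than bandwidth across racks, Hadoop can be told about this topology so that HDFS replica placement and MapReduce task scheduling are rack-aware. Below is a minimal sketch of a rack-awareness script in the style Hadoop 1.x expects; the hostnames, rack labels, and mapping are made up for illustration:

    #!/usr/bin/env python
    # Hypothetical rack-awareness script for Hadoop 1.x. Hadoop invokes it
    # with one or more host names or IP addresses as arguments and expects
    # one rack path per argument on stdout.
    import sys

    # Static host-to-rack mapping; in practice this would be generated from
    # the data center's cabling or inventory records. All names are made up.
    RACK_MAP = {
        "slave01.example.com": "/rack1",
        "slave02.example.com": "/rack1",
        "slave03.example.com": "/rack2",
        "slave04.example.com": "/rack2",
    }

    for host in sys.argv[1:]:
        # Hosts missing from the map fall back to Hadoop's default rack.
        print(RACK_MAP.get(host, "/default-rack"))

To enable the script, point the topology.script.file.name property in core-site.xml at its location (for example, /etc/hadoop/conf/rack-topology.py) and make the file executable. With no script configured, Hadoop places every node in /default-rack and cannot take the rack-level switch boundaries into account.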