3. Balanced Workload Deployments

When a team is just starting with Hadoop or HBase, begin small and gain experience by measuring actual workloads during a pilot project. We recommend starting with a relatively small pilot cluster, provisioned for a “balanced” workload.

For pilot deployments, you can start with 1U/machine and use the following recommendations:

Two quad-core CPUs | 12 GB to 24 GB memory | Four to six disk drives of 2 terabyte (TB) capacity each.

The minimum network requirement is 1 GigE all-to-all, which can be easily achieved by connecting all of your nodes to a Gigabit Ethernet switch. To keep a spare socket available for adding CPUs in the future, you can also consider using a single six-core or eight-core CPU instead.
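As a rough sizing aid for the disk figures above, the following sketch estimates usable HDFS capacity for a pilot cluster. The replication factor of 3 is the HDFS default; the 25% reserve for temporary and intermediate data and the function name `usable_capacity_tb` are illustrative assumptions, not values from this guide.

```python
# Rough storage sizing for the pilot node spec (four to six 2 TB drives).
# Assumptions, not from the guide: a replication factor of 3 (the HDFS
# default) and 25% of raw space reserved for temporary/intermediate data.

def usable_capacity_tb(nodes, disks_per_node, disk_tb=2.0,
                       replication=3, reserved_fraction=0.25):
    """Return the approximate usable HDFS capacity of the cluster in TB."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb * (1.0 - reserved_fraction) / replication


if __name__ == "__main__":
    # Compare the low and high end of the recommended disk count
    # for a hypothetical 10-node pilot cluster.
    for disks in (4, 6):
        capacity = usable_capacity_tb(nodes=10, disks_per_node=disks)
        print(f"{disks} disks/node: about {capacity:.1f} TB usable")
```

Adjust `reserved_fraction` and `replication` to match your own configuration; the point is only that usable capacity is a fraction of the raw disk total.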

For small to medium HBase clusters, provide each ZooKeeper server with around 1 GB of RAM and, if possible, its own disk.

Jumpstart - Hadoop Cluster

One way to quickly deploy a Hadoop cluster is to opt for “cloud trials” or use virtual infrastructure. Hortonworks makes the distribution available through Hortonworks Data Platform (HDP). HDP can be easily installed in public and private clouds using Whirr, Microsoft Azure, and Amazon Web Services.

However, note that cloud services and virtual infrastructures are not architected for Hadoop. In this case, Hadoop and HBase deployments might experience poor performance due to virtualization and suboptimal I/O architecture.

Tracking resource usage for pilot deployments

Hortonworks recommends that you monitor your pilot cluster using Ganglia, Nagios, or another performance monitoring framework that may already be in use in your data center. You can also use the following guidelines to monitor your Hadoop and HBase clusters:

  • Measure resource usage for CPU, RAM, disk I/O operations per second (IOPS), and network packets sent and received. Run the actual kinds of query or analysis jobs that are of interest to your team.

  • Ensure that your data subset is scaled to the size of your pilot cluster.

  • Analyze the monitoring data for resource saturation. Based on this analysis, you can categorize your jobs as CPU bound, Disk I/O bound, or Network I/O bound.

    Note

    Most Java applications expand RAM usage to the maximum allowed. However, such jobs should not be analyzed as memory bound unless swapping occurs or the JVM experiences full-memory garbage collection events. (Full-memory garbage collection events typically show up as periods in which the node appears to cease all useful work for several minutes at a time.)

  • Optionally, customize your job parameters, hardware, or network configuration to balance resource usage. If your jobs are spread evenly across the workload patterns, you may also choose to adjust only the job parameters and keep the hardware choices “balanced”.

  • For HBase clusters, you should also analyze ZooKeeper, because network and memory problems for HBase are often detected first in ZooKeeper.
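The categorization step in the guidelines above can be sketched as follows. This is a minimal illustration, assuming utilization samples have already been exported from your monitoring tool as values between 0 and 1; the threshold values, the `min_fraction` rule, and the function names are hypothetical and should be tuned to your environment. In line with the note above, memory is flagged only when swapping or full garbage collection events were observed.

```python
# Hypothetical saturation thresholds: the guide does not define exact values.
SATURATION = {"cpu": 0.85, "disk_io": 0.85, "network": 0.85}


def classify_job(samples, thresholds=SATURATION, min_fraction=0.5):
    """Return the resources that bound a job.

    Each sample is a dict mapping a resource name to its utilization in
    [0, 1]. A resource counts as a bottleneck when it is at or above its
    threshold in at least `min_fraction` of the samples.
    """
    if not samples:
        return []
    bound = []
    for resource, limit in thresholds.items():
        saturated = sum(1 for s in samples if s.get(resource, 0.0) >= limit)
        if saturated / len(samples) >= min_fraction:
            bound.append(resource)
    return bound


def is_memory_bound(swap_events, full_gc_events):
    """Treat a job as memory bound only if swapping or full GC was seen."""
    return swap_events > 0 or full_gc_events > 0


if __name__ == "__main__":
    job_samples = [
        {"cpu": 0.95, "disk_io": 0.30, "network": 0.10},
        {"cpu": 0.90, "disk_io": 0.40, "network": 0.20},
        {"cpu": 0.50, "disk_io": 0.35, "network": 0.15},
    ]
    print("Bound by:", classify_job(job_samples))
    print("Memory bound:", is_memory_bound(swap_events=0, full_gc_events=0))
```

A job may be reported as bound by more than one resource, or by none, in which case the cluster is not saturated for that workload.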

Using Hortonworks Data Platform (HDP) Monitoring Dashboard

You can also use the HDP Monitoring Dashboard to monitor key metrics and alerts for your Hadoop clusters. HDP Monitoring Dashboard provides out-of-the-box integration with Ganglia and Nagios. For more details, see: Hortonworks Data Platform.

Challenges - Tuning job characteristics to resource usage

Relating job characteristics to resource requirements is tricky for a variety of reasons that we can only touch on briefly here. The way in which a job is coded or the job data is represented can have a large impact on resource balance. For example, resource cost can be shifted between disk IOPS and CPU by the choice of compression scheme or parsing format, and per-node CPU and disk activity can be traded for inter-node bandwidth by the implementation of the Map/Reduce strategy.

Furthermore, Amdahl’s Law shows how resource requirements can change in grossly non-linear ways with changing demands: a change that might be expected to reduce computation cost by 50% may instead cause a 10% change or a 90% change in net performance.
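To make the non-linearity concrete, here is a minimal sketch of the basic form of Amdahl's Law. The function name and the example fractions are illustrative assumptions; the sketch does not model interactions between resources, such as a bottleneck moving from CPU to disk after a change.

```python
def net_runtime_fraction(affected_fraction, component_speedup):
    """Amdahl's Law: fraction of the original runtime that remains after
    the affected portion of a job is sped up by `component_speedup`.

    affected_fraction: share of the original runtime spent in the part
        being improved, in [0, 1].
    component_speedup: factor by which that part becomes faster
        (2.0 means its cost is halved).
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    return (1.0 - affected_fraction) + affected_fraction / component_speedup


if __name__ == "__main__":
    # Halving the cost of one component (speedup of 2) yields very
    # different net gains depending on how much of the job it covers.
    for share in (0.2, 0.5, 1.0):
        remaining = net_runtime_fraction(share, 2.0)
        print(f"component share {share:.0%}: "
              f"net runtime reduced by {1.0 - remaining:.0%}")
```

When the improved component accounts for only a fifth of the runtime, halving its cost trims the job by about a tenth, which is why measuring where time is actually spent matters before tuning.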

Reusing pilot machines

With a pilot cluster in place, start analyzing workload patterns to identify CPU and I/O bottlenecks. It is common to have heterogeneous Hadoop clusters, especially as they grow in size. Starting with a set of machines that is not perfect for your workload is therefore not wasted investment, because these machines can be reused in the production clusters.

Tip

To achieve a positive return on investment (ROI), ensure that the machines in your pilot cluster make up less than 10% of the machines in your eventual production cluster.