2. Typical Workload Patterns For Hadoop

Disk space, I/O bandwidth (required by Hadoop), and computational power (required for MapReduce and other CPU-intensive processes) are the most important parameters for accurate hardware sizing. If you are installing HBase, you also need to analyze your application and its memory requirements, because HBase is a memory-intensive component.
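
The relationship between data volume, replication, and temporary space is easier to see with a short calculation. The Python sketch below is illustrative only: the replication factor of 3 matches the HDFS default, while the 25 percent allowance for intermediate and temporary data and the growth parameters are assumptions that you should replace with figures from your own environment.

    # Illustrative sizing sketch: raw HDFS capacity needed for a given amount
    # of user data. Replication factor 3 is the HDFS default; the 25% overhead
    # for intermediate/temporary data and the growth figures are assumptions.
    def raw_capacity_tb(user_data_tb, replication=3, temp_overhead=0.25,
                        growth_per_year=0.0, years=0):
        """Return the raw disk capacity (in TB) to provision across the cluster."""
        projected = user_data_tb * (1 + growth_per_year) ** years
        return projected * replication * (1 + temp_overhead)

    # Example: 100 TB of data today, growing 50% per year, planned two years out.
    print(round(raw_capacity_tb(100, growth_per_year=0.5, years=2), 1))  # ~843.8 TB raw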

The following workload patterns are commonly observed in HDP production environments:

Balanced Workload

If your workload is distributed equally across the various job types (CPU-bound, disk-I/O-bound, or network-I/O-bound), your cluster has a balanced workload pattern.

Compute-Intensive

Compute-intensive workloads are CPU-bound and are characterized by the need for a large number of CPUs and large amounts of memory to store in-process data. (This usage pattern is typical of natural language processing and high-performance computing cluster (HPCC) workloads.)
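
One way to reason about compute-intensive sizing is to work out how many containers a candidate node could run at once, since either memory or cores will be the limiting factor. The sketch below is a rough illustration: the 16 GB reserved for the operating system and Hadoop daemons and the per-container requests are assumed values, not recommendations.

    # Illustrative sketch: concurrent containers a node can host, limited by
    # whichever of memory or vcores runs out first. The reserved memory and
    # per-container sizes are assumed values; adjust them for your deployment.
    def containers_per_node(node_memory_gb, node_vcores,
                            reserved_memory_gb=16,
                            container_memory_gb=8, container_vcores=2):
        by_memory = (node_memory_gb - reserved_memory_gb) // container_memory_gb
        by_cpu = node_vcores // container_vcores
        return int(min(by_memory, by_cpu))

    # Example: a 256 GB, 32-vcore node running 8 GB / 2-vcore containers.
    print(containers_per_node(256, 32))  # 16 -- vcores, not memory, are the limit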

I/O Intensive

A typical MapReduce job (such as sorting) requires very little compute power; instead, it relies on the I/O capacity of the cluster (for example, when you have a large amount of cold data). For this type of workload, Hortonworks recommends investing in more disks per server.
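
To see why more spindles per server help, compare the aggregate sequential read bandwidth that two disk layouts can deliver. The per-disk figure of roughly 100 MB/s below is an assumption for a 7,200 RPM SATA drive; measure your own hardware before sizing.

    # Illustrative sketch: aggregate sequential read bandwidth of a cluster,
    # assuming roughly 100 MB/s per spinning disk (an assumed figure; measure
    # your own drives).
    def aggregate_bandwidth_gb_s(nodes, disks_per_node, mb_per_disk=100):
        return nodes * disks_per_node * mb_per_disk / 1024

    # Example: the same 40-node cluster with 6 versus 12 data disks per node.
    print(round(aggregate_bandwidth_gb_s(40, 6), 1))   # ~23.4 GB/s
    print(round(aggregate_bandwidth_gb_s(40, 12), 1))  # ~46.9 GB/s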

Unknown or Evolving Workload Patterns

You may not know your eventual workload patterns at deployment time, and the first jobs submitted to Hadoop are usually very different from the jobs you will eventually run in production. For these reasons, Hortonworks recommends that you either use the balanced workload configuration or invest in a pilot Hadoop cluster and plan to evolve its structure as you analyze the workload patterns in your environment. For more information, see Early Hadoop Deployments.