Chapter 2. HBase Cluster Capacity and Region Sizing

This section describes how to plan the capacity of an HBase cluster and the size of its RegionServers.

The following table provides information about HBase concepts:

HBase Concept

Description

Region

A group of contiguous HBase table rows. A table starts with one region, and additional regions are added dynamically as the table grows. Regions can be spread across multiple hosts to provide load balancing and quick recovery from failure. There are two types of regions: primary and secondary. A secondary region is a replica of a primary region, located on a different RegionServer.

RegionServer

Serves data requests for one or more regions. A single region is serviced by only one RegionServer, but a RegionServer may serve multiple regions.

Column family

A group of semantically related columns stored together.

Memstore

In-memory storage for a RegionServer. RegionServers write files to HDFS after the memstore reaches a configurable maximum size, specified with the hbase.hregion.memstore.flush.size property in the hbase-site.xml configuration file.

Write Ahead Log (WAL)

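Log in which operations are recorded before they are stored in the memstore. The WAL is persisted to HDFS rather than held only in memory, so operations that have not yet been flushed from the memstore can be replayed after a RegionServer failure.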

Compaction storm

When the operations stored in the memstore are flushed to disk, HBase consolidates and merges many smaller files into fewer large files. This consolidation is called compaction, and it is usually very fast. However, if many RegionServers reach the memstore flush size (hbase.hregion.memstore.flush.size) at the same time, the large number of simultaneous major compactions can degrade HBase performance. This situation is called a compaction storm. Administrators can avoid it by manually splitting tables over time.
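
The relationship between the WAL, the memstore, flushes, and compaction can be illustrated with a small toy model. The Python sketch below is not HBase code and does not use any HBase API. The class ToyRegion, its fields, and the byte-counting rule are assumptions made only for illustration, and the flush_size argument merely stands in for hbase.hregion.memstore.flush.size. It shows the order of operations described in the table: record the write in the log, apply it to the memstore, flush a sorted file once the size limit is reached, and later merge the flushed files.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegion:
    """Toy model of one region's write path (illustrative only, not HBase code)."""

    flush_size: int  # stands in for hbase.hregion.memstore.flush.size, in bytes
    wal: list = field(default_factory=list)          # append-only log of (row, value)
    memstore: dict = field(default_factory=dict)     # row -> latest value, in memory
    store_files: list = field(default_factory=list)  # each file: row-sorted (row, value) list

    def memstore_bytes(self) -> int:
        # Assumption: approximate the memstore size as total key plus value length.
        return sum(len(row) + len(value) for row, value in self.memstore.items())

    def put(self, row: str, value: str) -> None:
        self.wal.append((row, value))  # 1. record the operation in the log first
        self.memstore[row] = value     # 2. then apply it to the memstore
        if self.memstore_bytes() >= self.flush_size:
            self.flush()               # 3. flush once the size limit is reached

    def flush(self) -> None:
        # Write the memstore out as one immutable, row-sorted file, then clear it.
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore.clear()

    def compact(self) -> None:
        # Merge all flushed files into one. Files are visited oldest first,
        # so the newest value wins when a row appears in several files.
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        self.store_files = [sorted(merged.items())] if merged else []


# Usage: a deliberately tiny flush size so that flushes happen after a few writes.
region = ToyRegion(flush_size=16)
for i in range(6):
    region.put(f"row{i % 4}", f"v{i}")

files_before_compaction = len(region.store_files)
region.compact()
files_after_compaction = len(region.store_files)
```

In this model every flush adds one more file, and compaction is the only step that reduces the file count. If many regions cross the flush threshold at about the same time, the merge work arrives at about the same time too, which is the pattern the compaction storm entry describes.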