Considerations for deploying erasure coding

When deploying erasure coding in your HDFS clusters, you must consider factors such as network bisection bandwidth and fault-tolerance at the rack level.

Erasure coding places additional demands on the cluster's CPU and network resources. Encoding and decoding consume additional CPU on both HDFS clients and DataNodes.

Erasure-coded files are spread across racks for fault-tolerance, so most read and write operations on striped files are off-rack. Network bisection bandwidth is therefore very important.
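
As a rough illustration of why most traffic is off-rack, the following sketch assumes that each block of an RS (6,3) block group lands on a distinct rack, so at most one block can be local to the client's rack. The function name and the one-block-per-rack placement are assumptions made for this example; actual placement depends on the cluster topology.

```python
from fractions import Fraction


def min_off_rack_fraction(blocks_touched: int) -> Fraction:
    """Lower bound on the fraction of blocks that cross racks.

    Assumes one block per rack, so at most one of the blocks touched
    can sit on the same rack as the client (hypothetical model).
    """
    if blocks_touched < 1:
        raise ValueError("blocks_touched must be at least 1")
    return Fraction(blocks_touched - 1, blocks_touched)


# RS (6,3): a write touches all 9 blocks (6 data + 3 parity);
# a read of a healthy file touches only the 6 data blocks.
write_fraction = min_off_rack_fraction(6 + 3)
read_fraction = min_off_rack_fraction(6)

print(f"write: at least {write_fraction} of blocks are off-rack")
print(f"read: at least {read_fraction} of blocks are off-rack")
```

Under these assumptions, at least 8 of 9 blocks written and at least 5 of 6 blocks read cross a rack boundary, which is why bisection bandwidth matters more here than for replicated files read from a local replica.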

For fault-tolerance at the rack level, it is also important to have at least as many racks as the configured EC stripe width, that is, the number of data blocks plus parity blocks. For the default EC policy of RS (6,3), the stripe width is 6 + 3 = 9, so you need at least 9 racks, and ideally 10 or 11 to handle planned and unplanned outages. For clusters with fewer racks than the stripe width, HDFS cannot maintain fault-tolerance at the rack level, but still attempts to spread a striped file across multiple nodes to preserve fault-tolerance at the node level.
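
The rack-count arithmetic can be sketched as follows. This is a minimal planning aid, not part of HDFS: the function names are hypothetical, and the `spare_racks` parameter is an assumption that models the one or two extra racks suggested for outages.

```python
def rack_requirements(data_units: int, parity_units: int, spare_racks: int = 2) -> dict:
    """Return rack counts for a Reed-Solomon policy RS(data_units, parity_units).

    spare_racks is a hypothetical planning margin for planned and
    unplanned outages; it is not an HDFS setting.
    """
    stripe_width = data_units + parity_units
    return {
        "stripe_width": stripe_width,
        "min_racks": stripe_width,  # one rack per block in the stripe
        "recommended_racks": stripe_width + spare_racks,
    }


def is_rack_fault_tolerant(num_racks: int, data_units: int, parity_units: int) -> bool:
    """True when the cluster has at least as many racks as the stripe width."""
    return num_racks >= data_units + parity_units


# Default policy RS (6,3)
print(rack_requirements(6, 3))
print(is_rack_fault_tolerant(9, 6, 3))  # enough racks for rack-level tolerance
print(is_rack_fault_tolerant(5, 6, 3))  # only node-level tolerance is possible
```

For RS (6,3) this gives a stripe width of 9, a minimum of 9 racks, and 11 racks with the default margin of 2; passing `spare_racks=1` gives 10.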