Chapter 1. Configuring HDFS Compression

This document is intended for system administrators who need to configure HDFS compression on the Windows platform.

On Windows, the supported codecs are GzipCodec, DefaultCodec, and BZip2Codec. GzipCodec is the one most commonly used for HDFS compression.

Ensure that zlib1.dll is installed in the %HADOOP_HOME%\bin directory on all the nodes of the cluster.
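
As a quick sanity check, the presence of the DLL can be verified on each node with a small script. The following is a minimal Python sketch and is not part of the Hadoop distribution; the helper name has_zlib is hypothetical, and the script assumes that the HADOOP_HOME environment variable is set on the node.

```python
import os
from pathlib import Path


def has_zlib(hadoop_home):
    """Return True if zlib1.dll exists in the bin directory under hadoop_home."""
    return (Path(hadoop_home) / "bin" / "zlib1.dll").is_file()


if __name__ == "__main__":
    # Assumption: HADOOP_HOME points at the Hadoop installation on this node.
    home = os.environ.get("HADOOP_HOME")
    if home is None:
        print("HADOOP_HOME is not set")
    elif has_zlib(home):
        print("zlib1.dll found")
    else:
        print("zlib1.dll is missing from", Path(home) / "bin")
```

Run the script on every node of the cluster, since the check only covers the machine it runs on.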

Use one of the following options to enable GzipCodec:

  • Option I: To use GzipCodec for a single job:

    1. On the NameNode host machine, execute the following command as the hdfs user:

      hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar sort "-Dmapred.compress.map.output=true" "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" "-Dmapred.output.compress=true" "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output

  • Option II: To enable GzipCodec as the default compression:

    1. Edit the core-site.xml file on the NameNode host machine:

      <property>
        <name>io.compression.codecs</name>
        <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
        <description>A list of the compression codec classes that can be used for compression/decompression.</description>
      </property>
      
    2. Edit the mapred-site.xml file on the JobTracker host machine:

      <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
      </property>

      <property>
        <name>mapred.map.output.compression.codec</name>
        <value>org.apache.hadoop.io.compress.GzipCodec</value>
      </property>

      <property>
        <name>mapred.output.compression.type</name>
        <value>BLOCK</value>
      </property>

    3. [Optional] Set the following two configuration parameters to enable job output compression.

      Edit the mapred-site.xml file on the JobTracker host machine:

      <property>
        <name>mapred.output.compress</name>
        <value>true</value>
      </property>

      <property>
        <name>mapred.output.compression.codec</name>
        <value>org.apache.hadoop.io.compress.GzipCodec</value>
      </property>

    4. Restart the cluster using the instructions provided here.
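
Before restarting, it can help to confirm that the edited files contain the values from Option II. The following is a minimal Python sketch under stated assumptions: the helper names read_properties and check_gzip_defaults are hypothetical, the script only reads the XML text it is given, and it checks only the properties named in steps 1 and 2 above. It is not a Hadoop tool and does not replace testing a real job.

```python
import xml.etree.ElementTree as ET

GZIP = "org.apache.hadoop.io.compress.GzipCodec"


def read_properties(xml_text):
    """Parse Hadoop configuration XML text into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def check_gzip_defaults(core_props, mapred_props):
    """Return a list of problems; an empty list means the settings match."""
    problems = []
    # io.compression.codecs is a comma-separated list of codec class names.
    codecs = [c.strip() for c in core_props.get("io.compression.codecs", "").split(",")]
    if GZIP not in codecs:
        problems.append("io.compression.codecs does not list GzipCodec")
    expected = {
        "mapred.compress.map.output": "true",
        "mapred.map.output.compression.codec": GZIP,
    }
    for name, want in expected.items():
        got = mapred_props.get(name)
        if got != want:
            problems.append(f"{name} is {got!r}, expected {want!r}")
    return problems
```

To use it, read the contents of core-site.xml and mapred-site.xml into strings, pass each through read_properties, and hand both dictionaries to check_gzip_defaults; any returned message names the property that differs from the steps above.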