Chapter 17. Configuring HDFS Compression

This document is intended for system administrators who need to configure HDFS compression on Linux.

Linux supports GzipCodec, DefaultCodec, BZip2Codec, LzoCodec, and SnappyCodec. Typically, GzipCodec is used for HDFS compression.

Use the following instructions to use GzipCodec:

  • Option I: To use GzipCodec for a one-time job:

    1. On the NameNode host machine, execute the following command as the hdfs user:

      hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar sort \
        "-Dmapred.compress.map.output=true" \
        "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
        "-Dmapred.output.compress=true" \
        "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
        -outKey org.apache.hadoop.io.Text \
        -outValue org.apache.hadoop.io.Text \
        input output

  • Option II: To enable GzipCodec as the default compression:

    1. Edit the core-site.xml file on the NameNode host machine:

        <property>
          <name>io.compression.codecs</name>
          <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
          <description>A list of the compression codec classes that can be used
                       for compression/decompression.</description>
        </property>

    2. Edit the mapred-site.xml file on the JobTracker host machine:

      <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
      </property>

      <property>
        <name>mapred.map.output.compression.codec</name>
        <value>org.apache.hadoop.io.compress.GzipCodec</value>
      </property>

      <property>
        <name>mapred.output.compression.type</name>
        <value>BLOCK</value>
      </property>

    3. [Optional] Set the following two configuration parameters to enable job output compression.

      Edit the mapred-site.xml file on the JobTracker host machine:

      <property>
        <name>mapred.output.compress</name>
        <value>true</value>
      </property>

      <property>
        <name>mapred.output.compression.codec</name>
        <value>org.apache.hadoop.io.compress.GzipCodec</value>
      </property>

    4. Restart the cluster using the instructions provided on this page.
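
After editing core-site.xml and mapred-site.xml as described in Option II, it can be useful to confirm that the values were saved as intended before restarting. The following is a minimal sketch, not part of the Hadoop tooling: get_property and has_codec are hypothetical helper names, and the sketch assumes that each <name> and <value> element sits on a single line and that the codec list contains no spaces, as in the snippets above.

```shell
# get_property FILE NAME
# Prints the <value> of the first property in FILE whose <name> equals NAME.
# Assumption: <name> and <value> elements are each on one line.
get_property() {
    awk -v name="$2" '
        /<name>/ {
            n = $0
            sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n)
            found = (n == name)
        }
        found && /<value>/ {
            v = $0
            sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
            print v
            exit
        }
    ' "$1"
}

# has_codec FILE CLASS
# Succeeds if CLASS is one of the comma-separated entries
# in the io.compression.codecs property of FILE.
has_codec() {
    get_property "$1" io.compression.codecs | tr ',' '\n' | grep -qxF "$2"
}

# Example usage (CONF_DIR is a hypothetical path; adjust for your installation):
#   CONF_DIR=/etc/hadoop/conf
#   has_codec "$CONF_DIR/core-site.xml" org.apache.hadoop.io.compress.GzipCodec \
#     && echo "GzipCodec is registered"
#   get_property "$CONF_DIR/mapred-site.xml" mapred.map.output.compression.codec
```

The helpers only read the files, so they can be run on each host before the restart in step 4. A full XML parser would be more robust if the configuration files are formatted differently.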

