This is something that always raises doubts:
When considering compressed data that will be processed by MapReduce, it is important to check whether the compression format supports splitting. If it does not, the number of map tasks may not be what you expect.
Let’s suppose an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input will create 16 input splits, each processed independently as input to a separate map task.
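The arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming one input split per HDFS block; the function name `estimate_splits` is made up for this example, and it ignores Hadoop's configurable minimum and maximum split sizes.

```python
MB = 1024 * 1024
GB = 1024 * MB

def estimate_splits(file_size: int, block_size: int) -> int:
    """Estimate input splits for one splittable file (hypothetical helper).

    Assumes one split per HDFS block; real jobs may differ depending on
    split-size settings.
    """
    if file_size <= 0:
        return 0
    # Integer ceiling division: a partially filled last block still needs a split.
    return -(-file_size // block_size)

# The example from the text: a 1 GB file with a 64 MB block size.
print(estimate_splits(1 * GB, 64 * MB))
```

With a larger block size the same file yields fewer, bigger splits, so the block size directly bounds how many map tasks can work on the file at once.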
Now suppose the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS will store the file as 16 blocks. However, creating a split for each block will not work, since it is impossible to start reading at an arbitrary point in the gzip stream, and therefore impossible for a map task to read its split independently of the others.
In this case, MapReduce will not try to split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting.
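The decision described above, looking at the filename extension and then checking whether that format can be split, might be sketched like this. It is not Hadoop's code (in Hadoop the lookup is done by classes such as `CompressionCodecFactory`); the table `CODECS_BY_EXTENSION` and the helpers `is_splittable` and `planned_map_tasks` are assumptions made up for this example, and the table lists only a few formats.

```python
import os

# Hypothetical lookup table: filename extension -> (format name, supports splitting?)
CODECS_BY_EXTENSION = {
    ".gz": ("gzip", False),
    ".deflate": ("DEFLATE", False),
    ".bz2": ("bzip2", True),
}

def is_splittable(filename: str) -> bool:
    """Return True if the file can be divided into several input splits."""
    extension = os.path.splitext(filename)[1]
    codec = CODECS_BY_EXTENSION.get(extension)
    if codec is None:
        # No known compression extension: treat the file as uncompressed.
        return True
    return codec[1]

def planned_map_tasks(filename: str, hdfs_blocks: int) -> int:
    """One map task per block if splittable, otherwise one task for the whole file."""
    return hdfs_blocks if is_splittable(filename) else 1

for name in ("logs.txt", "logs.txt.gz", "logs.txt.bz2"):
    print(name, planned_map_tasks(name, hdfs_blocks=16))
```

Because the check is driven by the extension, a gzip file saved without the `.gz` suffix would be treated as plain text in this sketch, which is one reason to keep filenames and formats consistent.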
In this scenario a single map task will process all 16 HDFS blocks, most of which will not be local to it, so the job also pays a data locality cost.
The job will not parallelize as expected: with one coarse-grained map task instead of 16, it may take longer to run.
The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block, thereby synchronizing itself with the stream. For this reason, gzip does not support splitting.
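A small experiment with Python's standard `zlib` module illustrates the point. It is only a sketch: the helpers `raw_deflate` and `inflate_from` are made up for this example, and the sample data is arbitrary. The code compresses some records into a raw DEFLATE stream, then tries to decompress it once from the beginning and once from the middle.

```python
import zlib

def raw_deflate(data: bytes) -> bytes:
    # A negative wbits value asks zlib for a raw DEFLATE stream (no header or trailer).
    compressor = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()

def inflate_from(stream: bytes, offset: int):
    """Try to decompress starting at byte `offset`.

    Returns the decoded bytes, or None if zlib rejects the data.
    """
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    try:
        return decompressor.decompress(stream[offset:])
    except zlib.error:
        return None

original = b"".join(b"record %06d, some payload text\n" % i for i in range(20000))
stream = raw_deflate(original)

from_start = inflate_from(stream, 0)
from_middle = inflate_from(stream, len(stream) // 2)

print("from start recovers the data: ", from_start == original)
print("from middle recovers the data:", from_middle == original)
```

Reading from the start recovers everything. A reader dropped in the middle either gets an error or decodes bytes that do not correspond to the original records, and it has no marker to scan for to find the next block boundary. This is the situation a map task would be in if it were handed a split that starts halfway through a gzip file.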
Here we have a summary of compression formats:
(a) DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available command-line tool for producing files in DEFLATE format, as gzip is normally used. (Note that the gzip file format is DEFLATE with extra headers and a footer.) The .deflate filename extension is a Hadoop convention.
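The relationship between the two formats can be checked with Python's standard library. This is a minimal sketch, assuming the gzip header carries no optional fields (so it is exactly 10 bytes long), which the code verifies before slicing; the variable names are illustrative only.

```python
import gzip
import struct
import zlib

data = b"hello, hadoop\n" * 1000
blob = gzip.compress(data)

magic, method, flags = blob[:2], blob[2], blob[3]
if flags != 0:
    # Optional header fields (such as a filename) would make the header longer.
    raise ValueError("this sketch assumes a header with no optional fields")

# gzip layout under that assumption: 10-byte header, DEFLATE body, 8-byte footer.
header, body, footer = blob[:10], blob[10:-8], blob[-8:]

# The body is a raw DEFLATE stream; negative wbits tells zlib to expect no wrapper.
payload = zlib.decompress(body, -zlib.MAX_WBITS)

# The footer holds the CRC-32 and the length of the uncompressed data.
crc32, size = struct.unpack("<II", footer)

print("gzip magic and DEFLATE method:", magic == b"\x1f\x8b", method == 8)
print("body inflates to the original:", payload == data)
print("footer matches the data:      ", crc32 == zlib.crc32(data), size == len(data))
```

The header and footer wrap the stream once at each end, so they do not help a reader that starts somewhere in the middle of the body.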
Source: Hadoop: The Definitive Guide.