When you put a file into HDFS, it is divided into blocks of 128 MB (the default block size for HDFS on EMR). Consider a file big enough to occupy 10 blocks. When you read that file from HDFS as input for a MapReduce job, the same blocks are usually mapped, one by one, to splits. In this case, the file is divided into 10 splits (which means 10 map tasks) for processing. By default, the block size and the split size are equal, but the sizes depend on the configuration settings for the InputSplit class.
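If you need the split size to differ from the block size, the driver can set explicit bounds. The following is a minimal sketch assuming the newer org.apache.hadoop.mapreduce API; the class name, input path argument, and sizes are only illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("split-size-demo");

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // By default a split matches an HDFS block (128 MB here), so a
        // 10-block file yields 10 splits and therefore 10 map tasks.
        // These settings override the split size independently of the
        // block size (values here are illustrative).
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}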
From a Java programming perspective, the class that holds the responsibility for this conversion is called an InputFormat, which is the main entry point into reading data from HDFS. From the blocks of the files, it creates a list of InputSplits. For each split, one mapper is created. Then each InputSplit is divided into records by using the RecordReader class. Each record represents a key-value pair.
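To make that contract concrete, here is a minimal mapper sketch assuming the default TextInputFormat, whose RecordReader emits one record per line: the key is the byte offset of the line within the file and the value is the line itself. The mapper class name and its output logic are only illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, each record handed to map() is one line of input:
// key = byte offset of the line in the file, value = the line's text.
public class LineLengthMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Illustrative logic: emit each line keyed by its text, with its length.
        context.write(line, new LongWritable(line.getLength()));
    }
}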
FileInputFormat vs. CombineFileInputFormat
Before a MapReduce job is run, you can specify the InputFormat class to be used. The implementation of FileInputFormat requires you to create an instance of the RecordReader, and as mentioned previously, the RecordReader creates the key-value pairs for the mappers.
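As a rough sketch of what such an implementation looks like, a FileInputFormat subclass only has to supply the RecordReader. The class name below is hypothetical, and it simply reuses Hadoop's LineRecordReader, which is essentially what TextInputFormat does.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical subclass: FileInputFormat already knows how to turn the
// input files into splits; the subclass only defines how a split becomes
// key-value pairs by providing a RecordReader.
public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Reads the split line by line: key = byte offset, value = line text.
        return new LineRecordReader();
    }
}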
FileInputFormat is an abstract class that is the basis for a majority of the implementations of InputFormat. It contains the location of the input files and an implementation of how splits must be produced from these files. How the splits are converted into key-value pairs is defined in the subclasses. Some examples of its subclasses are TextInputFormat, KeyValueTextInputFormat, and CombineFileInputFormat.
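For example, KeyValueTextInputFormat splits each line into a key and a value on a separator character (a tab by default). The following is a minimal driver sketch assuming the Hadoop 2+ property name and comma-separated input; the driver class name is only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Split each line into key and value on the first ',' instead of
        // the default tab (property name as used in Hadoop 2 and later).
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "keyvalue-demo");
        // Tell the job which InputFormat to use before it runs.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}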
Hadoop works more efficiently with large files (files that occupy more than one block). FileInputFormat converts each large file into splits, and each split contains part of a single file. As mentioned, one mapper is generated for each split.

However, when the input files are smaller than the default block size, many splits (and therefore, many mappers) are created. This arrangement makes the job inefficient. The following figure shows how too many mappers are created when FileInputFormat is used for many small files.

To avoid this situation, CombineFileInputFormat was introduced. This InputFormat works well with small files, because it packs many of them into one split so there are fewer mappers, and each mapper has more data to process. The following figure shows how CombineFileInputFormat treats the small files so that fewer mappers are created.
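A minimal driver sketch of that approach, assuming the text-oriented subclass CombineTextInputFormat; the driver class name and the chosen split size are only illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("combine-small-files");

        // CombineTextInputFormat is the line-oriented subclass of
        // CombineFileInputFormat; it packs many small files into each split.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // Cap each combined split at roughly one HDFS block (128 MB) so a
        // directory of small files produces only a handful of mappers.
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
    }
}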
