The efficiency of the Map phase is determined largely by the characteristics of the job inputs. We saw that too many small files produce a large number of splits, and therefore a proliferation of Map tasks. Another important statistic to note is the average runtime of a Map task. Both too many and too few Map tasks are detrimental to job performance: too many tasks add per-task startup and scheduling overhead, while too few limit parallelism. Striking a balance between the two is important, and the right balance depends on the nature of the application and data.
A rule of thumb, based on empirical evidence, is to keep the runtime of a single Map task between roughly one and three minutes.
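To make the effect of file sizes on the Map task count concrete, the following is a minimal sketch rather than Hadoop's actual split computation. It assumes one split (and so one Map task) per block-sized chunk of each non-empty file, and ignores minimum/maximum split size settings and other details of the input format. The helper names `estimate_map_tasks` and `in_target_range`, and the 128 MB block size, are assumptions chosen for illustration.

```python
MB = 1024 * 1024


def estimate_map_tasks(file_sizes, block_size):
    """Approximate the number of splits, assuming one split per
    block-sized chunk of every non-empty file (a simplification)."""
    # Integer ceiling division avoids floating-point rounding issues.
    return sum((size + block_size - 1) // block_size
               for size in file_sizes if size > 0)


def in_target_range(avg_runtime_s, low_s=60, high_s=180):
    """Check an average Map task runtime (in seconds) against the
    one-to-three-minute rule of thumb."""
    return low_s <= avg_runtime_s <= high_s


block = 128 * MB                   # assumed block size, for illustration only
small_files = [1 * MB] * 10_000    # 10,000 files of 1 MB each
large_files = [1000 * MB] * 10     # the same total volume in 10 files

print(estimate_map_tasks(small_files, block))  # 10000
print(estimate_map_tasks(large_files, block))  # 80
print(in_target_range(90))                     # True
```

Under these assumptions, the same volume of data yields 10,000 Map tasks when stored as small files but only 80 when stored as a few large files, which is why the average task runtime is worth checking against the rule of thumb.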
The default block size of files in a cluster can be overridden in the cluster configuration file,
hdfs-site.xml, generally present ...