The Map task

The efficiency of the Map phase is determined by the characteristics of the job inputs. As we saw, having too many small files produces a large number of splits, and therefore a proliferation of Map tasks. Another important statistic to watch is the average runtime of a Map task. Both too many and too few Map tasks are detrimental to job performance. Striking the right balance is important, and it depends heavily on the nature of the application and its data.
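To make the effect of small files concrete, the following is a minimal sketch in Python, not Hadoop's actual split computation. It assumes one split per block, that a split never spans two files, and a block size of 128 MiB. The function name estimate_map_tasks and the sample file sizes are hypothetical, chosen only for illustration.

```python
MIB = 1024 * 1024


def estimate_map_tasks(file_sizes, block_size=128 * MIB):
    """Estimate the number of Map tasks for a list of file sizes in bytes.

    Simplifying assumptions (not Hadoop's exact logic): one split per block,
    one Map task per split, and no split spanning more than one file.
    """
    # Integer ceiling division: a partially filled block still needs a split.
    return sum(-(-size // block_size) for size in file_sizes if size > 0)


# The same 10 GiB of data, stored in two different ways.
one_large_file = [10 * 1024 * MIB]       # a single 10 GiB file
many_small_files = [1 * MIB] * 10240     # 10,240 files of 1 MiB each

print(estimate_map_tasks(one_large_file))    # 80
print(estimate_map_tasks(many_small_files))  # 10240
```

Under these assumptions, the same volume of data goes from 80 Map tasks to 10,240 when it is stored as small files, because every file contributes at least one split regardless of its size.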

Tip

A rule of thumb, based on empirical evidence, is to aim for a single Map task runtime of roughly one to three minutes.
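The rule of thumb can be checked against job statistics with a small helper such as the one sketched below in Python. The function name classify_map_runtime and the input figures are hypothetical; the total Map time and the task count are assumed to come from whatever job counters or history data are available to you.

```python
def classify_map_runtime(total_map_seconds, num_map_tasks, low=60.0, high=180.0):
    """Return (average runtime in seconds, verdict) for a job's Map tasks.

    The default bounds encode the one-to-three-minute rule of thumb;
    they are guidelines, not hard limits.
    """
    if num_map_tasks <= 0:
        raise ValueError("num_map_tasks must be positive")
    average = total_map_seconds / num_map_tasks
    if average < low:
        verdict = "too short"
    elif average > high:
        verdict = "too long"
    else:
        verdict = "in range"
    return average, verdict


# Hypothetical job: 1,000 Map tasks that took 12,000 seconds in total.
print(classify_map_runtime(12000, 1000))  # (12.0, 'too short')
# Hypothetical job: 100 Map tasks that took 9,000 seconds in total.
print(classify_map_runtime(9000, 100))    # (90.0, 'in range')
```

A "too short" verdict suggests the job may have more Map tasks than it needs, while "too long" suggests it may have too few. How to respond depends on the application and its data.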

The dfs.blocksize attribute

The default block size of files in a cluster is overridden in the cluster configuration file, hdfs-site.xml, generally present ...
