The Reduce task

The Reduce task is an aggregation step. If the number of Reduce tasks is not specified, the default number is one. The risk of running one Reduce task would mean overloading that particular node. Having too many Reduce tasks would mean shuffle complexity and proliferation of output files that puts pressure on the NameNode. It is important to understand the data distribution and the partitioning function to decide the optimal number of Reduce tasks.

Tip

The ideal setting for each Reduce task to process is a range of 1 GB to 5 GB.

The number of Reduce tasks can be set using the mapreduce.job.reduces parameter. It can be programmatically set by calling the setNumReduceTasks() method on the Job object. There is a cap on the number of ...

Get Mastering Hadoop now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.