Summary

In this chapter, you saw optimizations at different stages of the Hadoop MapReduce pipeline. With the join example, you also saw a few of the more advanced features available to MapReduce jobs. Some key takeaways from this chapter are as follows:

  • Avoid creating too many I/O-bound Map tasks; the number of Map tasks is dictated by the inputs.
  • Map tasks are the primary contributors to job speedup because of their parallelism.
  • Combiners not only make data transfer between Map and Reduce tasks more efficient, but also reduce disk I/O on the Map side.
  • By default, a job runs a single Reduce task.
  • Custom partitioners can be used to balance load among the Reducers.
  • DistributedCache is useful for distributing small side files. Too many and too large files in ...
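To make the custom-partitioner takeaway concrete, here is a minimal sketch of the hashing logic behind Hadoop's default HashPartitioner: the key's hash code, forced non-negative, modulo the number of Reduce tasks selects the target Reducer. The class and method names below are illustrative, not from the chapter; a real custom partitioner would extend `org.apache.hadoop.mapreduce.Partitioner` and override `getPartition` with a scheme tuned to the key distribution.

```java
// Sketch of default partitioning logic (hypothetical standalone class,
// no Hadoop dependency). Skewed keys that hash to the same bucket are
// exactly what a custom partitioner is written to rebalance.
public class PartitionSketch {
    // Mirrors the arithmetic of HashPartitioner: masking with
    // Integer.MAX_VALUE clears the sign bit so the modulo is never negative.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every key maps deterministically to one of the 4 Reducers.
        for (String k : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(k + " -> reducer " + getPartition(k, 4));
        }
    }
}
```

Because the mapping is deterministic, all values for a given key land on the same Reducer; a load-balancing partitioner keeps that property while spreading hot keys' buckets more evenly.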
