Summary

In this chapter, you saw optimizations at different stages of the Hadoop MapReduce pipeline. With the join example, you also saw a few of the more advanced features available to MapReduce jobs. Some key takeaways from this chapter are as follows:

  • Avoid creating too many I/O-bound Map tasks; the number of Map tasks is dictated by the inputs.
  • Map tasks are the primary contributors to job speedup because of their parallelism.
  • Combiners not only make data transfer between Map and Reduce tasks more efficient, but also reduce disk I/O on the Map side.
  • By default, a job runs a single Reduce task.
  • Custom partitioners can be used to balance load among the Reducers.
  • DistributedCache is useful for distributing small side files. Too many and too large files in ...
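To make the custom-partitioner takeaway concrete, here is a minimal sketch of the hashing logic behind Hadoop's default HashPartitioner: the key's hash code, forced non-negative, modulo the number of Reduce tasks selects the target Reducer. The class and method names below are illustrative, not from the chapter; a real custom partitioner would extend `org.apache.hadoop.mapreduce.Partitioner` and override `getPartition` with a scheme tuned to the key distribution.

```java
// Sketch of default partitioning logic (hypothetical standalone class,
// no Hadoop dependency). Skewed keys that hash to the same bucket are
// exactly what a custom partitioner is written to rebalance.
public class PartitionSketch {
    // Mirrors the arithmetic of HashPartitioner: masking with
    // Integer.MAX_VALUE clears the sign bit so the modulo is never negative.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every key maps deterministically to one of the 4 Reducers.
        for (String k : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(k + " -> reducer " + getPartition(k, 4));
        }
    }
}
```

Because the mapping is deterministic, all values for a given key land on the same Reducer; a load-balancing partitioner keeps that property while spreading hot keys' buckets more evenly.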
