Chapter 19

The MapReduce Pattern

WHAT’S IN THIS CHAPTER?

  • A simple MapReduce implementation
  • Abstracting the problem

The idea of MapReduce was proposed by Google in 2004. Google has since received a (much criticized) software patent for it, and the technology has been implemented in a number of open source and commercial libraries and products.

MapReduce is a framework that allows parallel execution of the steps required to perform potentially complex computations over possibly very large sets of data. Any number of physical or logical nodes can be involved in the process, and the approach can also be used for parallelization on just one physical machine. Keep the large amounts of data in mind, though: the algorithms implemented with MapReduce are usually not very complex, so the overhead of the implementation pattern and the backend infrastructure pays off only if the benefit of parallelization comes largely from the data volume.

The basic concept is simple to describe, and Google uses the canonical example of counting word occurrences in text. There are two steps for this, Map and Reduce:

1. The Map step splits the text into a list of words. MapReduce generally works with key/value pairs, so examples usually emit a pair for each word, consisting of the word itself as the key and a 1 (one) as the value.

2. For each unique key in the list of pairs, the Reduce step is executed, given the key (in the word counting example, that’s the word itself) and a list of values (again, in the example, that’s a list ...
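
The two steps can be sketched in a few lines of code. The following Python example is only a minimal illustration under stated assumptions, not the implementation developed in this chapter: the function names `map_words`, `group_by_key`, `reduce_counts`, and `word_count` are hypothetical, the reducer is assumed to sum the values it receives, words are lowercased before counting, and a thread pool merely stands in for the nodes that a real MapReduce backend would provide.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(text):
    """Map step: split the text and emit one (word, 1) pair per word."""
    return [(word.lower(), 1) for word in text.split()]


def group_by_key(pairs):
    """Collect the values of all pairs that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, values):
    """Reduce step: called once per unique key with that key's values.

    Summing the values is an assumption made for this sketch.
    """
    return word, sum(values)


def word_count(chunks, max_workers=4):
    # Each chunk is mapped independently of the others, so the calls can be
    # handed to separate workers. The thread pool is a stand-in for the
    # nodes of a real cluster, not a claim about actual speedup.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        mapped = pool.map(map_words, chunks)
        pairs = [pair for chunk_pairs in mapped for pair in chunk_pairs]
    groups = group_by_key(pairs)
    return dict(reduce_counts(word, values) for word, values in groups.items())


chunks = ["the quick brown fox", "jumps over the lazy dog", "The end"]
counts = word_count(chunks)
print(counts["the"])  # 3
```

Note that the grouping of pairs by key sits between the two steps. In this sketch it is an ordinary function, whereas a MapReduce infrastructure would typically take care of it, which is why only Map and Reduce have to be supplied by the caller.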
