Implementing MapReduce to count word frequencies

MapReduce is a framework for efficient parallel algorithms that take advantage of divide and conquer. If a task can be split into smaller tasks, and the results of each individual task can be combined to form the final answer, then MapReduce is likely the best framework for this job.

In the following figure, we can see that a large list is split up, and the mapper functions work in parallel on each split. After all the mapping is complete, the second phase of the framework kicks in, reducing the various calculations into one final answer.

In this recipe, we will be counting word frequencies in a large corpus of text. Given many files of words, we will apply the MapReduce framework to find the word ...

Get Haskell Data Analysis Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.