Chapter 14. Data Engineering: MapReduce, Pregel, and Hadoop

We have two contributors to this chapter, David Crawshaw and Josh Wills. Rachel worked with both of them at Google on the Google+ data science team, though the two never actually worked together: Josh Wills left for Cloudera, and David Crawshaw replaced him as tech lead. We can call them “data engineers,” although that term may be as overloaded and ambiguous as “data scientist.” Suffice it to say that both have worked as software engineers and dealt with massive amounts of data. If we look at the data science process from Chapter 2, Josh and David were responsible at Google for collecting data (frontend and backend logging), building the massive data pipelines to store and munge the data, and building up the engineering infrastructure to support analysis, dashboards, analytics, A/B testing, and, more broadly, data science.

In this chapter we’ll hear firsthand from Google engineers about MapReduce, which was developed at Google and later reimplemented elsewhere as open source. MapReduce is an algorithm and framework for dealing with massive amounts of data that has recently become popular in industry. The goal of this chapter is to clear up some of the mystery surrounding MapReduce. It has become such a buzzword that many data scientist job openings say “must know Hadoop” (an open source implementation of MapReduce). We suspect ...
