Doing Data Science by Cathy O'Neil, Rachel Schutt

Chapter 14. Data Engineering: MapReduce, Pregel, and Hadoop

We have two contributors to this chapter, David Crawshaw and Josh Wills. Rachel worked with both of them at Google on the Google+ data science team, though the two never actually worked together: Josh Wills left for Cloudera, and David Crawshaw replaced him as tech lead. We can call them “data engineers,” although that term may be as overloaded and ambiguous as “data scientist.” Suffice it to say that both have worked as software engineers and dealt with massive amounts of data. In terms of the data science process from Chapter 2, Josh and David were responsible at Google for collecting data (frontend and backend logging), building the massive data pipelines that store and munge the data, and building the engineering infrastructure that supports analysis, dashboards, analytics, A/B testing, and, more broadly, data science.

In this chapter we’ll hear firsthand from Google engineers about MapReduce, which was developed at Google and later reimplemented elsewhere as open source. MapReduce is an algorithm and framework for dealing with massive amounts of data that has recently become popular in industry. The goal of this chapter is to clear up some of the mystery surrounding MapReduce. It has become such a buzzword that many data scientist job openings say “must know Hadoop” (the open source implementation of MapReduce). We suspect ...
