MapReduce Design Patterns

Preface

Welcome to MapReduce Design Patterns! This book will be unique in some ways and familiar in others. First and foremost, this book is about design patterns, which are templates or general guides to solving problems. We looked to earlier design patterns books for inspiration, particularly Design Patterns: Elements of Reusable Object-Oriented Software, by Gamma et al. (1995), commonly referred to as the “Gang of Four” book. For each pattern, you’ll see a template that we reuse throughout and that is loosely based on theirs. Seeing the same template repeatedly will help you get to the specific information you need, which is especially useful when you later use this book as a reference.

This book is a bit more open-ended than a book in the “cookbook” series of texts as we don’t call out specific problems. However, similarly to the cookbooks, the lessons in this book are short and categorized. You’ll have to go a bit further than just copying and pasting our code to solve your problems, but we hope that you will find a pattern to get you at least 90% of the way for just about all of your challenges.

This book is mostly about the analytics side of Hadoop or MapReduce. We intentionally try not to dive into too much detail on how Hadoop or MapReduce works or talk too long about the APIs that we are using. These topics have been written about quite a few times, both online and in print, so we decided to focus on analytics.

In this preface, we’ll talk about how to read this book since its format might be a bit different than most books you’ve read.

Intended Audience

Our motivation for writing this book was to fill a gap we saw among many new MapReduce developers. They had learned how to use the system and had become comfortable writing MapReduce, but lacked the experience to know how to do things right or well. The intent of this book is to spare you some of those mistakes by showing how experts have solved problems with MapReduce. In some ways, then, this book can be viewed as a resource for intermediate or advanced MapReduce developers, but we think beginners and gurus will find it useful too.

This book is also intended for anyone wanting to learn more about the MapReduce paradigm. The book goes deeply into the technical side of MapReduce with code examples and detailed explanations of the inner workings of a MapReduce system, which will help software engineers develop MapReduce analytics. However, quite a bit of time is spent discussing the motivation of some patterns and the common use cases for these patterns, which could be interesting to someone who just wants to know what a system like Hadoop can do.

To get the most out of this book, we suggest you have some knowledge of Hadoop, as all of the code examples are written for Hadoop and many of the patterns are discussed in a Hadoop context. A brief refresher will be given in the first chapter, along with some suggestions for additional reading material.

Pattern Format

The patterns in this book follow a single template format so they are easier to read in succession. Some patterns will omit some of the sections if they don’t make sense in the context of that pattern.

Intent

This section is a quick description of the problem the pattern is intended to solve.

Motivation

This section explains why you would want to solve this problem or where it would appear. Some use cases are typically discussed in brief.

Applicability

This section contains a set of criteria that must be true to be able to apply this pattern to a problem. Sometimes these are limitations in the design of the pattern and sometimes they help you make sure this pattern will work in your situation.

Structure

This section explains the layout of the MapReduce job itself. It explains what the map phase does, what the reduce phase does, and whether the job uses any custom partitioners, combiners, or input formats. This is the meat of the pattern and explains how to solve the problem.

Consequences

This section is short and simply explains what output the pattern produces, which is the end goal of applying it.

Resemblances

For readers who have some experience with SQL or Pig, this section shows how the same problem would be solved in those languages. You may even find yourself reading this section first, as it gets straight to the point of what the pattern does.

Sometimes, SQL, Pig, or both are omitted if what we are doing with MapReduce is truly unique.

Known Uses

This section outlines some common use cases for this pattern.

Performance Analysis

This section explains the performance profile of the analytic produced by the pattern. Understanding this is important because every MapReduce analytic needs to be tweaked and configured properly to maximize performance. Without the knowledge of what resources it is using on your cluster, it would be difficult to do this.

The Examples in This Book

All of the examples in this book are written for Hadoop version 1.0.3. MapReduce is a paradigm that is seen in a number of open source and commercial systems these days, but we had to pick one to make our examples consistent and easy to follow, so we picked Hadoop. Hadoop was a logical choice since it is a widely used system, but we hope that users of MongoDB’s MapReduce and other MapReduce implementations will be able to extrapolate the examples in this text to their particular system of choice.

Caution

In general, we try to use the newer mapreduce API for all of our examples, not the deprecated mapred API. Just be careful when mixing code from this book with other sources, as plenty of people still use mapred and their APIs are not compatible.

Our examples generally omit any sort of error handling, mostly to make the code more terse. In real-world big data systems, you can expect your data to be malformed and you’ll want to be proactive in handling those situations in your analytics.
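As a rough illustration of the kind of defensive check our examples leave out, the sketch below drops input lines that do not look like a single self-closing row element before any parsing happens. The class name RowFilter and both method names are hypothetical and are not part of the code that accompanies this book; the check itself is deliberately crude and is only an assumption about what "malformed" might mean for your data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the book's code: filters out lines that
// do not look like one complete <row ... /> record.
class RowFilter {

   // Returns true if the line starts with "<row " and ends with "/>"
   // once surrounding whitespace is ignored.
   static boolean looksLikeRow(String line) {
      if (line == null) {
         return false;
      }
      String trimmed = line.trim();
      return trimmed.startsWith("<row ") && trimmed.endsWith("/>");
   }

   // Keeps the lines that pass the check and drops the rest. A real job
   // would probably also count or log the dropped lines.
   static List<String> keepWellFormed(List<String> lines) {
      List<String> kept = new ArrayList<String>();
      for (String line : lines) {
         if (looksLikeRow(line)) {
            kept.add(line);
         }
      }
      return kept;
   }
}
```

In a mapper, the same test could be applied to each input value, returning early when it fails, so that one bad record does not bring down the whole task.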

We use the same data set throughout this text: a dump of StackOverflow’s databases. StackOverflow is a popular website where software developers go to ask and answer questions about any coding topic (including Hadoop). This data set was chosen because it is reasonable in size, yet not so big that you can’t use it on a single node. This data set also contains human-generated natural language text as well as “structured” elements like usernames and dates.

Throughout the examples in this book, we try to break out the parsing logic for this data set into helper functions to clearly distinguish which code is specific to this data set and which code is general and part of the pattern. Since the XML is pretty simple, we usually avoid using a full-blown XML parser and just parse it with some string operations in our Java code.

The data set contains five tables, of which we only use three: comments, posts, and users. All of the data is in well-formed XML, with one record per line.

We use the following three StackOverflow tables in this book:

comments

<row Id="2579740" PostId="2573882" Text="Are you getting any results? What
are you specifying as the command text?" CreationDate="2010-04-04T08:48:51.347"
UserId="95437" />

Comments are follow-up questions or suggestions users of the site can leave on posts (i.e., questions or answers).

posts

<row Id="6939296" PostTypeId="2" ParentId="6939137"
CreationDate="2011-08-04T09:50:25.043" Score="4" ViewCount=""
Body="&lt;p&gt;You should have imported Poll with &lt;code&gt;
from polls.models import Poll&lt;/code&gt;&lt;/p&gt;&#xA;"
OwnerUserId="634150" LastActivityDate="2011-08-04T09:50:25.043"
CommentCount="1" />

<row Id="6939304" PostTypeId="1" AcceptedAnswerId="6939433"
CreationDate="2011-08-04T09:50:58.910" Score="1" ViewCount="26"
Body="&lt;p&gt;Is it possible to gzip a single asp.net 3.5 page? my
site is hosted on IIS7 and for technical reasons I cannot enable gzip
compression site wide. does IIS7 have an option to gzip individual pages or
will I have to override OnPreRender and write some code to compress the
output?&lt;/p&gt;&#xA;" OwnerUserId="743184"
LastActivityDate="2011-08-04T10:19:04.107" Title="gzip a single asp.net page"
Tags="&lt;asp.net&gt;&lt;iis7&gt;&lt;gzip&gt;"
AnswerCount="2" />

Posts contain the questions and answers on the site. A user will post a question, and then other users are free to post answers to that question. Questions and answers can be upvoted and downvoted depending on whether readers think the post is constructive. To help categorize the questions, the creator of the question can specify a number of “tags,” which say what the post is about. In the example above, we see that this post is about asp.net, iis7, and gzip.
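To make the tag format concrete, here is a minimal sketch of how the escaped Tags attribute from the question above could be split into individual tag names using plain string operations. The class TagParser and its method parseTags are hypothetical names introduced for this illustration; they assume the attribute value is still XML-escaped, as it would be if the line had been parsed without a real XML parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the book's code.
class TagParser {

   // Turns "&lt;asp.net&gt;&lt;iis7&gt;&lt;gzip&gt;" into [asp.net, iis7, gzip].
   static List<String> parseTags(String escapedTags) {
      List<String> tags = new ArrayList<String>();
      if (escapedTags == null) {
         return tags;
      }

      // Undo the XML escaping of the angle brackets that delimit each tag.
      String unescaped = escapedTags.replace("&lt;", "<").replace("&gt;", ">");

      // Walk the string and collect the text between each '<' and '>'.
      int start = unescaped.indexOf('<');
      while (start >= 0) {
         int end = unescaped.indexOf('>', start + 1);
         if (end < 0) {
            break; // unterminated tag: stop rather than guess
         }
         String tag = unescaped.substring(start + 1, end).trim();
         if (!tag.isEmpty()) {
            tags.add(tag);
         }
         start = unescaped.indexOf('<', end + 1);
      }
      return tags;
   }
}
```

Answers carry no Tags attribute in the sample data, so a caller would typically pass whatever the lookup returned and get back an empty list for them.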

One thing to notice is that the body of the post is escaped HTML. This makes parsing it a bit more challenging, but it’s not too bad with all the tools available. Most of the questions and many of the answers can get to be pretty long!
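As one possible way of dealing with the escaped body, the sketch below first reverses the XML escaping and then strips the HTML tags to leave plain text. The class BodyText and its methods are hypothetical and written only for this illustration. It handles just a handful of entities (including the &#xA; newline seen in the sample rows) and uses a simple regular expression to remove tags, so treat it as an approximation; a proper HTML library would be the safer choice for anything beyond simple analytics.

```java
// Hypothetical helper, not part of the book's code.
class BodyText {

   // Reverses the XML escaping for a small set of entities.
   // "&amp;" is replaced last so that "&amp;lt;" becomes "&lt;" and not "<".
   static String unescapeXml(String s) {
      return s.replace("&lt;", "<")
              .replace("&gt;", ">")
              .replace("&quot;", "\"")
              .replace("&apos;", "'")
              .replace("&#xA;", "\n")
              .replace("&amp;", "&");
   }

   // Unescapes the body, replaces anything that looks like an HTML tag
   // with a space, and collapses runs of whitespace.
   static String toPlainText(String escapedBody) {
      String html = unescapeXml(escapedBody);
      String noTags = html.replaceAll("<[^>]*>", " ");
      return noTags.replaceAll("\\s+", " ").trim();
   }
}
```

Applied to the body of the sample answer above, toPlainText would leave the sentence about importing Poll without the surrounding paragraph and code markup.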

Posts are a bit more challenging because they contain both answers and questions intermixed. Questions have a PostTypeId of 1, while answers have a PostTypeId of 2. Answers point to their related question via the ParentId, a field that questions do not have. Questions, however, have a Title and Tags.
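The distinction between questions and answers can be sketched in a few lines, assuming a post has already been parsed into a map of attribute names to values. The class PostKind and its methods are hypothetical; in particular, questionId is only one plausible way to give a question and its answers a common key.

```java
import java.util.Map;

// Hypothetical helper, not part of the book's code.
class PostKind {

   // Questions have a PostTypeId of 1.
   static boolean isQuestion(Map<String, String> post) {
      return "1".equals(post.get("PostTypeId"));
   }

   // Answers have a PostTypeId of 2.
   static boolean isAnswer(Map<String, String> post) {
      return "2".equals(post.get("PostTypeId"));
   }

   // Returns the Id of the question a post belongs to: its own Id for a
   // question, its ParentId for an answer, and null for anything else.
   static String questionId(Map<String, String> post) {
      if (isQuestion(post)) {
         return post.get("Id");
      }
      if (isAnswer(post)) {
         return post.get("ParentId");
      }
      return null;
   }
}
```

Comparing against the string constants first ("1".equals(...)) keeps the checks safe when the PostTypeId attribute is missing from a record.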

users

<row Id="352268" Reputation="3313" CreationDate="2010-05-27T18:34:45.817"
DisplayName="orangeoctopus" EmailHash="93fc5e3d9451bcd3fdb552423ceb52cd"
LastAccessDate="2011-09-01T13:55:02.013" Location="Maryland" Age="26"
Views="48" UpVotes="294" DownVotes="4" />

The users table contains all of the data about the account holders on StackOverflow. Most of this information shows up in the user’s profile.

Users of StackOverflow have a reputation score, which goes up as other users upvote questions or answers that user has submitted to the website.

To learn more about the data set, refer to the documentation included with the download in README.txt.

In the examples, we parse the data set with a helper function that we wrote. This function takes in a line of StackOverflow data and returns a Map backed by a HashMap. The map stores the attribute names as the keys and the attribute values as the values.

package mrdp.utils;

import java.util.HashMap;
import java.util.Map;

public class MRDPUtils {

   // This helper function parses one line of StackOverflow XML into a Map
   //  of attribute names to attribute values.
   public static Map<String, String> transformXmlToMap(String xml) {
      Map<String, String> map = new HashMap<String, String>();
      try {
         // exploit the fact that splitting on double quote
         //  tokenizes the data nicely for us
         String[] tokens = xml.trim().substring(5, xml.trim().length() - 3)
            .split("\"");

         for (int i = 0; i < tokens.length - 1; i += 2) {
            String key = tokens[i].trim();
            String val = tokens[i + 1];

            map.put(key.substring(0, key.length() - 1), val);
         }
      } catch (StringIndexOutOfBoundsException e) {
         // malformed record: log the offending line and return whatever
         //  was parsed before the failure
         System.err.println(xml);
      }

      return map;
   }
}
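To show what the helper produces, here is a small usage sketch that runs the same logic on a shortened version of the sample comment row. The class name TransformDemo is hypothetical, and the parsing method is repeated inside it only so that the sketch compiles on its own; in the book's examples you would call MRDPUtils.transformXmlToMap directly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; the method body mirrors MRDPUtils.transformXmlToMap.
class TransformDemo {

   static Map<String, String> transformXmlToMap(String xml) {
      Map<String, String> map = new HashMap<String, String>();
      try {
         // Drop the leading "<row " and trailing " />", then split on quotes
         // so that names and values alternate in the token array.
         String[] tokens = xml.trim().substring(5, xml.trim().length() - 3)
            .split("\"");

         for (int i = 0; i < tokens.length - 1; i += 2) {
            String key = tokens[i].trim();
            String val = tokens[i + 1];

            // Strip the trailing '=' from the attribute name.
            map.put(key.substring(0, key.length() - 1), val);
         }
      } catch (StringIndexOutOfBoundsException e) {
         System.err.println(xml);
      }
      return map;
   }

   public static void main(String[] args) {
      String line = "<row Id=\"2579740\" PostId=\"2573882\" "
         + "Text=\"Are you getting any results?\" "
         + "CreationDate=\"2010-04-04T08:48:51.347\" UserId=\"95437\" />";

      Map<String, String> parsed = transformXmlToMap(line);
      System.out.println(parsed.get("PostId"));
      System.out.println(parsed.get("Text"));
   }
}
```

Two properties of this approach are worth keeping in mind. Values come back exactly as they appear in the file, so escaped markup in a Body stays escaped. And because the split relies on double quotes as delimiters, the approach assumes every attribute is written with double quotes and on a single line, which holds for the sample rows shown here.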

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic: Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width: Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold: Shows commands or other text that should be typed literally by the user.
Constant width italic: Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This icon signifies a tip or suggestion.

Note

This icon signifies a general note.

Caution

This icon indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “MapReduce Design Patterns by Donald Miner and Adam Shook (O’Reilly). Copyright 2013 Donald Miner and Adam Shook, 978-1-449-32717-0.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Safari® Books Online

Note

Safari Books Online (www.safaribooksonline.com) is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of plans and pricing for enterprise, government, education, and individuals.

Members have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://oreil.ly/mapreduce-design-patterns.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

Books published by O’Reilly are always top notch and now we know why first hand. The support staff, especially our editor Andy Oram, has been extremely helpful in guiding us through this process. They give freedom to the authors to convey the message while supporting us in any way we need.

Special thanks go out to those who read our book and provided useful commentary and reviews: Tom Wheeler, Patrick Angeles, Tom Kulish, and Lance Byrd. Thanks to Jeff Gold for providing some early encouragement and comments. We appreciate Eric Sammer’s help in finding reviewers and wish him luck with his book Hadoop Operations.

The StackOverflow data set, which is used throughout this book, is freely available under the Creative Commons license. It’s great that people are willing to spend the time to release the data set so that projects like this can make use of the content. What a truly wonderful contribution.

Don would like to thank his coworkers at Greenplum for their support: they provided slack in his schedule to work on this project, moral support, and technical suggestions. These folks from Greenplum have helped in one way or another, whether they realize it or not: Ian Andrews, Dan Baskette, Nick Cayou, Paul Cegielski, Will Davis, Andrew Ettinger, Mike Goddard, Jacque Istok, Mike Maxey, Michael Parks, and Parham Parvizi. Also, thanks to Andy O’Brien for contributing the chapter on Postgres.

Adam would like to thank his family, friends, and caffeine.

MapReduce Design Patterns by Donald Miner, Adam Shook
