Parallel R by Stephen Weston, Q. Ethan McCallum

Chapter 1. Getting Started

This chapter sets the pace for the rest of the book. If you’re in a hurry, feel free to skip to the chapter you need. (The section “In a Hurry?” has a quick-reference look at the various strategies and where they fit. That should help you pick a starting point.) Just make sure you come back here to understand our choice of vocabulary, how we chose what to cover, and so on.

Why R?

It’s tough to argue with R. Who could dislike a high-quality, cross-platform, open-source statistical software product? It has an interactive console for exploratory work. It can run as a scripting language to repeat a process you’ve captured. It has a lot of statistical calculations built-in so you don’t have to reinvent the wheel. Did we mention that R is free?

When the base toolset isn’t enough, R users have access to a rich ecosystem of add-on packages and a gaggle of GUIs to make their lives even easier. No wonder R has become a favorite in the age of Big Data.

Since R is perfect, then, we can end this book. Right?

Not quite. It’s precisely the Big Data age that has exposed R’s blemishes.

Why Not R?

These imperfections stem not from defects in the software itself, but from the passage of time: quite simply, R was not built in anticipation of the Big Data revolution.

R was born in 1995. Disk space was expensive, RAM even more so, and this thing called The Internet was just getting its legs. Notions of “large-scale data analysis” and “high-performance computing” were reasonably rare. Outside of Wall Street firms and university research labs, there just wasn’t that much data to crunch.

Fast-forward to the present day and hardware costs just a fraction of what it used to. Computing power is available online for pennies. Everyone is suddenly interested in collecting and analyzing data, and the necessary resources are well within reach.

This surge in data analysis has brought two of R’s limitations to the forefront: it’s single-threaded and memory-bound. Allow us to explain:

It’s single-threaded

The R language has no explicit constructs for parallelism, such as threads or mutexes. An out-of-the-box R install cannot take advantage of multiple CPUs.

It’s memory-bound

R requires that your entire dataset[1] fit in memory (RAM).[2] Four gigabytes of RAM will not hold eight gigabytes of data, no matter how much you smile when you ask.

While these are certainly inconvenient, they’re hardly insurmountable.
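Both limits are easy to observe from an R session. The sketch below is an illustration rather than part of the original text. It assumes R 2.14.0 or later, so that the parallel package and its detectCores() function are available, and it uses object.size() to estimate how much RAM one vector occupies. The variable names are our own.

```r
library(parallel)  # assumption: R 2.14.0 or later, where this package ships with R

# How many logical CPU cores does this machine report? (Can be NA on some platforms.)
# A plain R session uses only one of them for ordinary R code.
n_cores <- detectCores()

# A numeric vector needs 8 bytes per element plus a small header,
# so one million doubles occupy about 8 million bytes of RAM.
x <- numeric(1e6)
bytes <- as.numeric(utils::object.size(x))

cat("cores reported:", n_cores, "\n")
cat("vector size in bytes:", bytes, "\n")
```

Scaling that arithmetic up gives a rough sense of when a dataset will stop fitting in memory, before counting any copies R makes along the way.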

The Solution: Parallel Execution

People have created a series of workarounds over the years. Doing a lot of matrix math? You can build R against a multithreaded Basic Linear Algebra Subprograms (BLAS) library. Churning through large datasets? Use a relational database or another manual method to retrieve your data in smaller, more manageable pieces. And so on, and so forth.
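As a rough illustration of the "smaller, more manageable pieces" workaround, the sketch below reads a text file a few lines at a time through a connection and keeps only a running total in memory. The file, its contents, and the chunk size are hypothetical, chosen so the example is self-contained; a real dataset would call for a much larger chunk size.

```r
# Create a tiny sample file so the sketch runs on its own (hypothetical data).
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:10), path)

con <- file(path, open = "r")
total <- 0
repeat {
  # Read at most 4 lines; an open connection remembers its position.
  chunk <- readLines(con, n = 4)
  if (length(chunk) == 0) break          # end of file reached
  total <- total + sum(as.numeric(chunk))
}
close(con)
unlink(path)

print(total)
```

Only one chunk is held in memory at any moment, which is what lets this pattern handle files larger than RAM, at the cost of restructuring the computation by hand.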

Some big winners involve parallelism. Spreading work across multiple CPUs overcomes R’s single-threaded nature. Offloading work to multiple machines reaps the multi-process benefit and also addresses R’s memory barrier. In this book we’ll cover a few strategies to give R that parallel boost, specifically those which take advantage of modern multicore hardware and cheap distributed computing.

A Road Map for This Book

Now that we’ve set the tone for why we’re here, let’s take a look at what we plan to accomplish in the coming pages (or screens if you’re reading this electronically).

What We’ll Cover

Each chapter is a look into one strategy for R parallelism, including:

  • What it is

  • Where to find it

  • How to use it

  • Where it works well, and where it doesn’t

First up is the snow package, followed by a tour of the multicore package. We then provide a look at the new parallel package that’s due to arrive in R 2.14. After that, we’ll take a brief side-tour to explain MapReduce and Hadoop. That will serve as a foundation for the remaining chapters: R+Hadoop (Hadoop streaming and the Java API), RHIPE, and segue.

Looking Forward…

In Chapter 9, we will briefly mention some tools that were too new for us to cover in-depth.

There will likely be other tools we hadn’t heard about (or that didn’t exist) at the time of writing.[3] Please let us know about them! You can reach us through this book’s website at http://parallelrbook.com/.

What We’ll Assume You Already Know

This is a book about R, yes, but we expect you to know the basics of how to get around. If you’re new to R or need a refresher course, please flip through Paul Teetor’s R Cookbook (O’Reilly), Robert Kabacoff’s R in Action (Manning), or another introductory title. You should take particular note of the lapply() function, which plays an important role in this book.
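For readers who want a quick reminder, here is a minimal lapply() call. The input vector and the anonymous function are arbitrary choices for illustration.

```r
# lapply() applies a function to each element of a vector or list
# and always returns a list of the same length.
squares <- lapply(1:4, function(x) x^2)

str(squares)     # a list with four numeric elements
unlist(squares)  # flatten to a plain vector when that is more convenient
```

Each call to the function is independent of the others, which is the property that makes this pattern a natural candidate for running in parallel.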

Some of the topics require several machines’ worth of infrastructure, in which case you’ll need access to a talented sysadmin. You’ll also need hardware, which you can buy and maintain yourself, or rent from a hosting provider. Cloud services, notably Amazon Web Services (AWS),[4] have become a popular choice in this arena. AWS has plenty of documentation, and you can also read Programming Amazon EC2, by Jurg van Vliet and Flavia Paganelli (O’Reilly) as a supplement.

(Please note that using a provider still requires a degree of sysadmin knowledge. If you’re not up to the task, you’ll want to find and bribe your skilled sysadmin friends.)

In a Hurry?

If you’re in a hurry, you can skip straight to the chapter you need. The list below is a quick look at the various strategies.

snow

Overview: Good for use on traditional clusters, especially if MPI is available. It supports MPI, PVM, nws, and sockets for communication, and is quite portable, running on Linux, Mac OS X, and Windows.

Solves: Single-threaded, memory-bound.

Pros: Mature, popular package; leverages MPI’s speed without its complexity.

Cons: Can be difficult to configure.

multicore

Overview: Good for big-CPU problems when setting up a Hadoop cluster is too much of a hassle. Lets you parallelize your R code without ever leaving the R interpreter.

Solves: Single-threaded.

Pros: Simple and efficient; easy to install; no configuration needed.

Cons: Can only use one machine; doesn’t support Windows; no built-in support for parallel random number generation (RNG).

parallel

Overview: A merger of snow and multicore that comes built into R as of R 2.14.0.

Solves: Single-threaded, memory-bound.

Pros: No installation necessary; has great support for parallel random number generation.

Cons: Can only use one machine on Windows; can be difficult to configure on multiple Linux machines.
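To give a feel for the two styles that parallel combines, here is a minimal sketch that runs the same toy computation both ways on a single machine. It assumes R 2.14.0 or later and a machine that allows local socket connections. The worker count of 2, the seed, and the function square() are illustrative choices, not recommendations.

```r
library(parallel)

square <- function(x) x^2   # toy workload (hypothetical)

# snow-style: start two local worker processes that talk over sockets.
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 123)  # independent, reproducible RNG streams per worker
res_cluster <- parLapply(cl, 1:4, square)
stopCluster(cl)                       # always shut the workers down

# multicore-style: fork the current R process. Forking is not available
# on Windows, where mc.cores must stay at 1.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res_fork <- mclapply(1:4, square, mc.cores = cores)

identical(res_cluster, res_fork)      # both approaches return the same list
```

The cluster style needs explicit setup and teardown but is the one that can extend to several machines; the fork style needs no setup but stays on one machine.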

R+Hadoop

Overview: Run your R code on a Hadoop cluster.

Solves: Single-threaded, memory-bound.

Pros: You get Hadoop’s scalability.

Cons: Requires a Hadoop cluster (internal or cloud-based); breaks up a single logical process into multiple scripts and steps (can be a hassle for exploratory work).

RHIPE

Overview: Talk Hadoop without ever leaving the R interpreter.

Solves: Single-threaded, memory-bound.

Pros: Closer to a native R experience than R+Hadoop; use pure R code for your MapReduce operations.

Cons: Requires a Hadoop cluster; requires extra setup on the cluster; cannot process standard SequenceFiles (for binary data).

segue

Overview: Seamlessly send R apply-like calculations to a remote Hadoop cluster.

Solves: Single-threaded, memory-bound.

Pros: Abstracts you from Elastic MapReduce management.

Cons: Cannot use with an internal Hadoop cluster (you’re tied to Amazon’s Elastic MapReduce).

Summary

Welcome to the beginning of your journey into parallel R. Our first stop is a look at the popular snow package.



[1] We emphasize “dataset” here, not necessarily “algorithms.”

[2] It’s a big problem. Because R will often make multiple copies of the same data structure for no apparent reason, you often need three times as much memory as the size of your dataset. And if you don’t have enough memory, you die a slow death as your poor machine swaps and thrashes. Some people turn off virtual memory with the swapoff command so they can die quickly.

[3] Try as we might, our massive Monte Carlo simulations have brought us no closer to predicting the next R parallelism strategy. Nor any winning lottery numbers, for that matter.
