Other Packages for Parallel Computation with R

Segue

The segue package by JD Long is a good choice for running simple parallel programs; it’s intended as a gentle introduction to parallel computation. Segue runs programs in the cloud using AWS’s Elastic MapReduce service. (This is a distinct product from EC2, which I used to install my own private Hadoop cluster.) It borrows some Hadoop infrastructure, but it isn’t a full map/reduce package. Segue is modeled on R’s apply family of functions: you use it to apply a function to each element of a data set, with the work spread across a set of computers in the cloud. Let’s see how it works.
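
To make the apply analogy concrete, here is a minimal local sketch in base R. It does not use Segue or AWS; it only shows the shape of computation that Segue is described as distributing: one function applied independently to each element of a data set. The names samples and summarize_sample are hypothetical and chosen purely for illustration.

```r
# Hypothetical data: a list with one element per independent task.
samples <- list(a = c(1, 2, 3), b = c(10, 20, 30), c = c(5, 5, 5))

# Hypothetical worker function; it sees one list element at a time.
summarize_sample <- function(x) mean(x)

# Local, sequential version using base R's lapply().
# Each call is independent of the others, which is what makes
# this pattern a natural fit for running across several machines.
local_results <- lapply(samples, summarize_sample)

print(unlist(local_results))
```

Because no call depends on another's result, the elements can be processed in any order or on different machines without changing the answer.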

The segue package is hosted on Google Code, not CRAN. To install it, you can use the install_url function from the devtools package:

> library(devtools)
> # At the time I wrote this book, the current version was 0.05;
> # make sure to change the link to get the latest version:
> install_url("http://segue.googlecode.com/files/segue_0.05.tar.gz")

You’ll need an Amazon Web Services account to use it.

Warning

You will be billed by the hour for using AWS. Make sure that you understand how you will be charged and how to use AWS before you start.

You’ll need to get your Access Key ID and Secret Access Key from AWS’s Security Credentials page.

> library(segue)
Loading required package: rJava
Loading required package: caTools
Loading required package: bitops
Segue did not find your AWS credentials. Please run the setCredentials()
function.
> # set aws.access.id to your amazon access id, aws.secret.key to ...
