Mastering Parallel Programming with R

Book Description

Master the robust features of R parallel programming to accelerate your data science computations

About This Book

  • Create R programs that exploit the computational capability of your cloud platforms and computers to the fullest

  • Become an expert in writing efficient, high-performance parallel algorithms in R

  • Get to grips with the concept of parallelism to accelerate your existing R programs

Who This Book Is For

This book is for R programmers who want to move beyond R’s inherent single-threaded execution and restricted memory, and learn how to implement the accelerated, scalable algorithms needed to process Big Data efficiently. No previous knowledge of parallelism is required. The book also serves more advanced programmers who want to go beyond high-level parallel frameworks.

What You Will Learn

  • Create and structure efficient load-balanced parallel computation in R, using R’s built-in parallel package

  • Deploy and utilize cloud-based parallel infrastructure from R, including launching a distributed computation on Hadoop running on Amazon Web Services (AWS)

  • Understand parallel efficiency, and apply simple techniques to benchmark your own code, measure its speed, and target improvements

  • Develop complex parallel processing algorithms with the standard Message Passing Interface (MPI) using the Rmpi, pbdMPI, and SPRINT packages

  • Build and extend a parallel R package (SPRINT) with your own MPI-based routines

  • Implement accelerated numerical functions in R utilizing the vector processing capability of your Graphics Processing Unit (GPU) with OpenCL

  • Understand parallel programming pitfalls, such as deadlock and numerical instability, and the approaches to handle and avoid them

  • Build task farm (master-worker), spatial grid, and hybrid parallel R programs

In Detail

R is one of the most popular programming languages used in data science. Applying R to big data and complex analytic tasks requires harnessing scalable compute resources.

Mastering Parallel Programming with R presents a comprehensive and practical treatise on how to build highly scalable and efficient algorithms in R. It will teach you a variety of parallelization techniques, from simple use of the lapply() variants in R’s built-in parallel package to high-level AWS cloud-based Hadoop and Apache Spark frameworks. It will also teach you low-level scalable parallel programming using Rmpi and pbdMPI for message passing, applicable to clusters and supercomputers, and how to exploit GPUs, with their thousands of simple processors, through ROpenCL. By the end of the book, you will understand the factors that influence parallel efficiency, including assessing code performance and implementing load balancing; pitfalls to avoid, including deadlock and numerical instability issues; how to structure your code and data for the most appropriate type of parallelism for your problem domain; and how to extract the maximum performance from your R code running on a variety of computer systems.

Style and Approach

This book leads you chapter by chapter from the easy to the more complex forms of parallelism. The author’s insights are presented through clear, practical examples applied to a range of different problems, with comprehensive reference information for each of the R packages employed. You can read the book from start to finish or dip into individual chapters: each one describes a specific parallel approach and technology and can be read on its own.

Downloading the Example Code for This Book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files e-mailed directly to you.

Table of Contents

    1. Mastering Parallel Programming with R
      1. Table of Contents
      2. Mastering Parallel Programming with R
      3. Credits
      4. About the Authors
      5. About the Reviewers
      6. www.PacktPub.com
        1. eBooks, discount offers, and more
          1. Why subscribe?
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      8. 1. Simple Parallelism with R
        1. Aristotle's Number Puzzle
          1. Solver implementation
          2. Refining the solver
            1. Measuring the execution time
              1. Instrumenting code
          3. Splitting the problem into multiple tasks
            1. Executing multiple tasks with lapply()
        2. The R parallel package
          1. Using mclapply()
            1. Options for mclapply()
          2. Using parLapply()
          3. Parallel load balancing
        3. The segue package
          1. Installing segue
          2. Setting up your AWS account
          3. Running segue
            1. Options for createCluster()
            2. AWS console views
          4. Solving Aristotle's Number Puzzle
            1. Analyzing the results
        4. Summary
      9. 2. Introduction to Message Passing
        1. Setting up your system environment for MPI
          1. Choice of R packages for MPI
          2. Choice of MPI subsystems
          3. Installing OpenMPI
        2. The MPI standard
          1. The MPI universe
          2. Installing Rmpi
          3. Installing pbdMPI
        3. The MPI API
          1. Point-to-point blocking communications
            1. MPI intracommunicators
              1. The Rmpi workerdaemon.R script
          2. Point-to-point non-blocking communications
          3. Collective communications
        4. Summary
      10. 3. Advanced Message Passing
        1. Grid parallelism
          1. Creating the grid cluster
          2. Boundary data exchange
          3. The median filter
          4. Distributing the image as tiles
          5. Median filter grid program
            1. Performance
        2. Inspecting and managing communications
        3. Variants on lapply()
          1. parLapply() with Rmpi
        4. Summary
      11. 4. Developing SPRINT, an MPI-Based R Package for Supercomputers
        1. About ARCHER
        2. Calling MPI code from R
          1. MPI Hello World
          2. Calling C from R
            1. Modifying C code to make it callable from R
            2. Compiling MPI code into an R shared object
            3. Calling the MPI Hello World example from R
        3. Building an MPI R package – SPRINT
          1. The Simple Parallel R Interface (SPRINT) package
            1. Using a prebuilt SPRINT routine in an R script
          2. The architecture of the SPRINT package
        4. Adding a new function to the SPRINT package
          1. Downloading the SPRINT source code
          2. Creating a stub in R – phello.R
          3. Adding the interface function – phello.c
          4. Adding the implementation function – hello.c
          5. Connecting the stub, interface, and implementation
            1. functions.h
            2. functions.c
            3. Namespace
            4. Makefile
          6. Compiling and running the SPRINT code
        5. Genomics analysis case study
          1. Genomics
          2. Genomic data
        6. Genomics with a supercomputer
          1. The goal
          2. The ARCHER supercomputer
          3. Random Forests
          4. Data for the genomics analysis case study
          5. Random Forests performance on ARCHER
          6. Rank product
          7. Rank product performance on ARCHER
          8. Conclusions
        7. Summary
      12. 5. The Supercomputer in Your Laptop
        1. OpenCL
          1. Querying the OpenCL capabilities of your system
        2. The ROpenCL package
          1. The ROpenCL programming model
            1. A simple vector addition example
            2. The kernel function
              1. Line 1
              2. Line 2
              3. Line 3
              4. Memory qualifiers
              5. Understanding NDRange
          2. Distance matrix example
            1. Index of Multiple Deprivation
              1. Memory requirements
            2. GPU out-of-core memory processing
              1. The setup
              2. Kernel function dist1
              3. Work block control loop
              4. The kernel function dist2
        3. Summary
      13. 6. The Art of Parallel Programming
        1. Understanding parallel efficiency
          1. SpeedUp
          2. Amdahl's law
          3. To parallelize or not to parallelize
            1. Chapple's law
        2. Numerical approximation
        3. Random numbers
        4. Deadlock
          1. Avoiding deadlock
        5. Reducing the parallel overhead
        6. Adaptive load balancing
          1. The task farm
          2. Efficient grid processing
        7. Three steps to successful parallelization
        8. What does the future hold?
        9. Hybrid parallelism
        10. Summary
      14. Index