Multicore and GPU Programming

Book Description

Multicore and GPU Programming offers broad coverage of the key parallel computing skill sets: multicore CPU programming and many-core "massively parallel" computing. Using threads, OpenMP, MPI, and CUDA, it teaches the design and development of software that takes advantage of today's computing platforms combining CPU and GPU hardware, and explains how to transition from sequential programming to the parallel computing paradigm.

Presenting material refined over more than a decade of teaching parallel computing, author Gerassimos Barlas eases the transition with multiple examples, extensive case studies, and full source code. Using this book, you can develop programs that run over distributed-memory machines using MPI, create multithreaded applications with either libraries or directives, write optimized applications that balance the workload between the available computing resources, and profile and debug programs targeting multicore machines.

  • Comprehensive coverage of all major multicore programming tools, including threads, OpenMP, MPI, and CUDA
  • Demonstrates parallel programming design patterns and examples of how different tools and paradigms can be integrated for superior performance
  • Particular focus on the emerging area of divisible load theory and its impact on load balancing and distributed systems
  • Download source code, examples, and instructor support materials on the book's companion website

Table of Contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Dedication
  6. List of Tables
  7. Preface
    1. What is in this Book
    2. Using this Book as a Textbook
    3. Software and Hardware Requirements
    4. Sample Code
  8. Chapter 1: Introduction
    1. Abstract
    2. In this chapter you will
    3. 1.1 The era of multicore machines
    4. 1.2 A taxonomy of parallel machines
    5. 1.3 A glimpse of contemporary computing machines
    6. 1.4 Performance metrics
    7. 1.5 Predicting and measuring parallel program performance
    8. Exercises
  9. Chapter 2: Multicore and parallel program design
    1. Abstract
    2. In this chapter you will
    3. 2.1 Introduction
    4. 2.2 The PCAM methodology
    5. 2.3 Decomposition patterns
    6. 2.4 Program structure patterns
    7. 2.5 Matching decomposition patterns with program structure patterns
    8. Exercises
  10. Chapter 3: Shared-memory programming: threads
    1. Abstract
    2. In this chapter you will
    3. 3.1 Introduction
    4. 3.2 Threads
    5. 3.3 Design concerns
    6. 3.4 Semaphores
    7. 3.5 Applying semaphores in classical problems
    8. 3.6 Monitors
    9. 3.7 Applying monitors in classical problems
    10. 3.8 Dynamic vs. static thread management
    11. 3.9 Debugging multithreaded applications
    12. 3.10 Higher-level constructs: multithreaded programming without threads
    13. Exercises
  11. Chapter 4: Shared-memory programming: OpenMP
    1. Abstract
    2. In this chapter you will
    3. 4.1 Introduction
    4. 4.2 Your first OpenMP program
    5. 4.3 Variable scope
    6. 4.4 Loop-level parallelism
    7. 4.5 Task parallelism
    8. 4.6 Synchronization constructs
    9. 4.7 Correctness and optimization issues
    10. 4.8 A case study: sorting in OpenMP
  12. Chapter 5: Distributed memory programming
    1. Abstract
    2. In this chapter you will
    3. 5.1 Communicating processes
    4. 5.2 MPI
    5. 5.3 Core concepts
    6. 5.4 Your first MPI program
    7. 5.5 Program architecture
    8. 5.6 Point-to-point communication
    9. 5.7 Alternative point-to-point communication modes
    10. 5.8 Non-blocking communications
    11. 5.9 Point-to-point communications: summary
    12. 5.10 Error reporting and handling
    13. 5.11 Collective communications
    14. 5.12 Communicating objects
    15. 5.13 Node management: communicators and groups
    16. 5.14 One-sided communications
    17. 5.15 I/O considerations
    18. 5.16 Combining MPI processes with threads
    19. 5.17 Timing and performance measurements
    20. 5.18 Debugging and profiling MPI programs
    21. 5.19 The Boost.MPI library
    22. 5.20 A case study: diffusion-limited aggregation
    23. 5.21 A case study: brute-force encryption cracking
    24. 5.22 A case study: MPI implementation of the master-worker pattern
    25. Exercises
  13. Chapter 6: GPU programming
    1. Abstract
    2. In this chapter you will
    3. 6.1 GPU programming
    4. 6.2 CUDA’s programming model: threads, blocks, and grids
    5. 6.3 CUDA’s execution model: streaming multiprocessors and warps
    6. 6.4 CUDA compilation process
    7. 6.5 Putting together a CUDA project
    8. 6.6 Memory hierarchy
    9. 6.7 Optimization techniques
    10. 6.8 Dynamic parallelism
    11. 6.9 Debugging CUDA programs
    12. 6.10 Profiling CUDA programs
    13. 6.11 CUDA and MPI
    14. 6.12 Case studies
    15. Exercises
  14. Chapter 7: The Thrust template library
    1. Abstract
    2. In this chapter you will
    3. 7.1 Introduction
    4. 7.2 First steps in Thrust
    5. 7.3 Working with Thrust datatypes
    6. 7.4 Thrust algorithms
    7. 7.5 Fancy iterators
    8. 7.6 Switching device back ends
    9. 7.7 Case studies
    10. Exercises
  15. Chapter 8: Load balancing
    1. Abstract
    2. In this chapter you will
    3. 8.1 Introduction
    4. 8.2 Dynamic load balancing: the Linda legacy
    5. 8.3 Static load balancing: the divisible load theory approach
    6. 8.4 DLTlib: a library for partitioning workloads
    7. 8.5 Case studies
    8. Exercises
  16. Appendix A: Compiling Qt programs
    1. A.1 Using an IDE
    2. A.2 The qmake Utility
  17. Appendix B: Running MPI programs: preparatory configuration steps
    1. B.1 Preparatory Steps
    2. B.2 Computing Nodes Discovery for MPI Program Deployment
  18. Appendix C: Time measurement
    1. C.1 Introduction
    2. C.2 POSIX High-Resolution Timing
    3. C.3 Timing in Qt
    4. C.4 Timing in OpenMP
    5. C.5 Timing in MPI
    6. C.6 Timing in CUDA
  19. Appendix D: Boost.MPI
    1. D.1 Mapping from MPI C to Boost.MPI
  20. Appendix E: Setting up CUDA
    1. E.1 Installation
    2. E.2 Issues with GCC
    3. E.3 Running CUDA Without an Nvidia GPU
    4. E.4 Running CUDA on Optimus-Equipped Laptops
    5. E.5 Combining CUDA with Third-Party Libraries
  21. Appendix F: DLTlib
    1. F.1 DLTlib Functions
    2. F.2 DLTlib Files
  22. Glossary
  23. Bibliography
  24. Index