Programming Massively Parallel Processors

Book Description

Programming Massively Parallel Processors: A Hands-on Approach shows both student and professional alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. Topics of performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth.

This best-selling guide to CUDA and GPU parallel programming has been revised with more parallel programming examples, coverage of commonly used libraries such as Thrust, and explanations of the latest tools. With these improvements, the book retains its concise, intuitive, practical approach based on years of road-testing in the authors' own parallel computing courses.

Updates in this new edition include:

  • New coverage of CUDA 5.0, improved performance, enhanced development tools, increased hardware support, and more
  • Increased coverage of related technologies such as OpenCL, plus new material on algorithm patterns, GPU clusters, host programming, and data parallelism
  • Two new case studies (on MRI reconstruction and molecular visualization) explore the latest applications of CUDA and GPUs for scientific research and high-performance computing

Table of Contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Preface
    1. Target Audience
    2. How to Use the Book
    3. Online Supplements
  6. Acknowledgements
  7. Dedication
  8. Chapter 1. Introduction
    1. 1.1 Heterogeneous Parallel Computing
    2. 1.2 Architecture of a Modern GPU
    3. 1.3 Why More Speed or Parallelism?
    4. 1.4 Speeding Up Real Applications
    5. 1.5 Parallel Programming Languages and Models
    6. 1.6 Overarching Goals
    7. 1.7 Organization of the Book
    8. References
  9. Chapter 2. History of GPU Computing
    1. 2.1 Evolution of Graphics Pipelines
    2. 2.2 GPGPU: An Intermediate Step
    3. 2.3 GPU Computing
    4. References and Further Reading
  10. Chapter 3. Introduction to Data Parallelism and CUDA C
    1. 3.1 Data Parallelism
    2. 3.2 CUDA Program Structure
    3. 3.3 A Vector Addition Kernel
    4. 3.4 Device Global Memory and Data Transfer
    5. 3.5 Kernel Functions and Threading
    6. 3.6 Summary
    7. 3.7 Exercises
    8. References
  11. Chapter 4. Data-Parallel Execution Model
    1. 4.1 CUDA Thread Organization
    2. 4.2 Mapping Threads to Multidimensional Data
    3. 4.3 Matrix-Matrix Multiplication—A More Complex Kernel
    4. 4.4 Synchronization and Transparent Scalability
    5. 4.5 Assigning Resources to Blocks
    6. 4.6 Querying Device Properties
    7. 4.7 Thread Scheduling and Latency Tolerance
    8. 4.8 Summary
    9. 4.9 Exercises
  12. Chapter 5. CUDA Memories
    1. 5.1 Importance of Memory Access Efficiency
    2. 5.2 CUDA Device Memory Types
    3. 5.3 A Strategy for Reducing Global Memory Traffic
    4. 5.4 A Tiled Matrix–Matrix Multiplication Kernel
    5. 5.5 Memory as a Limiting Factor to Parallelism
    6. 5.6 Summary
    7. 5.7 Exercises
  13. Chapter 6. Performance Considerations
    1. 6.1 Warps and Thread Execution
    2. 6.2 Global Memory Bandwidth
    3. 6.3 Dynamic Partitioning of Execution Resources
    4. 6.4 Instruction Mix and Thread Granularity
    5. 6.5 Summary
    6. 6.6 Exercises
    7. References
  14. Chapter 7. Floating-Point Considerations
    1. 7.1 Floating-Point Format
    2. 7.2 Representable Numbers
    3. 7.3 Special Bit Patterns and Precision in IEEE Format
    4. 7.4 Arithmetic Accuracy and Rounding
    5. 7.5 Algorithm Considerations
    6. 7.6 Numerical Stability
    7. 7.7 Summary
    8. 7.8 Exercises
    9. References
  15. Chapter 8. Parallel Patterns: Convolution: With an Introduction to Constant Memory and Caches
    1. 8.1 Background
    2. 8.2 1D Parallel Convolution—A Basic Algorithm
    3. 8.3 Constant Memory and Caching
    4. 8.4 Tiled 1D Convolution with Halo Elements
    5. 8.5 A Simpler Tiled 1D Convolution—General Caching
    6. 8.6 Summary
    7. 8.7 Exercises
  16. Chapter 9. Parallel Patterns: Prefix Sum: An Introduction to Work Efficiency in Parallel Algorithms
    1. 9.1 Background
    2. 9.2 A Simple Parallel Scan
    3. 9.3 Work Efficiency Considerations
    4. 9.4 A Work-Efficient Parallel Scan
    5. 9.5 Parallel Scan for Arbitrary-Length Inputs
    6. 9.6 Summary
    7. 9.7 Exercises
    8. Reference
  17. Chapter 10. Parallel Patterns: Sparse Matrix–Vector Multiplication: An Introduction to Compaction and Regularization in Parallel Algorithms
    1. 10.1 Background
    2. 10.2 Parallel SpMV Using CSR
    3. 10.3 Padding and Transposition
    4. 10.4 Using Hybrid to Control Padding
    5. 10.5 Sorting and Partitioning for Regularization
    6. 10.6 Summary
    7. 10.7 Exercises
    8. References
  18. Chapter 11. Application Case Study: Advanced MRI Reconstruction
    1. 11.1 Application Background
    2. 11.2 Iterative Reconstruction
    3. 11.3 Computing F^H d
    4. 11.4 Final Evaluation
    5. 11.5 Exercises
    6. References
  19. Chapter 12. Application Case Study: Molecular Visualization and Analysis
    1. 12.1 Application Background
    2. 12.2 A Simple Kernel Implementation
    3. 12.3 Thread Granularity Adjustment
    4. 12.4 Memory Coalescing
    5. 12.5 Summary
    6. 12.6 Exercises
    7. References
  20. Chapter 13. Parallel Programming and Computational Thinking
    1. 13.1 Goals of Parallel Computing
    2. 13.2 Problem Decomposition
    3. 13.3 Algorithm Selection
    4. 13.4 Computational Thinking
    5. 13.5 Summary
    6. 13.6 Exercises
    7. References
  21. Chapter 14. An Introduction to OpenCL™
    1. 14.1 Background
    2. 14.2 Data Parallelism Model
    3. 14.3 Device Architecture
    4. 14.4 Kernel Functions
    5. 14.5 Device Management and Kernel Launch
    6. 14.6 Electrostatic Potential Map in OpenCL
    7. 14.7 Summary
    8. 14.8 Exercises
    9. References
  22. Chapter 15. Parallel Programming with OpenACC
    1. 15.1 OpenACC Versus CUDA C
    2. 15.2 Execution Model
    3. 15.3 Memory Model
    4. 15.4 Basic OpenACC Programs
    5. 15.5 Future Directions of OpenACC
    6. 15.6 Exercises
  23. Chapter 16. Thrust: A Productivity-Oriented Library for CUDA
    1. 16.1 Background
    2. 16.2 Motivation
    3. 16.3 Basic Thrust Features
    4. 16.4 Generic Programming
    5. 16.5 Benefits of Abstraction
    6. 16.6 Programmer Productivity
    7. 16.7 Best Practices
    8. 16.8 Exercises
    9. References
  24. Chapter 17. CUDA FORTRAN
    1. 17.1 CUDA FORTRAN and CUDA C Differences
    2. 17.2 A First CUDA FORTRAN Program
    3. 17.3 Multidimensional Array in CUDA FORTRAN
    4. 17.4 Overloading Host/Device Routines With Generic Interfaces
    5. 17.5 Calling CUDA C Via Iso_C_Binding
    6. 17.6 Kernel Loop Directives and Reduction Operations
    7. 17.7 Dynamic Shared Memory
    8. 17.8 Asynchronous Data Transfers
    9. 17.9 Compilation and Profiling
    10. 17.10 Calling Thrust from CUDA FORTRAN
    11. 17.11 Exercises
  25. Chapter 18. An Introduction to C++ AMP
    1. 18.1 Core C++ AMP Features
    2. 18.2 Details of the C++ AMP Execution Model
    3. 18.3 Managing Accelerators
    4. 18.4 Tiled Execution
    5. 18.5 C++ AMP Graphics Features
    6. 18.6 Summary
    7. 18.7 Exercises
  26. Chapter 19. Programming a Heterogeneous Computing Cluster
    1. 19.1 Background
    2. 19.2 A Running Example
    3. 19.3 MPI Basics
    4. 19.4 MPI Point-to-Point Communication Types
    5. 19.5 Overlapping Computation and Communication
    6. 19.6 MPI Collective Communication
    7. 19.7 Summary
    8. 19.8 Exercises
    9. Reference
  27. Chapter 20. CUDA Dynamic Parallelism
    1. 20.1 Background
    2. 20.2 Dynamic Parallelism Overview
    3. 20.3 Important Details
    4. 20.4 Memory Visibility
    5. 20.5 A Simple Example
    6. 20.6 Runtime Limitations
    7. 20.7 A More Complex Example
    8. 20.8 Summary
    9. Reference
  28. Chapter 21. Conclusion and Future Outlook
    1. 21.1 Goals Revisited
    2. 21.2 Memory Model Evolution
    3. 21.3 Kernel Execution Control Evolution
    4. 21.4 Core Performance
    5. 21.5 Programming Environment
    6. 21.6 Future Outlook
    7. References
  29. Appendix A. Matrix Multiplication Host-Only Version Source Code
    1. Appendix Outline
    2. A.1 matrixmul.cu
    3. A.2 matrixmul_gold.cpp
    4. A.3 matrixmul.h
    5. A.4 assist.h
    6. A.5 Expected Output
  30. Appendix B. GPU Compute Capabilities
    1. Appendix Outline
    2. B.1 GPU Compute Capability Tables
    3. B.2 Memory Coalescing Variations
  31. Index