The CUDA Handbook: A Comprehensive Guide to GPU Programming

Book Description

The CUDA Handbook begins where CUDA by Example (Addison-Wesley, 2011) leaves off, discussing CUDA hardware and software in greater detail and covering both CUDA 5.0 and Kepler. Every CUDA developer, from the casual to the most sophisticated, will find something here of interest and immediate usefulness. Newer CUDA developers will see how the hardware processes commands and how the driver checks progress; more experienced CUDA developers will appreciate the expert coverage of topics such as the driver API and context migration, as well as the guidance on how best to structure CPU/GPU data interchange and synchronization.

The accompanying open source code (more than 25,000 lines of it, freely available at www.cudahandbook.com) is specifically intended to be reused and repurposed by developers.

Designed to be both a comprehensive reference and a practical cookbook, the text is divided into the following three parts:

Part I, Overview, gives high-level descriptions of the hardware and software that make CUDA possible.

Part II, Details, provides thorough descriptions of every aspect of CUDA, including

  • Memory

  • Streams and events

  • Models of execution, including the dynamic parallelism feature, new with CUDA 5.0 and SM 3.5

  • The streaming multiprocessors, including descriptions of all features through SM 3.5

  • Programming multiple GPUs

  • Texturing

The source code accompanying Part II is presented as reusable microbenchmarks and microdemos, designed to expose specific hardware characteristics or highlight specific use cases.

Part III, Select Applications, details specific families of CUDA applications and key parallel algorithms, including

  • Streaming workloads

  • Reduction

  • Parallel prefix sum (Scan)

  • N-body

  • Image processing

These algorithms cover the full range of potential CUDA applications.

Table of Contents

  1. About This eBook
  2. Title Page
  3. Copyright Page
  4. Dedication Page
  5. Contents
  6. Preface
  7. Acknowledgments
  8. About the Author
  9. Part I
    1. Chapter 1. Background
      1. 1.1. Our Approach
      2. 1.2. Code
      3. 1.3. Administrative Items
      4. 1.4. Road Map
    2. Chapter 2. Hardware Architecture
      1. 2.1. CPU Configurations
      2. 2.2. Integrated GPUs
      3. 2.3. Multiple GPUs
      4. 2.4. Address Spaces in CUDA
      5. 2.5. CPU/GPU Interactions
      6. 2.6. GPU Architecture
      7. 2.7. Further Reading
    3. Chapter 3. Software Architecture
      1. 3.1. Software Layers
      2. 3.2. Devices and Initialization
      3. 3.3. Contexts
      4. 3.4. Modules and Functions
      5. 3.5. Kernels (Functions)
      6. 3.6. Device Memory
      7. 3.7. Streams and Events
      8. 3.8. Host Memory
      9. 3.9. CUDA Arrays and Texturing
      10. 3.10. Graphics Interoperability
      11. 3.11. The CUDA Runtime and CUDA Driver API
    4. Chapter 4. Software Environment
      1. 4.1. nvcc—CUDA Compiler Driver
      2. 4.2. ptxas—the PTX Assembler
      3. 4.3. cuobjdump
      4. 4.4. nvidia-smi
      5. 4.5. Amazon Web Services
  10. Part II
    1. Chapter 5. Memory
      1. 5.1. Host Memory
      2. 5.2. Global Memory
      3. 5.3. Constant Memory
      4. 5.4. Local Memory
      5. 5.5. Texture Memory
      6. 5.6. Shared Memory
      7. 5.7. Memory Copy
    2. Chapter 6. Streams and Events
      1. 6.1. CPU/GPU Concurrency: Covering Driver Overhead
      2. 6.2. Asynchronous Memcpy
      3. 6.3. CUDA Events: CPU/GPU Synchronization
      4. 6.4. CUDA Events: Timing
      5. 6.5. Concurrent Copying and Kernel Processing
      6. 6.6. Mapped Pinned Memory
      7. 6.7. Concurrent Kernel Processing
      8. 6.8. GPU/GPU Synchronization: cudaStreamWaitEvent()
      9. 6.9. Source Code Reference
    3. Chapter 7. Kernel Execution
      1. 7.1. Overview
      2. 7.2. Syntax
      3. 7.3. Blocks, Threads, Warps, and Lanes
      4. 7.4. Occupancy
      5. 7.5. Dynamic Parallelism
    4. Chapter 8. Streaming Multiprocessors
      1. 8.1. Memory
      2. 8.2. Integer Support
      3. 8.3. Floating-Point Support
      4. 8.4. Conditional Code
      5. 8.5. Textures and Surfaces
      6. 8.6. Miscellaneous Instructions
      7. 8.7. Instruction Sets
    5. Chapter 9. Multiple GPUs
      1. 9.1. Overview
      2. 9.2. Peer-to-Peer
      3. 9.3. UVA: Inferring Device from Address
      4. 9.4. Inter-GPU Synchronization
      5. 9.5. Single-Threaded Multi-GPU
      6. 9.6. Multithreaded Multi-GPU
    6. Chapter 10. Texturing
      1. 10.1. Overview
      2. 10.2. Texture Memory
      3. 10.3. 1D Texturing
      4. 10.4. Texture as a Read Path
      5. 10.5. Texturing with Unnormalized Coordinates
      6. 10.6. Texturing with Normalized Coordinates
      7. 10.7. 1D Surface Read/Write
      8. 10.8. 2D Texturing
      9. 10.9. 2D Texturing: Copy Avoidance
      10. 10.10. 3D Texturing
      11. 10.11. Layered Textures
      12. 10.12. Optimal Block Sizing and Performance
      13. 10.13. Texturing Quick References
  11. Part III
    1. Chapter 11. Streaming Workloads
      1. 11.1. Device Memory
      2. 11.2. Asynchronous Memcpy
      3. 11.3. Streams
      4. 11.4. Mapped Pinned Memory
      5. 11.5. Performance and Summary
    2. Chapter 12. Reduction
      1. 12.1. Overview
      2. 12.2. Two-Pass Reduction
      3. 12.3. Single-Pass Reduction
      4. 12.4. Reduction with Atomics
      5. 12.5. Arbitrary Block Sizes
      6. 12.6. Reduction Using Arbitrary Data Types
      7. 12.7. Predicate Reduction
      8. 12.8. Warp Reduction with Shuffle
    3. Chapter 13. Scan
      1. 13.1. Definition and Variations
      2. 13.2. Overview
      3. 13.3. Scan and Circuit Design
      4. 13.4. CUDA Implementations
      5. 13.5. Warp Scans
      6. 13.6. Stream Compaction
      7. 13.7. References (Parallel Scan Algorithms)
      8. 13.8. Further Reading (Parallel Prefix Sum Circuits)
    4. Chapter 14. N-Body
      1. 14.1. Introduction
      2. 14.2. Naïve Implementation
      3. 14.3. Shared Memory
      4. 14.4. Constant Memory
      5. 14.5. Warp Shuffle
      6. 14.6. Multiple GPUs and Scalability
      7. 14.7. CPU Optimizations
      8. 14.8. Conclusion
      9. 14.9. References and Further Reading
    5. Chapter 15. Image Processing: Normalized Correlation
      1. 15.1. Overview
      2. 15.2. Naïve Texture-Texture Implementation
      3. 15.3. Template in Constant Memory
      4. 15.4. Image in Shared Memory
      5. 15.5. Further Optimizations
      6. 15.6. Source Code
      7. 15.7. Performance and Further Work
      8. 15.8. Further Reading
  12. Appendix A. The CUDA Handbook Library
    1. A.1. Timing
    2. A.2. Threading
    3. A.3. Driver API Facilities
    4. A.4. Shmoos
    5. A.5. Command Line Parsing
    6. A.6. Error Handling
  13. Glossary / TLA Decoder
  14. Index