The CUDA Handbook: A Comprehensive Guide to GPU Programming

Book Description

The CUDA Handbook begins where CUDA by Example (Addison-Wesley, 2011) leaves off, discussing CUDA hardware and software in greater detail and covering both CUDA 5.0 and Kepler. Every CUDA developer, from the casual to the most sophisticated, will find something here of interest and immediate usefulness. Newer CUDA developers will see how the hardware processes commands and how the driver checks progress; more experienced CUDA developers will appreciate the expert coverage of topics such as the driver API and context migration, as well as the guidance on how best to structure CPU/GPU data interchange and synchronization.

The accompanying open source code (more than 25,000 lines of it, freely available at www.cudahandbook.com) is specifically intended to be reused and repurposed by developers.

Designed to be both a comprehensive reference and a practical cookbook, the text is divided into the following three parts:

Part I, Overview, gives high-level descriptions of the hardware and software that make CUDA possible.

Part II, Details, provides thorough descriptions of every aspect of CUDA, including

  • Memory

  • Streams and events

  • Models of execution, including the dynamic parallelism feature, new with CUDA 5.0 and SM 3.5

  • The streaming multiprocessors, including descriptions of all features through SM 3.5

  • Programming multiple GPUs

  • Texturing

The source code accompanying Part II is presented as reusable microbenchmarks and microdemos, designed to expose specific hardware characteristics or highlight specific use cases.

Part III, Select Applications, details specific families of CUDA applications and key parallel algorithms, including

  • Streaming workloads

  • Reduction

  • Parallel prefix sum (Scan)

  • N-body

  • Image Processing

These algorithms cover the full range of potential CUDA applications.

Table of Contents

    1. About This eBook
    2. Title Page
    3. Copyright Page
    4. Dedication Page
    5. Contents
    6. Preface
    7. Acknowledgments
    8. About the Author
    9. Part I
      1. Chapter 1. Background
        1. 1.1. Our Approach
        2. 1.2. Code
        3. 1.3. Administrative Items
        4. 1.4. Road Map
      2. Chapter 2. Hardware Architecture
        1. 2.1. CPU Configurations
        2. 2.2. Integrated GPUs
        3. 2.3. Multiple GPUs
        4. 2.4. Address Spaces in CUDA
        5. 2.5. CPU/GPU Interactions
        6. 2.6. GPU Architecture
        7. 2.7. Further Reading
      3. Chapter 3. Software Architecture
        1. 3.1. Software Layers
        2. 3.2. Devices and Initialization
        3. 3.3. Contexts
        4. 3.4. Modules and Functions
        5. 3.5. Kernels (Functions)
        6. 3.6. Device Memory
        7. 3.7. Streams and Events
        8. 3.8. Host Memory
        9. 3.9. CUDA Arrays and Texturing
        10. 3.10. Graphics Interoperability
        11. 3.11. The CUDA Runtime and CUDA Driver API
      4. Chapter 4. Software Environment
        1. 4.1. nvcc—CUDA Compiler Driver
        2. 4.2. ptxas—the PTX Assembler
        3. 4.3. cuobjdump
        4. 4.4. nvidia-smi
        5. 4.5. Amazon Web Services
    10. Part II
      1. Chapter 5. Memory
        1. 5.1. Host Memory
        2. 5.2. Global Memory
        3. 5.3. Constant Memory
        4. 5.4. Local Memory
        5. 5.5. Texture Memory
        6. 5.6. Shared Memory
        7. 5.7. Memory Copy
      2. Chapter 6. Streams and Events
        1. 6.1. CPU/GPU Concurrency: Covering Driver Overhead
        2. 6.2. Asynchronous Memcpy
        3. 6.3. CUDA Events: CPU/GPU Synchronization
        4. 6.4. CUDA Events: Timing
        5. 6.5. Concurrent Copying and Kernel Processing
        6. 6.6. Mapped Pinned Memory
        7. 6.7. Concurrent Kernel Processing
        8. 6.8. GPU/GPU Synchronization: cudaStreamWaitEvent()
        9. 6.9. Source Code Reference
      3. Chapter 7. Kernel Execution
        1. 7.1. Overview
        2. 7.2. Syntax
        3. 7.3. Blocks, Threads, Warps, and Lanes
        4. 7.4. Occupancy
        5. 7.5. Dynamic Parallelism
      4. Chapter 8. Streaming Multiprocessors
        1. 8.1. Memory
        2. 8.2. Integer Support
        3. 8.3. Floating-Point Support
        4. 8.4. Conditional Code
        5. 8.5. Textures and Surfaces
        6. 8.6. Miscellaneous Instructions
        7. 8.7. Instruction Sets
      5. Chapter 9. Multiple GPUs
        1. 9.1. Overview
        2. 9.2. Peer-to-Peer
        3. 9.3. UVA: Inferring Device from Address
        4. 9.4. Inter-GPU Synchronization
        5. 9.5. Single-Threaded Multi-GPU
        6. 9.6. Multithreaded Multi-GPU
      6. Chapter 10. Texturing
        1. 10.1. Overview
        2. 10.2. Texture Memory
        3. 10.3. 1D Texturing
        4. 10.4. Texture as a Read Path
        5. 10.5. Texturing with Unnormalized Coordinates
        6. 10.6. Texturing with Normalized Coordinates
        7. 10.7. 1D Surface Read/Write
        8. 10.8. 2D Texturing
        9. 10.9. 2D Texturing: Copy Avoidance
        10. 10.10. 3D Texturing
        11. 10.11. Layered Textures
        12. 10.12. Optimal Block Sizing and Performance
        13. 10.13. Texturing Quick References
    11. Part III
      1. Chapter 11. Streaming Workloads
        1. 11.1. Device Memory
        2. 11.2. Asynchronous Memcpy
        3. 11.3. Streams
        4. 11.4. Mapped Pinned Memory
        5. 11.5. Performance and Summary
      2. Chapter 12. Reduction
        1. 12.1. Overview
        2. 12.2. Two-Pass Reduction
        3. 12.3. Single-Pass Reduction
        4. 12.4. Reduction with Atomics
        5. 12.5. Arbitrary Block Sizes
        6. 12.6. Reduction Using Arbitrary Data Types
        7. 12.7. Predicate Reduction
        8. 12.8. Warp Reduction with Shuffle
      3. Chapter 13. Scan
        1. 13.1. Definition and Variations
        2. 13.2. Overview
        3. 13.3. Scan and Circuit Design
        4. 13.4. CUDA Implementations
        5. 13.5. Warp Scans
        6. 13.6. Stream Compaction
        7. 13.7. References (Parallel Scan Algorithms)
        8. 13.8. Further Reading (Parallel Prefix Sum Circuits)
      4. Chapter 14. N-Body
        1. 14.1. Introduction
        2. 14.2. Naïve Implementation
        3. 14.3. Shared Memory
        4. 14.4. Constant Memory
        5. 14.5. Warp Shuffle
        6. 14.6. Multiple GPUs and Scalability
        7. 14.7. CPU Optimizations
        8. 14.8. Conclusion
        9. 14.9. References and Further Reading
      5. Chapter 15. Image Processing: Normalized Correlation
        1. 15.1. Overview
        2. 15.2. Naïve Texture-Texture Implementation
        3. 15.3. Template in Constant Memory
        4. 15.4. Image in Shared Memory
        5. 15.5. Further Optimizations
        6. 15.6. Source Code
        7. 15.7. Performance and Further Reading
        8. 15.8. Further Reading
    12. Appendix A. The CUDA Handbook Library
      1. A.1. Timing
      2. A.2. Threading
      3. A.3. Driver API Facilities
      4. A.4. Shmoos
      5. A.5. Command Line Parsing
      6. A.6. Error Handling
    13. Glossary / TLA Decoder
    14. Index