Chapter 7. Techniques to Increase Parallelism
CUDA was designed to exploit both the massive parallelism inside a GPU and concurrent streams of execution, which make it possible to utilize multiple GPUs, overlap asynchronous data transfers, and run kernels simultaneously on a single device. By default, CUDA creates a single stream of execution on one GPU, which is usually device 0. All data transfers and kernel invocations are queued on this single stream and processed sequentially in the order they were queued. By explicitly creating and using multiple streams of execution, a CUDA programmer can perform more work per unit time and make applications run faster, as sketched below. For example, multiple GPUs can be utilized simply by changing the active device with cudaSetDevice().
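The following is a minimal sketch of the pattern, not code from the book: it splits an array across two explicitly created streams so that the host-to-device copy, kernel launch, and device-to-host copy queued on one stream can overlap with work on the other. The kernel name scaleKernel and the array sizes are illustrative assumptions; note that cudaMemcpyAsync() requires page-locked (pinned) host memory to actually overlap with computation.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel for illustration: doubles each element.
__global__ void scaleKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Asynchronous transfers need page-locked (pinned) host memory.
    float *h_a;
    cudaMallocHost((void **)&h_a, bytes);
    for (int i = 0; i < N; i++) h_a[i] = (float)i;

    float *d_a;
    cudaMalloc((void **)&d_a, bytes);

    // Two explicit streams; operations queued on different streams
    // may execute concurrently on devices that support overlap.
    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) cudaStreamCreate(&stream[s]);

    // Pipeline each half of the array: copy in, compute, copy out.
    const int half = N / 2;
    const size_t halfBytes = bytes / 2;
    for (int s = 0; s < 2; s++) {
        int offset = s * half;
        cudaMemcpyAsync(d_a + offset, h_a + offset, halfBytes,
                        cudaMemcpyHostToDevice, stream[s]);
        scaleKernel<<<(half + 255) / 256, 256, 0, stream[s]>>>(d_a + offset, half);
        cudaMemcpyAsync(h_a + offset, d_a + offset, halfBytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaDeviceSynchronize();  // wait for all queued work to finish
    printf("h_a[0] = %f, h_a[N-1] = %f\n", h_a[0], h_a[N - 1]);

    for (int s = 0; s < 2; s++) cudaStreamDestroy(stream[s]);
    cudaFree(d_a);
    cudaFreeHost(h_a);
    return 0;
}
```

The same queue-per-stream structure extends to multiple GPUs: calling cudaSetDevice() before creating a stream binds that stream to the selected device, so each GPU gets its own independent queue of transfers and kernels.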
