6.5 COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
CUDA is a software architecture that enables the graphics processing unit (GPU) to be programmed using high-level programming languages such as C and C++. The programmer writes a C program with CUDA extensions, very much like the Cilk++ and OpenMP extensions discussed previously. CUDA requires an NVIDIA GPU, such as a GeForce 8 series, Tesla, Quadro, or Fermi-based card. Source files must be compiled with the CUDA C compiler, nvcc.
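A minimal sketch of such a source file is shown below; the file name, array sizes, and kernel name are illustrative rather than prescribed by CUDA. The __global__ qualifier marks a kernel that runs on the GPU, the <<<...>>> syntax launches it from host code, and the whole file is compiled with nvcc:

// vector_add.cu -- compile with: nvcc -o vector_add vector_add.cu
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate and initialize host (CPU) arrays.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device (GPU) memory and copy the inputs over.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel: n threads organized as blocks of 256.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy the result back to the host.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Note how the serial work (allocation, copies, kernel launch) runs on the CPU, while the data-parallel loop body runs on the GPU, one array element per thread.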
A CUDA program uses kernels to operate on data streams, for example, vectors of floating-point numbers or groups of frame pixels in video data processing. A kernel is executed on the GPU by many parallel threads. CUDA provides three key mechanisms for parallelizing programs: thread group hierarchy, shared memories, and barrier synchronization. These mechanisms provide fine-grained parallelism nested within coarse-grained task parallelism.
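The sketch below (kernel and variable names are illustrative) shows the three mechanisms working together in a partial-sum kernel: threads are grouped into blocks (the thread group hierarchy), each block stages its slice of the input in on-chip shared memory, and __syncthreads() provides the barrier that keeps the threads of a block in step:

// Each block reduces its slice of x into partial[blockIdx.x].
// Assumes the kernel is launched with 256 threads per block.
__global__ void blockSum(const float *x, float *partial, int n)
{
    __shared__ float cache[256];              // shared memory, one copy per block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;  // position in the thread hierarchy

    cache[tid] = (i < n) ? x[i] : 0.0f;       // each thread stages one element
    __syncthreads();                          // barrier: wait until all loads finish

    // Tree reduction inside the block; half the active threads add at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                      // barrier after every step
    }

    if (tid == 0)                             // thread 0 writes the block's result
        partial[blockIdx.x] = cache[0];
}

The host would launch this kernel as blockSum<<<numBlocks, 256>>>(d_x, d_partial, n) and then combine the partial sums. The fine-grained parallelism of the threads within each block is thus nested inside the coarse-grained parallelism across the independent blocks.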
The following definitions describe the terms used in CUDA parlance:
The host, or central processing unit (CPU), is the computer that interfaces with the user and controls the device used to execute the data-parallel, compute-intensive portion of an application. The host is responsible for executing the serial portion of the application.
The GPU is a general-purpose graphics processing unit capable of implementing parallel algorithms.
The device is the GPU connected to the host computer, used to execute the data-parallel, compute-intensive portion of an application.