Parallel scan and reduction kernel basics

Let's look at a basic function in PyCUDA that reproduces the functionality of reduce: InclusiveScanKernel. (You can find the code under the filename.) Let's run a basic example that computes the cumulative sum of a small list of numbers on the GPU:

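    import numpy as np
    import pycuda.autoinit  # initializes CUDA and creates a context
    from pycuda import gpuarray
    from pycuda.scan import InclusiveScanKernel

    seq = np.array([1, 2, 3, 4], dtype=np.int32)
    seq_gpu = gpuarray.to_gpu(seq)  # copy the input to the GPU
    sum_gpu = InclusiveScanKernel(np.int32, "a+b")
    print(sum_gpu(seq_gpu).get())  # scan on the GPU, copy the result back
    print(np.cumsum(seq))  # NumPy reference for comparison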

We construct our kernel by first specifying the input/output type (here, NumPy int32) and then the operation as a string, "a+b". Here, InclusiveScanKernel sets up elements named a and b in the GPU space ...
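To see what an inclusive scan with the operator "a+b" computes without needing a GPU, here is a minimal CPU-only sketch. The helper name inclusive_scan is hypothetical and is not part of PyCUDA; it only illustrates the semantics (each output element combines all input elements up to and including that position) and says nothing about how the GPU kernel is implemented.

```python
from functools import reduce
from itertools import accumulate
import operator

import numpy as np


def inclusive_scan(seq, op):
    """CPU reference (hypothetical helper): out[i] = seq[0] op ... op seq[i]."""
    return np.array(list(accumulate(seq.tolist(), op)), dtype=seq.dtype)


seq = np.array([1, 2, 3, 4], dtype=np.int32)

# operator.add plays the role of the "a+b" string passed to the kernel.
scanned = inclusive_scan(seq, operator.add)

# A plain reduction with the same operator yields only the final value.
total = reduce(operator.add, seq.tolist())

print(scanned)      # running sums, same values as np.cumsum(seq)
print(total)        # equals the last element of the scan
```

The last element of the scan equals the result of reducing the whole array with the same operator, which is the sense in which a scan covers what reduce does.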

