Questions

  1. Change the random vector in simple_scalar_multiply_kernel.py so that it has a length of 10,000, and modify the index i in the kernel definition so that the kernel can be launched over multiple blocks in a grid. Then check whether you can launch this kernel over 10,000 threads by setting the block and grid parameters to something like block=(100,1,1) and grid=(100,1,1).
  2. In the previous question, we launched a kernel that makes use of 10,000 threads simultaneously; as of 2018, there is no NVIDIA GPU with more than 5,000 cores. Why does this still work and give the expected results?
  3. The naive parallel prefix algorithm has time complexity O(log n) given that we have n or more processors for a dataset of size n. Suppose that we use a ...
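
As a hint for question 1, the usual way to index over a one-dimensional grid in CUDA is `int i = blockIdx.x * blockDim.x + threadIdx.x;`. The sketch below is plain Python with no GPU involved. It only emulates that arithmetic on the host to show that block=(100,1,1) and grid=(100,1,1) yield each index from 0 to 9,999 exactly once. The function name `global_thread_indices` is hypothetical and is not part of PyCUDA or of the book's code.

```python
def global_thread_indices(block_dim_x, grid_dim_x):
    """Emulate i = blockIdx.x * blockDim.x + threadIdx.x for a 1D launch.

    Hypothetical host-side helper for illustration only; on the GPU each
    thread computes its own i, in no guaranteed order.
    """
    indices = []
    for block_idx_x in range(grid_dim_x):        # one pass per block in the grid
        for thread_idx_x in range(block_dim_x):  # one pass per thread in the block
            indices.append(block_idx_x * block_dim_x + thread_idx_x)
    return indices


# block=(100,1,1), grid=(100,1,1) as suggested in question 1
indices = global_thread_indices(block_dim_x=100, grid_dim_x=100)
print(len(indices), indices[0], indices[-1])  # prints: 10000 0 9999
```

If the vector length were not an exact multiple of the block size, a real kernel would also need a bounds check such as `if (i < n)` before reading or writing element i.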
