OpenCL Parallel Programming Development Cookbook by Raymond Tay

Reducing global memory via shared memory data prefetching in matrix multiplication

Our revised matrix multiplication algorithm is an improvement, but it isn't quite there yet. It still makes many references to matrix B in global memory, and we can reduce this traffic by prefetching the data. You may not have noticed it, but this is the concept of prefetching: keeping the cache "hot", an idea borrowed from the CPU. A CPU typically has sizeable data and instruction caches (small, fast on-chip memories), so that the processor can take advantage of the spatial and temporal locality of the data. How does this concept map onto other OpenCL devices, for example, the GPU?
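To make the idea concrete before turning to the GPU, here is a minimal sketch in plain C that emulates the access pattern on the host. It is not the book's kernel: the names `matmul_naive`, `matmul_prefetch`, `TILE`, and `b_tile` are hypothetical, the matrices are assumed to be square, row-major, and a multiple of `TILE` in size, and the small array `b_tile` merely stands in for the local (shared) memory an OpenCL work-group would use.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* hypothetical tile width; stands in for a work-group dimension */

/* Reference: C = A * B for n x n row-major matrices, reading B directly
 * from its backing array for every multiply-add. */
void matmul_naive(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* Tiled variant: each TILE x TILE block of B is copied once into a small
 * buffer and then reused for TILE rows of A, so every element of B is
 * fetched from the large array only once per tile instead of TILE times. */
void matmul_prefetch(const float *A, const float *B, float *C, size_t n)
{
    assert(n % TILE == 0);      /* assumption: n is a multiple of TILE */
    float b_tile[TILE][TILE];   /* plays the role of OpenCL local memory */

    for (size_t i = 0; i < n * n; ++i)
        C[i] = 0.0f;

    for (size_t i0 = 0; i0 < n; i0 += TILE) {
        for (size_t j0 = 0; j0 < n; j0 += TILE) {
            for (size_t k0 = 0; k0 < n; k0 += TILE) {
                /* Prefetch: stage one tile of B into the fast buffer. */
                for (size_t k = 0; k < TILE; ++k)
                    for (size_t j = 0; j < TILE; ++j)
                        b_tile[k][j] = B[(k0 + k) * n + (j0 + j)];

                /* In a real kernel, a work-group barrier would go here so
                 * that all work-items see the fully loaded tile. */

                /* Compute: accumulate partial products from the staged tile. */
                for (size_t i = 0; i < TILE; ++i) {
                    for (size_t j = 0; j < TILE; ++j) {
                        float sum = 0.0f;
                        for (size_t k = 0; k < TILE; ++k)
                            sum += A[(i0 + i) * n + (k0 + k)] * b_tile[k][j];
                        C[(i0 + i) * n + (j0 + j)] += sum;
                    }
                }
            }
        }
    }
}
```

Calling both functions on the same 8 x 8 inputs should give the same product; the difference lies only in how often B's backing array is read. On a GPU the staging loop would be spread across the work-items of a work-group, and the buffer would be declared in local memory rather than on the stack.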

Every GPU that is an OpenCL compliant ...
