As discussed earlier, a kernel is similar to a C function, and each work item executes this function on the device. Here we discuss different optimization strategies and the kernel implementations based on them. In this chapter we use a matrix multiplication example to illustrate those strategies, along with some of their advantages and disadvantages. Keep in mind that not every technique applies to every problem and that, unfortunately, the techniques sometimes conflict with one another.
For the sake of simplicity we take two square matrices, A and B, as input (each 1024 by 1024) and multiply them to obtain a square matrix C of the same size (1024 by 1024). ...
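As a reference point for the optimizations, the unoptimized computation can be sketched in plain C. This is only an illustration under stated assumptions, not the kernel developed in this chapter: the names `matmul_work_item` and `matmul_naive` are hypothetical, the matrices are assumed to be stored row-major in flat `float` arrays, and the two nested loops on the host side stand in for the grid of work items that a device would run in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical body of one "work item": computes the single element
 * C[row][col] as the dot product of row `row` of A and column `col` of B.
 * All matrices are n x n, row-major, stored in flat arrays (an assumption). */
void matmul_work_item(const float *A, const float *B, float *C,
                      size_t n, size_t row, size_t col)
{
    assert(row < n && col < n);

    float sum = 0.0f;
    for (size_t k = 0; k < n; ++k) {
        sum += A[row * n + k] * B[k * n + col];
    }
    C[row * n + col] = sum;
}

/* Sequential stand-in for launching an n x n grid of work items.
 * On a device, each (row, col) pair would be handled by its own work item. */
void matmul_naive(const float *A, const float *B, float *C, size_t n)
{
    for (size_t row = 0; row < n; ++row) {
        for (size_t col = 0; col < n; ++col) {
            matmul_work_item(A, B, C, n, row, col);
        }
    }
}
```

For the matrices used in this chapter, the call would be `matmul_naive(A, B, C, 1024)`, with each array holding 1024 * 1024 elements. Every output element reads a full row of one input and a full column of the other, which is the memory traffic that optimization strategies typically try to reduce.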