Chapter 7. Tuning Instruction-Level Primitives

What's in this chapter?

  • Learning about multiple classes of CUDA instructions and their impact on application behavior
  • Observing the relative accuracy of single- and double-precision floating-point values
  • Experimenting with the performance and accuracy of standard and intrinsic functions (see the brief sketch after this list)
  • Uncovering undefined behavior from unsafe memory accesses
  • Understanding the significance of arithmetic instructions and the consequences of using them improperly
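
To make the standard-versus-intrinsic comparison concrete before the chapter develops it in detail, the following minimal sketch launches a single-thread kernel that evaluates the same power with the standard powf function and the __powf intrinsic, so the two results can be printed side by side. The kernel name, the test value, and the single-thread launch are illustrative assumptions, not code from this chapter.

```
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes x^2 with the standard function and with the
// CUDA intrinsic so the two results can be compared on the host.
__global__ void powerCompareKernel(float x, float *stdOut, float *intrOut)
{
    *stdOut  = powf(x, 2.0f);    // standard function: more instructions, higher accuracy
    *intrOut = __powf(x, 2.0f);  // intrinsic: fewer instructions, lower accuracy
}

int main(void)
{
    float *d_std, *d_intr;
    cudaMalloc(&d_std,  sizeof(float));
    cudaMalloc(&d_intr, sizeof(float));

    // illustrative test value; a single thread is enough for this comparison
    powerCompareKernel<<<1, 1>>>(0.123456789f, d_std, d_intr);

    float h_std, h_intr;
    cudaMemcpy(&h_std,  d_std,  sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_intr, d_intr, sizeof(float), cudaMemcpyDeviceToHost);

    printf("powf   : %.20f\n", h_std);
    printf("__powf : %.20f\n", h_intr);

    cudaFree(d_std);
    cudaFree(d_intr);
    return 0;
}
```

Note that compiling with nvcc's --use_fast_math flag replaces standard calls such as powf with their intrinsic equivalents throughout a compilation unit, trading accuracy for speed globally rather than call by call.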

When deciding whether to use CUDA for a particular application, the primary motivator is usually the computational throughput of GPUs. As you learned in earlier chapters, achieving high throughput on GPUs requires understanding which factors limit peak performance. You have already learned about CUDA tools that can help you determine whether your workload is sensitive to latency, bandwidth, or arithmetic operations. Based on this understanding, you can generally classify applications into two categories:

  • I/O-bound
  • Compute-bound

In this chapter, you will focus on tuning compute-bound workloads. The computational throughput of a processor can be measured by the number of operations it performs in a period of time.
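
As a rough illustration of that measurement, the following hedged sketch times a synthetic compute-bound kernel with CUDA events and reports its sustained throughput in GFLOPS. The kernel, its iteration count, and the launch configuration are illustrative choices, not part of the chapter's examples.

```
#include <cstdio>
#include <cuda_runtime.h>

// Number of fused multiply-add operations each thread performs
// (an illustrative choice, large enough to keep the kernel compute-bound).
#define ITERS 4096

__global__ void fmaThroughputKernel(float *out, float a, float b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float x = a;

    for (int i = 0; i < ITERS; ++i)
    {
        x = x * b + a;   // one FMA instruction = 2 floating-point operations
    }

    out[idx] = x;        // store the result so the loop is not optimized away
}

int main(void)
{
    const int threadsPerBlock = 256;
    const int blocks = 1024;
    const int n = threadsPerBlock * blocks;

    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    fmaThroughputKernel<<<blocks, threadsPerBlock>>>(d_out, 1.0001f, 0.9999f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, start, stop);

    // throughput = operations performed / elapsed time
    double ops    = 2.0 * ITERS * (double)n;       // 2 FLOPs per FMA
    double gflops = ops / (elapsedMs * 1.0e6);     // convert ms to s and FLOPs to GFLOPs

    printf("elapsed: %.3f ms, throughput: %.1f GFLOPS\n", elapsedMs, gflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```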
