Chapter 23

Addressing hardware reliability challenges in general-purpose GPUs

J. Tan; X. Fu, University of Houston, Houston, TX, United States

Abstract

With their increasing computing power and improved programmability, general-purpose GPUs (GPGPUs) have emerged as a highly attractive platform for a wide range of HPC applications that exhibit strong data-level or thread-level parallelism. Error detection and fault tolerance on GPUs have received little attention because graphics applications effectively mask errors and have relaxed requirements for computational correctness. HPC applications, however, have rigorous requirements for execution correctness, which makes reliability a growing concern in GPGPU architecture design. This urges the need ...

Get Advances in GPU Research and Practice now with the O’Reilly learning platform.
