Scalable Computing and Communications: Theory and Practice by Lizhe Wang, Albert Y. Zomaya, Samee U. Khan

33

–––––––––––––––––––––––

Fault-Tolerance Techniques for Scalable Computing

Pavan Balaji, Darius Buntinas, and Dries Kimpe

33.1    INTRODUCTION AND TRENDS IN LARGE-SCALE COMPUTING SYSTEMS

The largest systems in the world already use close to a million cores. Upcoming systems are expected to use tens to hundreds of millions of cores, and exascale systems may reach a billion cores, so the number of hardware components in a single system will be staggering. Unfortunately, the reliability of each hardware component is not improving at the same rate as the number of components per system is growing. Consequently, faults are becoming increasingly common. For the largest supercomputers that will be available over the next decade, faults will be the norm rather than the exception.
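
To make the scaling argument concrete, the following is a minimal sketch under simplifying assumptions that are not taken from the chapter: component failures are independent, every component has the same constant failure rate, and the system fails as soon as any one component fails. Under those assumptions the system's mean time between failures (MTBF) is the component MTBF divided by the number of components. The function name `system_mtbf_hours` and the 10-year component MTBF are hypothetical, chosen only for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(component_mtbf_hours: float, num_components: int) -> float:
    """Return the MTBF of a system that fails when any single component fails.

    Assumes independent components with identical, constant failure rates
    (exponentially distributed lifetimes). Failure rates then add up, so the
    system MTBF is the component MTBF divided by the component count.
    """
    if num_components <= 0:
        raise ValueError("num_components must be positive")
    return component_mtbf_hours / num_components


if __name__ == "__main__":
    # Hypothetical figure for illustration only: each component fails
    # on average once every 10 years.
    component_mtbf = 10 * HOURS_PER_YEAR
    for n in (1_000, 1_000_000, 1_000_000_000):
        hours = system_mtbf_hours(component_mtbf, n)
        print(f"{n:>13,} components: system MTBF = {hours:.6g} hours")
```

With these illustrative numbers, a system of one million components would see a failure roughly every five minutes, even though each individual component lasts ten years on average. Real systems differ (failures are correlated and components are heterogeneous), but the inverse relationship between component count and system MTBF is the trend the paragraph above describes.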

Faults are common even today: memory bit flips and network packet drops, for example, occur regularly on the largest systems. However, these faults are typically hidden from the user because the hardware corrects them automatically, using techniques such as error correction codes (ECCs) and hardware redundancy. While convenient, such techniques can be expensive in terms of cost, performance, and power usage. Consequently, researchers are looking at various approaches to alleviate this issue.

Broadly speaking, modern fault-resilience techniques can be classified into three categories:

  1. Hardware Resilience. This category includes techniques ...