Pavan Balaji, Darius Buntinas, and Dries Kimpe
The largest systems in the world already use close to a million cores. With upcoming systems expected to use tens to hundreds of millions of cores, and exascale systems going up to a billion cores, the number of hardware components in these systems will be staggering. Unfortunately, the reliability of each hardware component is not improving at the same rate as the number of components in the system is growing. Consequently, faults are becoming increasingly common. For the largest supercomputers that will be available over the next decade, faults will be the norm rather than the exception.
Faults are common even today. Memory bit flips and network packet drops, for example, occur regularly on the largest systems. However, these faults are typically hidden from the user because the hardware automatically corrects them with techniques such as error-correcting codes (ECCs) and hardware redundancy. While convenient, such techniques can be expensive in terms of cost, performance, and power usage. Consequently, researchers are looking at various approaches to alleviate this issue.
Broadly speaking, modern fault-resilience techniques can be classified into three categories: