Chapter 9. Special Techniques for System Recoverability

Automatic Fault Recognition

In this chapter, we will discuss several techniques for designing more recoverable systems. The first is to detect a fault as quickly as possible, ideally before it affects the entire computing system. If you can localize the fault to a small portion of your system, you have a better chance of recovering from it quickly. There are many ways to detect and localize faults, and many of them depend on the specific technologies and platforms you have in place. We will discuss a few examples as illustrations.

Parity Checking Memory

Most computer systems use some form of memory or storage parity checking to detect data integrity problems. With parity ...
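
As a rough illustration of the general idea behind parity checking, the sketch below computes an even-parity bit for one byte and uses it to detect a single flipped bit. It is a minimal sketch under stated assumptions, not a description of any particular memory hardware: the function names `parity_bit` and `is_intact` are hypothetical, and real memory controllers do this in circuitry, not in software.

```python
def parity_bit(byte: int) -> int:
    """Return the even-parity bit for an 8-bit value.

    The bit is 1 when the byte holds an odd number of 1 bits, so that
    the byte plus its parity bit always contain an even number of 1s.
    (Hypothetical helper, for illustration only.)
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in the range 0..255")
    return bin(byte).count("1") % 2


def is_intact(byte: int, stored_parity: int) -> bool:
    """Recompute parity on read and compare it with the stored bit."""
    return parity_bit(byte) == stored_parity


# "Write": keep the parity bit alongside the data.
original = 0b10110010
stored_parity = parity_bit(original)

# "Read" with one bit flipped: the parity no longer matches.
one_bit_error = original ^ 0b00000001
# "Read" with two bits flipped: the parity matches again.
two_bit_error = original ^ 0b00000011

print(is_intact(original, stored_parity))       # unchanged data passes
print(is_intact(one_bit_error, stored_parity))  # single-bit error is caught
print(is_intact(two_bit_error, stored_parity))  # double-bit error slips through
```

A single parity bit only reveals that an odd number of bits changed. It cannot say which bit is wrong, and an even number of flipped bits goes unnoticed, so simple parity is a detection mechanism and not a correction mechanism.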
