Architecture Design for Soft Errors

Book Description

This book provides a comprehensive description of the architectural techniques to tackle the soft error problem. It covers the new methodologies for quantitative analysis of soft errors as well as novel, cost-effective architectural techniques to mitigate them. To provide readers with a better grasp of the broader problem definition and solution space, this book also delves into the physics of soft errors and reviews current circuit and software mitigation techniques.

TABLE OF CONTENTS
Chapter 1: Introduction
Chapter 2: Device- and Circuit-Level Modeling, Measurement, and Mitigation
Chapter 3: Architectural Vulnerability Analysis
Chapter 4: Advanced Architectural Vulnerability Analysis
Chapter 5: Error Coding Techniques
Chapter 6: Fault Detection via Redundant Execution
Chapter 7: Hardware Error Recovery
Chapter 8: Software Detection and Recovery

* Helps readers build fault tolerance into the billions of microchips produced each year, all of which are subject to soft errors
* Shows readers how to quantify their soft error reliability
* Provides state-of-the-art techniques to protect against soft errors

Table of Contents

  1. Cover
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Dedication
  6. Foreword
  7. Preface
  8. Chapter 1: Introduction
    1. 1.1 Overview
    2. 1.2 Faults
    3. 1.3 Errors
    4. 1.4 Metrics
    5. 1.5 Dependability Models
    6. 1.6 Permanent Faults in Complementary Metal Oxide Semiconductor Technology
    7. 1.7 Radiation-Induced Transient Faults in CMOS Transistors
    8. 1.8 Architectural Fault Models for Alpha Particle and Neutron Strikes
    9. 1.9 Silent Data Corruption and Detected Unrecoverable Error
    10. 1.10 Soft Error Scaling Trends
    11. 1.11 Summary
    12. 1.12 Historical Anecdote
  9. Chapter 2: Device- and Circuit-Level Modeling, Measurement, and Mitigation
    1. 2.1 Overview
    2. 2.2 Modeling Circuit-Level SERs
    3. 2.3 Measurement
    4. 2.4 Mitigation Techniques
    5. 2.5 Summary
    6. 2.6 Historical Anecdote
  10. Chapter 3: Architectural Vulnerability Analysis
    1. 3.1 Overview
    2. 3.2 AVF Basics
    3. 3.3 Does a Bit Matter?
    4. 3.4 SDC and DUE Equations
    5. 3.5 ACE Principles
    6. 3.6 Microarchitectural Un-ACE Bits
    7. 3.7 Architectural Un-ACE Bits
    8. 3.8 AVF Equations for a Hardware Structure
    9. 3.9 Computing AVF with Little’s Law
    10. 3.10 Computing AVF with a Performance Model
    11. 3.11 ACE Analysis Using the Point-of-Strike Fault Model
    12. 3.12 ACE Analysis Using the Propagated Fault Model
    13. 3.13 Summary
    14. 3.14 Historical Anecdote
  11. Chapter 4: Advanced Architectural Vulnerability Analysis
    1. 4.1 Overview
    2. 4.2 Lifetime Analysis of RAM Arrays
    3. 4.3 Lifetime Analysis of CAM Arrays
    4. 4.4 Effect of Cooldown in Lifetime Analysis
    5. 4.5 AVF Results for Cache, Data Translation Buffer, and Store Buffer
    6. 4.6 Computing AVFs Using SFI into an RTL Model
    7. 4.7 Case Study of SFI
    8. 4.8 Summary
    9. 4.9 Historical Anecdote
  12. Chapter 5: Error Coding Techniques
    1. 5.1 Overview
    2. 5.2 Fault Detection and ECC for State Bits
    3. 5.3 Error Detection Codes for Execution Units
    4. 5.4 Implementation Overhead of Error Detection and Correction Codes
    5. 5.5 Scrubbing Analysis
    6. 5.6 Detecting False Errors
    7. 5.7 Hardware Assertions
    8. 5.8 Machine Check Architecture
    9. 5.9 Summary
    10. 5.10 Historical Anecdote
  13. Chapter 6: Fault Detection via Redundant Execution
    1. 6.1 Overview
    2. 6.2 Sphere of Replication
    3. 6.3 Fault Detection via Cycle-by-Cycle Lockstepping
    4. 6.4 Lockstepping in the Hewlett-Packard NonStop Himalaya Architecture
    5. 6.5 Lockstepping in the IBM Z-series Processors
    6. 6.6 Fault Detection via RMT
    7. 6.7 RMT in the Marathon Endurance Server
    8. 6.8 RMT in the Hewlett-Packard NonStop® Advanced Architecture
    9. 6.9 RMT Within a Single-Processor Core
    10. 6.10 RMT in a Multicore Architecture
    11. 6.11 DIVA: RMT Using Specialized Checker Processor
    12. 6.12 RMT Enhancements
    13. 6.13 Summary
    14. 6.14 Historical Anecdote
  14. Chapter 7: Hardware Error Recovery
    1. 7.1 Overview
    2. 7.2 Classification of Hardware Error Recovery Schemes
    3. 7.3 Forward Error Recovery
    4. 7.4 Backward Error Recovery with Fault Detection before Register Commit
    5. 7.5 Backward Error Recovery with Fault Detection before Memory Commit
    6. 7.6 Backward Error Recovery with Fault Detection before I/O Commit
    7. 7.7 Backward Error Recovery with Fault Detection after I/O Commit
    8. 7.8 Summary
    9. 7.9 Historical Anecdote
  15. Chapter 8: Software Detection and Recovery
    1. 8.1 Overview
    2. 8.2 Fault Detection Using SIS
    3. 8.3 Fault Detection Using Software RMT
    4. 8.4 Fault Detection Using Hybrid RMT
    5. 8.5 Fault Detection Using RVMs
    6. 8.6 Application-Level Recovery
    7. 8.7 OS-Level and VMM-Level Recoveries
    8. 8.8 Summary
  16. Index