Architecture Design for Soft Errors

Book description

Architecture Design for Soft Errors provides a comprehensive description of the architectural techniques used to tackle the soft error problem. It covers new methodologies for quantitative analysis of soft errors as well as novel, cost-effective architectural techniques to mitigate them.

To give readers a better grasp of the broader problem definition and solution space, this book also delves into the physics of soft errors and reviews current circuit and software mitigation techniques. The book can be read or used in a course in several ways: as a complete course on architecture design for soft errors covering the entire book, as a short course on architecture design for soft errors, or as a reference book on classical fault-tolerant machines.

This book is recommended for practitioners in the semiconductor industry, for researchers and developers in computer architecture, for advanced graduate seminar courses on soft errors, and as a reference book for undergraduate courses in computer architecture.

  • Helps readers build fault tolerance into the billions of microchips produced each year, all of which are subject to soft errors
  • Shows readers how to quantify their soft error reliability
  • Provides state-of-the-art techniques to protect against soft errors

Table of contents

  1. Cover
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Dedication
  6. Foreword
  7. Preface
  8. Chapter 1: Introduction
    1. 1.1 Overview
    2. 1.2 Faults
    3. 1.3 Errors
    4. 1.4 Metrics
    5. 1.5 Dependability Models
    6. 1.6 Permanent Faults in Complementary Metal Oxide Semiconductor Technology
    7. 1.7 Radiation-Induced Transient Faults in CMOS Transistors
    8. 1.8 Architectural Fault Models for Alpha Particle and Neutron Strikes
    9. 1.9 Silent Data Corruption and Detected Unrecoverable Error
    10. 1.10 Soft Error Scaling Trends
    11. 1.11 Summary
    12. 1.12 Historical Anecdote
  9. Chapter 2: Device- and Circuit-Level Modeling, Measurement, and Mitigation
    1. 2.1 Overview
    2. 2.2 Modeling Circuit-Level SERs
    3. 2.3 Measurement
    4. 2.4 Mitigation Techniques
    5. 2.5 Summary
    6. 2.6 Historical Anecdote
  10. Chapter 3: Architectural Vulnerability Analysis
    1. 3.1 Overview
    2. 3.2 AVF Basics
    3. 3.3 Does a Bit Matter?
    4. 3.4 SDC and DUE Equations
    5. 3.5 ACE Principles
    6. 3.6 Microarchitectural Un-ACE Bits
    7. 3.7 Architectural Un-ACE Bits
    8. 3.8 AVF Equations for a Hardware Structure
    9. 3.9 Computing AVF with Little’s Law
    10. 3.10 Computing AVF with a Performance Model
    11. 3.11 ACE Analysis Using the Point-of-Strike Fault Model
    12. 3.12 ACE Analysis Using the Propagated Fault Model
    13. 3.13 Summary
    14. 3.14 Historical Anecdote
  11. Chapter 4: Advanced Architectural Vulnerability Analysis
    1. 4.1 Overview
    2. 4.2 Lifetime Analysis of RAM Arrays
    3. 4.3 Lifetime Analysis of CAM Arrays
    4. 4.4 Effect of Cooldown in Lifetime Analysis
    5. 4.5 AVF Results for Cache, Data Translation Buffer, and Store Buffer
    6. 4.6 Computing AVFs Using SFI into an RTL Model
    7. 4.7 Case Study of SFI
    8. 4.8 Summary
    9. 4.9 Historical Anecdote
  12. Chapter 5: Error Coding Techniques
    1. 5.1 Overview
    2. 5.2 Fault Detection and ECC for State Bits
    3. 5.3 Error Detection Codes for Execution Units
    4. 5.4 Implementation Overhead of Error Detection and Correction Codes
    5. 5.5 Scrubbing Analysis
    6. 5.6 Detecting False Errors
    7. 5.7 Hardware Assertions
    8. 5.8 Machine Check Architecture
    9. 5.9 Summary
    10. 5.10 Historical Anecdote
  13. Chapter 6: Fault Detection via Redundant Execution
    1. 6.1 Overview
    2. 6.2 Sphere of Replication
    3. 6.3 Fault Detection via Cycle-by-Cycle Lockstepping
    4. 6.4 Lockstepping in the Hewlett-Packard NonStop Himalaya Architecture
    5. 6.5 Lockstepping in the IBM Z-series Processors
    6. 6.6 Fault Detection via RMT
    7. 6.7 RMT in the Marathon Endurance Server
    8. 6.8 RMT in the Hewlett-Packard NonStop® Advanced Architecture
    9. 6.9 RMT Within a Single-Processor Core
    10. 6.10 RMT in a Multicore Architecture
    11. 6.11 DIVA: RMT Using Specialized Checker Processor
    12. 6.12 RMT Enhancements
    13. 6.13 Summary
    14. 6.14 Historical Anecdote
  14. Chapter 7: Hardware Error Recovery
    1. 7.1 Overview
    2. 7.2 Classification of Hardware Error Recovery Schemes
    3. 7.3 Forward Error Recovery
    4. 7.4 Backward Error Recovery with Fault Detection before Register Commit
    5. 7.5 Backward Error Recovery with Fault Detection before Memory Commit
    6. 7.6 Backward Error Recovery with Fault Detection before I/O Commit
    7. 7.7 Backward Error Recovery with Fault Detection after I/O Commit
    8. 7.8 Summary
    9. 7.9 Historical Anecdote
  15. Chapter 8: Software Detection and Recovery
    1. 8.1 Overview
    2. 8.2 Fault Detection Using SIS
    3. 8.3 Fault Detection Using Software RMT
    4. 8.4 Fault Detection Using Hybrid RMT
    5. 8.5 Fault Detection Using RVMs
    6. 8.6 Application-Level Recovery
    7. 8.7 OS-Level and VMM-Level Recoveries
    8. 8.8 Summary
  16. Index

Product information

  • Title: Architecture Design for Soft Errors
  • Author(s): Shubu Mukherjee
  • Release date: August 2011
  • Publisher(s): Morgan Kaufmann
  • ISBN: 9780080558325