Patterns for Fault Tolerant Software

Book Description

Software patterns have revolutionized the way developers and architects think about how software is designed, built and documented.

This new title in Wiley's prestigious Series in Software Design Patterns presents proven patterns and techniques for building fault tolerant software. It is a key reference for experts seeking to select a technique appropriate for a given system.

Readers are guided from concepts and terminology, through common principles and methods, to advanced techniques and practices in the development of software systems. References provide access points to the key literature, including descriptions of exemplar applications of each technique.

The book is organized as a collection of software techniques, so specific techniques can be found easily, each with sufficient detail to allow appropriate choices for the system being designed.

Table of Contents

  1. Copyright
  2. Preface
    1. Who this Book is For
    2. How to Use this Book
    3. Patterns
      1. Pattern History
      2. What is a Pattern?
      3. Reading a Pattern
    4. A Pattern Language for Fault Tolerance
    5. A Word about Examples
      1. Telecommunications Systems
      2. Space Programs
  3. Acknowledgements
    1. Pattern Origins and Earlier Versions
  4. Introduction
    1. An Imperfect World
  5. 1. Introduction to Fault Tolerance
    1. 1.1. Fault -> Error -> Failure
      1. 1.1.1. Examples of Fault -> Error -> Failure
    2. 1.2. Failure Perception [Lap91][Kop97]
    3. 1.3. Single Faults
    4. 1.4. Examples of How Vocabulary Makes a Difference
    5. 1.5. Coverage
    6. 1.6. Reliability
      1. 1.6.1. Reliability Examples
        1. 1.6.1.1. Mars Landers
        2. 1.6.1.2. Airplane Navigation System
        3. 1.6.1.3. Measuring Reliability
    7. 1.7. Availability
      1. 1.7.1. Availability Examples
    8. 1.8. Dependability
    9. 1.9. Hardware Reliability
    10. 1.10. Reliability Engineering and Analysis
    11. 1.11. Performance
  6. 2. Fault Tolerant Mindset
    1. 2.1. Fault Tolerant Mindset
    2. 2.2. Design Tradeoffs
    3. 2.3. Quality v. Fault Tolerance
    4. 2.4. Keep It Simple
    5. 2.5. Incremental Additions of Reliability
    6. 2.6. Defensive Programming Techniques
      1. 2.6.1. Faults in Fault Tolerance Code
      2. 2.6.2. Memory Corruption
      3. 2.6.3. Data Structure Design
      4. 2.6.4. Design for Maintainability
      5. 2.6.5. Coding Standards
      6. 2.6.6. Redundancy
      7. 2.6.7. Static Analysis Tools
      8. 2.6.8. N-Version Programming
      9. 2.6.9. Redundant Disks [PGK88][MS00]
    7. 2.7. The Role of Verification
    8. 2.8. Fault Insertion Testing
    9. 2.9. Fault Tolerant Design Methodology
  7. 3. Introduction to the Patterns
    1. 3.1. Shared Context for These Patterns
      1. 3.1.1. Real-Time
      2. 3.1.2. High Reliability
        1. 3.1.2.1. High Availability
        2. 3.1.2.2. Failure Rate Requirements
      3. 3.1.3. State or Stateless
      4. 3.1.4. External Observers
      5. 3.1.5. Integrated Fault Tolerance
      6. 3.1.6. Fault Tolerance is Not Free
      7. 3.1.7. Long Lived Systems
    2. 3.2. Terminology
  8. 4. Architectural Patterns
    1. 4.1. 1. Units of Mitigation
    2. 4.2. 2. Correcting Audits
    3. 4.3. 3. Redundancy
    4. 4.4. 4. Recovery Blocks
    5. 4.5. 5. Minimize Human Intervention
    6. 4.6. 6. Maximize Human Participation
    7. 4.7. 7. Maintenance Interface
    8. 4.8. 8. Someone in Charge
    9. 4.9. 9. Escalation
    10. 4.10. 10. Fault Observer
    11. 4.11. 11. Software Update
  9. 5. Detection Patterns
    1. 5.1. 12. Fault Correlation
    2. 5.2. 13. Error Containment Barrier
    3. 5.3. 14. Complete Parameter Checking
    4. 5.4. 15. System Monitor
    5. 5.5. 16. Heartbeat
    6. 5.6. 17. Acknowledgement
    7. 5.7. 18. Watchdog
    8. 5.8. 19. Realistic Threshold
    9. 5.9. 20. Existing Metrics
    10. 5.10. 21. Voting
    11. 5.11. 22. Routine Maintenance
    12. 5.12. 23. Routine Exercises
    13. 5.13. 24. Routine Audits
    14. 5.14. 25. Checksum
    15. 5.15. 26. Riding Over Transients
    16. 5.16. 27. Leaky Bucket Counter
  10. 6. Error Recovery Patterns
    1. 6.1. 28. Quarantine
    2. 6.2. 29. Concentrated Recovery
    3. 6.3. 30. Error Handler
    4. 6.4. 31. Restart
    5. 6.5. 32. Rollback
    6. 6.6. 33. Roll-Forward
    7. 6.7. 34. Return to Reference Point
    8. 6.8. 35. Limit Retries
    9. 6.9. 36. Failover
    10. 6.10. 37. Checkpoint
    11. 6.11. 38. What to Save
    12. 6.12. 39. Remote Storage
    13. 6.13. 40. Individuals Decide Timing
    14. 6.14. 41. Data Reset
  11. 7. Error Mitigation Patterns
    1. 7.1. 42. Overload Toolboxes
    2. 7.2. 43. Deferrable Work
    3. 7.3. 44. Reassess Overload Decision
    4. 7.4. 45. Equitable Resource Allocation
    5. 7.5. 46. Queue for Resources
    6. 7.6. 47. Expansive Automatic Controls
    7. 7.7. 48. Protective Automatic Controls
    8. 7.8. 49. Shed Load
    9. 7.9. 50. Final Handling
    10. 7.10. 51. Share the Load
    11. 7.11. 52. Shed Work at Periphery
    12. 7.12. 53. Slow it Down
    13. 7.13. 54. Finish Work in Progress
    14. 7.14. 55. Fresh Work Before Stale
    15. 7.15. 56. Marked Data
    16. 7.16. 57. Error Correcting Code
  12. 8. Fault Treatment Patterns
    1. 8.1. 58. Let Sleeping Dogs Lie
    2. 8.2. 59. Reintegration
    3. 8.3. 60. Reproducible Error
    4. 8.4. 61. Small Patches
    5. 8.5. 62. Root Cause Analysis
    6. 8.6. 63. Revise Procedure
  13. Conclusion
    1. A Pattern Language for Fault Tolerant Software
    2. A Presence Server Example
      1. Non-functional Requirements
      2. Implementation Choices
    3. Designing for Fault Tolerance
      1. Step 1: Assess the Things that can go Wrong
      2. Step 2: Decide how to Mitigate the Risks
        1. Applying the Language
      3. Step 3: Identifying Redundancy
      4. Step 4: Architectural Design Decisions
      5. Step 5: Risk Mitigation Capabilities
      6. Step 6: Human Computer Interactions
    4. Software Structure
  14. References and Bibliography
  15. A. Appendices
    1. A.1. Patterns for Fault Tolerant Software Thumbnails
    2. A.2. External Pattern Thumbnails