First Fault Software Problem Solving: A Guide for Engineers, Managers and Users

Book Description

Written by a veteran in mission-critical computer system problem resolution, problem prevention, and system recovery, this book discusses solving problems on their FIRST occurrence while emphasizing software supportability and serviceability.

Who should read this book? Software professional engineers and managers; End-users, system administrators and their managers; Software engineering students.

What will the readers of this book learn? How to optimize use of pre-existing software problem solving features; How to choose the best products to improve first fault problem-solving; How to get the best results when problems occur on outsourced and cloud-placed work; How to choose amongst first-fault tools, second-fault tools, and manual problem solving methods to best advantage for difficult problems; How to be an educated consumer or creator of future problem-solving software.

What is the business value of reading this book? Saving money on problem solving resources (servers, storage, network, software, power, space, cooling, personnel); Keeping customers happier since their issues are resolved sooner; Reducing the durations of computer service outages that affect external clients; Decreasing operational overhead and encouraging sustainable, higher-performing organizations and enterprises through best problem-solving practices.

What else is special about this book? 21 original illustrations to feed the soul and tickle the funny-bone; 21 thought-provoking quotes to feed the intellect and the spirit; An extensive bibliography to aid in clarification and personal growth.

Table of Contents

  1. Copyright
  2. Preface
  3. Dedication
  4. Acknowledgements
  5. About the Author
  6. Foundations
    1. Introduction
      1. Shocked!
      2. Mainframe mindset
      3. When are problems solved?
      4. Airplane Flight Data Recorders
      5. Trace Tables
      6. Mainframe MVS serviceability design principles
      7. Capturing problem data
      8. Who Solves the Problem?
      9. Why is first fault software problem solving important?
      10. Where? What? When? Should problems be solved on their first occurrence?
      11. Meaning of first-fault software problem-solving
      12. What are the benefits of reading this book?
    2. Benefits
      1. Sure, skip this chapter if you never need to …
      2. Solve problems faster
      3. Conserve resources
      4. Save the corporation
      5. Perform a disaster recovery
      6. Counter-arguments for nonbelievers
      7. …Our environment is very stable
      8. …We use clustering and failover, or other fault-tolerant solutions
      9. …We use outsourcing
      10. …We use cloud computing
      11. …The performance impact is too great for us
    3. Do-Overs
      1. "Do-overs?!"
      2. The problems with performing problem-recreation are...
      3. System "YUK"
      4. System TROP
  7. Technologies
    1. Types of faults, tool classifications and other design issues
      1. Service Points
      2. Types of faults
      3. Synchronous errors
      4. Abort Codes and ABEND codes
      5. Stop Codes and Wait-state codes
      6. User storage dumps
      7. System Area Storage Dumps
      8. Crash dumps and standalone dumps
      9. Hypervisor storage dumps
      10. I/O Error
      11. Hardware error
      12. Asynchronous errors
      13. Errors not immediately detected at service points
      14. Hang
      15. Incorrect output
      16. Performance Problem
      17. User error
      18. Aging problem
      19. Memory leak
      20. Storage overlay
      21. Speed of problem-solving
      22. Eventually solving a problem on its first occurrence
      23. Software tool general classifications
      24. Serviceability rating (SR)
      25. Serviceability percentage (SP)
      26. Serviceability time (ST)
      27. The average American family has 2.3 children ... but none of them have 2.3 children...
      28. Collecting data to solve problems
      29. …tracing
    2. Software Service Tools
      1. Messages
      2. Problem Data Collector
      3. Version number, rev number or "About"…
      4. First symptoms of soft errors
      5. "Black box" data recorders – traces
      6. Generalized instruction-flow tracing
      7. Higher-Level, Application-Level and Component-Level Traces
      8. IBM z/OS Component Trace
      9. Storage dumps
      10. Performance Monitors
      11. Error Records
      12. Symptom-Solution Databases
      13. Automation Tools
      14. Phone-home and internet-connected notification
      15. Microsoft's Automatic Problem-Solving In The Cloud
    3. What Users Can and Should Do
      1. Prepare before you use the product in "production"
    4. Creating Software with First Fault Problem Solving Capability
      1. Defensive Programming, and being even more defensive than that
      2. Concepts
      3. Designers
      4. Developers
      5. Testers
      6. Testing the serviceability of a product
      7. Management
    5. The Special Needs of Hand-Held Computers, Cell-Phones, PDAs, and Other Small Systems
      1. Hardware and software self-service
      2. The urgent needs of the vendor service organization
      3. Hardware diagnostic information
      4. Hitting reset… often…
      5. Servicing a very large install base
      6. The crash log
    6. Commercially-Available First-Fault Problem-Solving Tools
      1. Purchase it (buy!) or program it (build!)?
      2. Purchase it (Buy!)
      3. Servicelink by Axeda
      4. Alarmpoint by BMC
      5. Problem externals data-gathering
      6. Recording User Screen Contents – "Externals tracing"
      7. Loglogic by Loglogic
      8. LogRhythm by LogRhythm
      9. Summary
    7. "Second Fault" Tools
      1. First-fault tools can fail to solve a problem
      2. How yucky is system YUK?
      3. First-fault vs. second fault problem solving
      4. Hacking and whacking vs. scripting
      5. When do I set a second-fault trap?
      6. Help from the hardware
      7. Use of Virtual Machines/hypervisors
    8. Maximizing the Value of Diagnostic Data
      1. Write things down
      2. What has changed?
      3. Troubleshooting techniques: Swap parts, and see if the problem's behavior changes.
      4. Given, To Find, Process
      5. 5 W's: Who, What, Where, When, Why
      6. Polya's techniques
      7. Brainstorming
      8. Kepner-Tregoe
      9. All Together Now
  8. The Future
    1. Leading Edge Software Tools
      1. Instant Replay from Replay Solutions
      2. ConicIT Mainframe Performance Analysis
      3. Cloud Computing Tools
      4. RIGHTNOW Cloud Computing Monitor
      5. The Amazon Cloud Dashboard
      6. CloudWatch from Amazon
      7. InternetPulse
      8. Twitter for generalized problem reporting
    2. Unanswered Questions
      1. Questions in general
      2. Questions you can answer at your site
    3. Directions and Suggestions
      1. Science Fiction
      2. Data Collection
      3. A wish for Intel
      4. Signature Creation
      5. Notification
      6. Signature matching to known problems
      7. Deep Analysis of Data
      8. Complex Problem Solving
    4. Summary
      1. Taking stock of where we are now
      2. Is the airplane black-box due for changes?
      3. Taking Action in your Organization
  9. Bibliography
  10. Notes