UNIX® Fault Management: A Guide for System Administrators

Book Description



Maximize UNIX system integrity and availability in mission-critical environments!

If you're responsible for maintaining the integrity and availability of a mission-critical UNIX system, then you need UNIX Fault Management: A Guide for System Administrators, the first book that brings together all of the monitoring and fault management information. Expert UNIX system management engineers Brad Stone and Julie Symons show you exactly how to implement appropriate, cost-effective system monitoring on any UNIX server, including systems configured as high availability clusters. You'll learn how to:

  • Plan for, and establish, cost-effective, reliable system monitoring procedures

  • Monitor systems, disks, networks, applications, and databases

  • Detect, investigate, and recover from server problems

  • Implement best practices for high availability in enterprise-class UNIX installations, including clusters

  • Take advantage of key fault management trends, new standards, and new technologies

This book contains detailed descriptions of fault monitoring tools and monitoring frameworks to help you make better purchasing decisions. You'll also find a handy quick reference of monitoring tasks and techniques for operators, including specific, step-by-step recovery solutions. If you can't afford one nanosecond more downtime than necessary, you can't afford to be without UNIX Fault Management.

    Table of Contents

    1. Copyright
    2. Preface
    3. Acknowledgments
    4. Analyzing the Role of System Operators
      1. Trends in System Operations
    5. Enumerating Possible Events
      1. Defining Fault Management
      2. Event Categories
    6. Using Monitoring Frameworks
      1. Distinguishing Monitoring Frameworks
      2. IT/Operations
      3. Unicenter TNG
      4. Event Monitoring Service
      5. PLATINUM ProVision
      6. BMC PATROL
      7. MeasureWare
    7. Monitoring the System
      1. Identifying Important System Monitoring Categories
      2. Using Standard Commands and Tools
      3. Using System Instrumentation
      4. Using Graphical Status Monitors
      5. Using Event Monitoring Tools
      6. Security Monitoring
      7. Using Diagnostic Tools
      8. Monitoring System Peripherals
      9. Collecting System Performance Data
      10. Using System Performance Data
      11. Avoiding System Problems
      12. Recovering from System Problems
      13. Comparing System Monitoring Tools
      14. Case Study: Recovering from Memory Faults
    8. Monitoring the Disks
      1. Identifying Important Disk Monitoring Categories
      2. Using Standard Commands and Tools
      3. Using System Instrumentation
      4. Using Event Monitoring Tools
      5. Using Diagnostic Tools
      6. Collecting Disk Performance Data
      7. Using Disk Performance Data
      8. Avoiding Disk Problems
      9. Recovering from Disk Problems
      10. Comparing Disk Monitoring Products
      11. Case Study: Configuring and Monitoring for Mirrored Disks
    9. Monitoring the Network
      1. Identifying Important Network Components to Monitor
      2. Using Graphical Network Status Monitors
      3. Monitoring Network Interface Card and Cable Failures
      4. Monitoring Networking and Transport Protocols
      5. Monitoring Network Services
      6. Monitoring Network Hosts
      7. Collecting Network Performance Data
      8. Using Network Performance Data
      9. Avoiding Network Problems
      10. Recovering from Network Problems
    10. Monitoring the Application
      1. Important Application Components to Monitor
      2. Identifying Application Types
      3. Using Standard Commands and Tools
      4. Using System Instrumentation
      5. Fault Detection Tools
      6. Monitoring Tools for ERP Applications
      7. Resource and Performance Monitoring Tools
      8. Controlling Application Performance
      9. Recovering from Application Problems
      10. Comparison of Application Monitoring Products
    11. Monitoring the Database
      1. Identifying Important Database Monitoring Categories
      2. Using Standard Database Commands and Tools
      3. Using Fault Detection and Recovery Tools
      4. Resource and Performance Monitoring Tools
      5. Using Database Performance Data
      6. Avoiding Database Problems
      7. Recovering from Database Problems
      8. Comparison of Database Monitoring Products
    12. Enterprise Management
      1. Monitoring Across an Enterprise
      2. Identifying Events
      3. Using Event Correlation Tools
      4. Monitoring Multiple Systems
      5. Enterprise Management Frameworks
      6. Using Multiple Tools
    13. UNIX Futures
      1. Future Trends in Fault Management
    14. Standards
      1. Using SNMP and MIBs
      2. Using DMI and MIFs
    15. Glossary