High Availability: Design, Techniques, and Processes

Book Description

The complete "how-to guide" for maximizing the availability of enterprise systems.

Training, support, backup, and maintenance account for nearly 80 percent of the total cost of today's enterprise applications, and much of that money is spent trying to squeeze increased availability out of applications in spite of weak design and management processes. In High Availability, two leading IT experts bring together best practices for every people- and process-related issue associated with maximizing application availability. The goal: to help enterprises dramatically improve the value of their strategic applications without investing a dime more than necessary.

  • Enhancing all four key elements of availability: reliability, recoverability, serviceability, and manageability

  • Understanding how your users define availability

  • Planning achievable service level agreements and delivering on them

  • Strategies for multiple platforms, from the mainframe to the desktop

  • Lowering administrative costs through standardization and other techniques

  • Redundancy, backup, fault tolerance, partitioning, automation, and other high availability solutions

  • Leveraging availability features built into your existing hardware and operating systems

Discover how to create systems that will be easier to maintain, anticipate and prevent problems, and define ongoing availability strategies that account for business change. Whatever your IT role, whatever your IT architecture, this book can help you deliver the breakthrough availability levels your organization needs right now.

    Table of Contents

    1. Copyright
    2. Introduction
    3. Acknowledgments
    4. About the Authors
    5. 1. Today's Computing Environment
      1. Complexity, Complexity, Complexity
        1. Multiple Technologies and Protocols
        2. Multiple Vendors
        3. Varied Users
        4. Multiple Locations
        5. Rapid Change
        6. Greater Business Demands
        7. A Daunting Environment To Work In
      2. The Total Cost of Ownership Issue
        1. Total Cost of Ownership Defined
        2. Industry TCO Estimates
        3. What TCO Studies Reveal
        4. The Underlying Reason for High TCO
        5. A Typical Scenario: Choosing Office Systems
        6. Availability as the Most Significant Contributor to TCO
      3. Summary
    6. 2. Achieving Higher Availability
      1. Determining User Availability Requirements
        1. The Service Level Agreement
        2. Helping Users Identify Their Availability Requirements
      2. Availability Levels and Measurements
        1. High Availability Level
        2. Continuous Operations Level
        3. Continuous Availability
        4. Quantifying Availability Targets
        5. Availability: A User Metric
        6. Measuring End-To-End Availability
      3. Summary
    7. 3. Planning for System Availability
      1. Identifying System Components
      2. Addressing Critical Components
      3. The Four Elements of Availability
      4. Summary
    8. 4. Preparing for Systems Management
      1. Processes, Data, Tools, and Organization
      2. Systems Management in the PC World (or the Lack of It)
      3. IT Organizations: Away from Centralization, Then Back Again
      4. Understanding the Systems To Manage
      5. The Basics of Management: Five Phases
        1. Phase 1: Setting Objectives
        2. Phase 2: Planning
        3. Phase 3: Execution
        4. Phase 4: Measurement
        5. Phase 5: Control
      6. Identifying the Systems Management Disciplines
    9. 5. Implementing Service-Level Management
      1. Service-Level Management
        1. Process Requirements
          1. Step 1: Define service-level standards
          2. Step 2: Establish service levels to be attained
          3. Step 3: Monitor achievement of service levels
          4. Step 4: Analyze service-level attainment and report to higher management
          5. Step 5: Redefine service levels, if necessary
        2. Data and Measurement Requirements
        3. Organization Requirements
        4. Tools Requirements
        5. Benefits of Service-Level Management
      2. Problem Management
        1. Process Requirements
          1. Step 1: Define problem management process and practices
          2. Step 2: Detect or recognize the problem
          3. Step 3: Bypass the problem
          4. Step 4: Analyze the problem
          5. Step 5: Manage the problem to resolution
          6. Step 6: Report on the status and trends of problems
          7. Step 7: Redefine the problem management process if necessary
        2. Data Requirements
        3. Organization Requirements
        4. Tools Requirements
        5. Benefits of Problem Management
      3. Change Management
        1. Process Requirements
          1. Step 1: Define change management process and practices
          2. Step 2: Receive change requests
          3. Step 3: Plan for implementation of changes
          4. Step 4: Implement and monitor the changes, back out changes if necessary
          5. Step 5: Evaluate and report on changes implemented
          6. Step 6: Modify change management plan if necessary
        2. Data Requirements
        3. Organization Requirements
        4. Tools Requirements
        5. Benefits
      4. Security Management
        1. Process Requirements
          1. Step 1: Determine and evaluate IT assets
          2. Step 2: Analyze risk
          3. Step 3: Define security practices
          4. Step 4: Implement security practices
          5. Step 5: Monitor for violations and take corresponding actions
          6. Step 6: Reevaluate IT assets and risks
        2. Data Requirements
        3. Organization Requirements
        4. Tools Requirements
        5. Benefits
      5. Asset and Configuration Management
        1. Process Requirements
          1. Step 1: Define asset and configuration data requirements
          2. Step 2: Identify asset and configuration information gathering and update procedures
          3. Step 3: Gather asset and configuration information and update procedures
          4. Step 4: Provide information to other systems management disciplines
          5. Step 5: Analyze asset and configuration information quality
          6. Step 6: Reevaluate asset and configuration data and their update requirements
        2. Data Requirements
        3. Organization Requirements
        4. Tools Requirements
      6. Availability Management
        1. Process Requirements
          1. Step 1: Define a plan to achieve availability targets
          2. Step 2: Track availability targets and their achievement
          3. Step 3: Analyze and report on availability achievements
          4. Step 4: Update availability plan
        2. Data Requirements
        3. Organization Requirements
        4. Tools Requirements
        5. Benefits
    10. 6. From Centralized to Distributed Computing Environments
      1. Systems Management Disciplines
      2. The Centralized Computing Environment
      3. The Distributed Computing Environment
      4. Systems Management in Today's Computing Environment
        1. Defining Appropriate Functions and Control
          1. Centralized management and control
          2. Centralized management, distributed control
          3. Distributed management and control
        2. Choosing a Deployment Strategy
          1. Capability to manage remote resources
          2. Skills availability at remote locations
          3. Performance impact of managing remote resources centrally
          4. The need for greater security and control
          5. Physical proximity of resources to each other
      5. Developing a Deployment Strategy
        1. Management by Exception
        2. Policy-Based Management
        3. Standardization of Performance Data
        4. Accountability of the Distributed Systems Manager
        5. Central Definition of Systems Management Architectures
        6. Process Ownership
      6. Summary
    11. 7. Techniques That Address Multiple Availability Requirements
      1. Redundancy
        1. Hardware Redundancy Examples
        2. Software Redundancy Examples
        3. Environmental Redundancy Example
        4. Critical Success Factors
      2. Backup of Critical Resources
        1. Methods of Backup
        2. Hardware Backup Examples
        3. Software Backup Examples
        4. IT Operations Backup Examples
        5. Critical Success Factors
          1. Currency of backup
          2. Automated updating of backups
          3. Isolation of backup from primary
          4. Backup and restore procedure review and testing
          5. Generations of backups
          6. Integrity verification
      3. Clustering
        1. Comparing Clustering and Redundancy
        2. Hardware and Software Clustering Examples
        3. IT Operations Clustering Examples
        4. Environmental Clustering Examples
        5. Critical Success Factors
          1. Automatic load-sharing
          2. Physical separation of clustered components
      4. Fault Tolerance
        1. Hardware Fault Tolerance Examples
        2. Software Fault Tolerance Examples
        3. Environmental Fault Tolerance Examples
        4. Critical Success Factors
      5. Isolation or Partitioning
        1. Hardware Isolation Examples
        2. Software Isolation Examples
        3. Other Benefits of Isolation
          1. Minimize risk of changes
          2. Reduce resource contention
          3. Maximize resources
          4. Simpler systems management procedures
        4. Critical Success Factors
      6. Automated Operations
        1. Console and Network Operations Examples
        2. Workload Management Examples
        3. System Resource Monitoring Examples
        4. Problem Management Applications
        5. Distribution of Resources Example
        6. Backup and Restore Examples
        7. Critical Success Factors
      7. Access Security Mechanisms
        1. Steps to Secure Access
          1. Step 1: Identify the person requesting access
          2. Step 2: Verify the identity
          3. Step 3: Control access
          4. Step 4: Monitor all activities
        2. Types of Security
          1. Physical security
          2. Network security
          3. Application security
          4. Computer resource security
        3. Password Management
          1. Step 1: Enforce password selection guidelines
          2. Step 2: Expire passwords regularly
          3. Step 3: Expire assigned passwords on first use
          4. Step 4: Disable user accounts after successive invalid password attempts
          5. Step 5: Educate users on how to protect their password information
        4. Critical Success Factors
      8. Standardization
        1. Hardware Standardization Examples
        2. Software Standardization Examples
        3. Network Standardization Examples
        4. Processes and Procedures Standardization Examples
        5. Naming Standardization Examples
        6. Critical Success Factors
        7. Transitioning to Standardization
      9. Summary
    12. 8. Special Techniques for System Reliability
      1. The Use of Reliable Components
        1. Techniques for Maximizing Hardware Component Reliability
          1. Choose components with low failure rates
          2. Choose components that have high MTBF
          3. Purchase from reputable suppliers
          4. Use technical specifications as a gauge
          5. Choose products with fewer parts or greater integration
          6. Avoid newly developed products whenever possible
          7. Follow maintenance schedules diligently
        2. Techniques for Maximizing Software Component Reliability
          1. Avoid using "Version 1" and "Beta" software
          2. Don't use shareware or freeware
          3. Buy industry-standard software from reliable vendors
          4. Prior to installation, test for viruses
          5. Provide menus and other ways to control user inputs
          6. Reuse bug-free components or modules
          7. Test programs thoroughly
          8. Run "beta tests" with a controlled set of users
          9. Install the latest application software fixes judiciously
          10. Install the latest device drivers when available
          11. Upgrade to newer operating systems with caution
          12. Minimize the use of system utilities
        3. Personnel-Related Techniques for Maximizing Reliability
          1. Ensure high-quality user training
          2. Ensure quality training of support staff members
          3. Be wary of contractual hires
        4. Environment-Related Techniques for Maximizing Reliability
          1. Install Automatic Voltage Regulators (AVRs)
          2. Use adequate air-conditioning equipment
        5. Some Reliability Indicators for Suppliers
          1. Time in business
          2. Quality certification
          3. Industry awards
          4. Peer recommendation
          5. Warranty and support
      2. Programming to Minimize Failures
        1. Correctness
          1. Ensure user requirements are adequately determined
          2. Prototype the application prior to detailed coding
          3. Revalidate user requirements midway through the project
          4. Beta test prior to wide-scale deployment
        2. Robustness
          1. Test against out-of-bounds values
          2. Trap errors and prevent them from propagating
          3. Anticipate external changes
        3. Extensibility
          1. User changes
          2. System platform changes
          3. Regulatory changes
          4. Budgetary changes
          5. Business volume changes
          6. Business demand changes
          7. Generous database field sizes
          8. Design with overcapacity
          9. Place constant values in a look-up table
        4. Reusability
      3. Implement Environmental Independence Measures
        1. Use Power Generators
        2. Use Independent Air-Conditioning Units
        3. Use Fire Protection Systems
        4. Use Raised Flooring
        5. Install Equipment Wheel Locks
        6. Locate Computer Room on the Second Floor
      4. Utilize Fault Avoidance Measures
        1. Analyzing Problem Trends and Statistics
        2. Use of Advanced Hardware Technologies
        3. Use of Software Maintenance Tools
      5. Summary
    13. 9. Special Techniques for System Recoverability
      1. Automatic Fault Recognition
        1. Parity Checking Memory
        2. ECC Memory
        3. Data Validation Routines
      2. Fast Recovery Techniques
      3. Minimizing Use of Volatile Storage Media
        1. Regular Database Updates to Central Storage
        2. Automatic File-Save Features
      4. Summary
    14. 10. Special Techniques for System Serviceability
      1. Online System Redefinition
        1. Add or Remove I/O Devices
        2. Selectively Power Down Subsystems
        3. Commit or Reject Changes
      2. Informative Error Messages
        1. Use Standard Corporate Terminology
        2. Adopt Terms Already Used by Common Applications
        3. Tell What, Why, Impact, and How
        4. Implement Context-Sensitive Help
        5. Give Options for Viewing More Detailed Error Information
        6. Make Error Information Available After the Error Has Been Cleared
      3. Complete Documentation
        1. Have a Manual of Operations On Hand
        2. Write Basic Problem Isolation and Recovery Guides
        3. Provide System Configuration Diagrams
        4. Label Resources
        5. Provide a Complete Technical Library
      4. Installation of Latest Fixes and Patches
      5. Summary
    15. 11. Special Techniques for System Manageability
      1. Use Manageable Components
        1. Simple Network Management Protocol (SNMP)
        2. Common Management Information Protocol (CMIP)
        3. Desktop Management Interface (DMI)
        4. Common Information Management Format (CIM)
        5. Wired for Management (WfM)
      2. Management Applications
        1. Systems Management Issues
          1. Deployment
          2. Operations
          3. Security
        2. Automated Systems Management Capabilities
        3. System Management Applications and Frameworks
          1. Unicenter TNG (Computer Associates)
          2. Tivoli (IBM)
      3. Educate IS Personnel on Systems Management Disciplines
        1. Business Value of the Information System
        2. Value of Systems Management Disciplines
        3. Principles of Management
        4. Basic Numerical Analysis Skills
      4. Summary
    16. 12. All Together Now
      1. The Value of Systems Management Disciplines
      2. Which One First?
      3. Analyze Outages
      4. Identify Single Points of Failure
      5. Exploit What You Have
      6. An Implementation Strategy
      7. Summary
    17. A. Availability Features of Selected Products
      1. Availability Features of Selected Operating Systems
        1. Availability Features of Novell NetWare
          1. SFT II (System Fault Tolerance level 2) or mirroring/duplexing
          2. SFT III (System Fault Tolerance level 3)
          3. Dynamic load/unload
          4. Client auto-reconnect
          5. Kernel fault recovery or ABEND recovery
          6. Novell Replication Services (NRS)
          7. Novell Application Launcher (NAL)
          8. Hot plug PCI
          9. Multiprocessor Kernel (MPK)
          10. Intelligent I/O (I2O)
          11. Memory protection
          12. Flexible Mirroring, Phase I
          13. Novell Storage Services (NSS)
          14. Hierarchical Storage Management (HSM)
          15. NetWare Cluster Services (NWCS)
        2. Availability Features of Sun Solaris 8
          1. Sun Cluster
          2. Solaris Resource Manager/Solaris Bandwidth Manager
          3. Dynamic Reconfiguration/Automated Dynamic Reconfiguration
          4. Network multipathing
          5. Live upgrades
          6. Hot patching for diagnostics
          7. Improved crash dump analysis
          8. Improved program analysis
          9. Better examination of core files
          10. Bus performance monitoring
          11. Better management of core files
          12. Improved device configuration
          13. Macro-level debugging
          14. Remote console messaging
          15. TCP/IP network diagnostics
          16. IP packet routing observability
          17. System crash dump utility
          18. Enhanced process tracing
        3. Availability Features of AIX
          1. Logical Volume Manager (LVM)
          2. Disk mirroring
          3. Bad block relocation
          4. Journaled File System (JFS)
          5. Dynamic AIX kernel
          6. High Availability Cluster Multiprocessing (HACMP)
          7. System Resource Controller (SRC)
          8. Configuration manager
          9. AIX update facilities
        4. Availability Features of Microsoft Windows 2000 Server and Professional
          1. Windows File Protection
          2. Driver certification
          3. Kernel-mode write protection
          4. IIS application protection
          5. Cluster services and Network Load Balancing (Advanced Server and Datacenter Server)
          6. Job object API and process control
          7. Application certification & DLL protection
          8. Distributed File System (Dfs)
          9. Disk quotas
          10. Hierarchical Storage Management (HSM)
          11. Rolling upgrade support (Advanced Server and Datacenter Server)
          12. Dynamic Volume Management
          13. Error handling and protected subsystems
          14. Automatic restart
          15. Kill process tree
          16. System preparation tool
          17. Windows Installer
          18. Plug and Play (PnP)
          19. Service pack slipstreaming
          20. Integrated directory services (Active Directory)
          21. Windows Management Instrumentation (WMI)
          22. Delegated administration
          23. Microsoft Management Console (MMC)
          24. Windows Script Host (WSH)
          25. Group policies and centralized desktop management
          26. Recoverable file system
          27. Disk mirroring (RAID Level 1)
          28. Disk duplexing
          29. Disk striping with parity (RAID Level 5)
        5. Availability Features of IBM OS/400
          1. Policy-based Backup Recovery and Management Services (BRMS)
          2. Commitment control and journaling
          3. System Managed Access Path Protection (SMAPP)
      2. Availability Features of Selected Hardware Components
        1. Availability Features of IBM S/390 Integrated Server
          1. High availability storage devices
          2. SSA Disk: Serial Storage Architecture (SSA)
          3. Power management
        2. Availability Features of the IBM AS/400 Midrange System
          1. Logical partitions (LPAR)
          2. Disk failure recovery
          3. Power utility failure recovery
          4. Continuously Powered Main (CPM) storage
          5. Operations Navigator
          6. AS/400 system clustering
        3. Availability Features of the IBM RS/6000
          1. Built-in error detection and correction
          2. Backup power supply
          3. Battery backup systems
          4. Redundant or spare disks
          5. Hot-pluggable disk drives
          6. Multi-tailed disks and shared volume groups
          7. RAID disk arrays
        4. Availability Features of Compaq Proliant Servers
          1. ECC memory
          2. SMART-2 Array Controller technology
          3. Redundant network interface cards
          4. Uninterruptible Power Supply
          5. Compaq Insight Manager
          6. Standby Recovery Server
          7. On-Line Recovery Server
          8. Rapid recovery
          9. Redundant power supply
          10. Off-line backup processor
          11. External storage
      3. Availability Features of Selected Software Components
        1. Availability Features of the Oracle8i Database
          1. Cache Fusion clustering
          2. Fast-Start architecture
          3. Online reorganization
          4. Single system view