IBM High Performance Computing Cluster Health Check

Book Description

This IBM® Redbooks® publication describes how to perform infrastructure health checks, such as checking the configuration and verifying the functionality of the common subsystems (nodes or servers, switch fabric, parallel file system, job management, problem areas, and so on).

This IBM Redbooks publication documents how to monitor the overall health of the cluster infrastructure, to deliver cost-effective, highly scalable, and robust solutions to technical computing clients.

This IBM Redbooks publication is targeted toward technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) responsible for delivering cost-effective Technical Computing and IBM High Performance Computing (HPC) solutions to optimize business results, product development, and scientific discoveries. This book provides a broad understanding of a new architecture.

Table of Contents

  1. Front cover
  2. Notices
    1. Trademarks
  3. Preface
    1. Authors
    2. Now you can become a published author, too!
    3. Comments welcome
    4. Stay connected to IBM Redbooks
  4. Chapter 1. Introduction
    1. 1.1 Overview of the IBM HPC solution
    2. 1.2 Why we need a methodical approach for cluster consistency checking
    3. 1.3 Tools and interpreting their results for HW and SW states
      1. 1.3.1 General Parallel File System
      2. 1.3.2 Extreme Cloud Administration Toolkit
      3. 1.3.3 The OpenFabrics Enterprise Distribution
      4. 1.3.4 Red Hat Package Manager
    4. 1.4 Tools and interpreting their results for identifying performance inconsistencies
    5. 1.5 Template of diagnostics steps that can be used (checklists)
  5. Chapter 2. Key concepts and interdependencies
    1. 2.1 Introduction to High Performance Computing
    2. 2.2 Rationale for clusters
    3. 2.3 Definition of an HPC Cluster
    4. 2.4 Definition of a “healthy cluster”
    5. 2.5 HPC preferred practices
  6. Chapter 3. The health lifecycle methodology
    1. 3.1 Why a methodology is necessary
    2. 3.2 The health lifecycle methodology
    3. 3.3 Practical application of the health lifecycle methodology
      1. 3.3.1 Deployment phase
      2. 3.3.2 Verification or pre-production readiness phase
      3. 3.3.3 Production phase (monitoring)
  7. Chapter 4. Cluster components reference model
    1. 4.1 Overview of installed cluster systems
    2. 4.2 ClusterA nodes hardware description
    3. 4.3 ClusterA software description
    4. 4.4 ClusterB nodes hardware description
    5. 4.5 ClusterB software description
    6. 4.6 ClusterC nodes hardware description
    7. 4.7 ClusterC software description
    8. 4.8 Interconnect infrastructure
      1. 4.8.1 InfiniBand
      2. 4.8.2 Ethernet Infrastructure
      3. 4.8.3 IP Infrastructure
    9. 4.9 GPFS cluster
  8. Chapter 5. Toolkits for verifying health (individual diagnostics)
    1. 5.1 Introduction to CHC
      1. 5.1.1 Requirements
      2. 5.1.2 Installation
      3. 5.1.3 Configuration
      4. 5.1.4 Usage
    2. 5.2 Tool output processing methods
      1. 5.2.1 The plain mode
      2. 5.2.2 The xcoll mode
      3. 5.2.3 The compare (config_check) mode
    3. 5.3 Compute node
      1. 5.3.1 The leds check
      2. 5.3.2 The cpu check
      3. 5.3.3 The memory check
      4. 5.3.4 The os check
      5. 5.3.5 The firmware check
      6. 5.3.6 The temp check
      7. 5.3.7 The run_daxpy check
      8. 5.3.8 The run_dgemm check
    4. 5.4 Ethernet network: Port status, speed, bandwidth, and port errors
      1. 5.4.1 Ethernet firmware and drivers
      2. 5.4.2 Ethernet port state
      3. 5.4.3 Network settings
      4. 5.4.4 Bonding
    5. 5.5 InfiniBand: Port status, speed, bandwidth, port errors, and subnet manager
      1. 5.5.1 The hca_basic check
      2. 5.5.2 The ipoib check
      3. 5.5.3 The switch_module check
      4. 5.5.4 The switch_ntp check
      5. 5.5.5 The switch_inv check
      6. 5.5.6 The switch_health check
      7. 5.5.7 The switch_clk check
      8. 5.5.8 The switch_code check
      9. 5.5.9 The run_ppping check
      10. 5.5.10 The run_jlink check
      11. 5.5.11 The run_ibtools check
    6. 5.6 File system: Accessibility, usage, and read/write performance
      1. 5.6.1 The fs_usage check
      2. 5.6.2 NFS file system
      3. 5.6.3 GPFS file system
  9. Appendix A. Commonly used tools
    1. Overview of non-CHC tools
    2. InfiniBand-related tools
    3. GPFS related tools
    4. Network-related tools
    5. Disk benchmarks
    6. Node-specific tools
    7. IBM HPC Central
  10. Appendix B. Tools and commands outside of the toolkit
    1. Remote management access
  11. Related publications
    1. IBM Redbooks
    2. Online resources
    3. Help from IBM
  12. Back cover