Panic! UNIX® System Crash Dump Analysis

Book Description

Designed as an introduction to UNIX system crash dump analysis, this is the first book to discuss in detail UNIX system panics, crashes, and hangs: their causes, what to do when they occur, how to collect information about them, how to analyze that information, and how to get the problem resolved.

Key topics: Part One covers theory and tools. Part Two looks inside UNIX, from the header files to hardware tape drives. Part Three provides actual case studies of software, hardware, data, and system fault problems.

Market: For systems and network administrators and technical support engineers responsible for maintaining UNIX computer systems and networks.

Table of Contents

  1. Copyright
  2. Figures
  3. Tables
  4. Code Examples
  5. Acknowledgments
  6. Introduction
    1. What happened?
    2. What will Panic! teach you?
    3. So many flavors of UNIX
    4. The audience
    5. Conventions used
    6. Contacting the authors
    7. Welcome to Panic!
  7. 1. Getting Started
    1. 1. My System Has Crashed!
      1. What is a system crash?
      2. What conditions cause panics?
      3. A word about bad traps
      4. The panic() routine
      5. How do you know if your system has panic’ed?
        1. Panic messages
        2. Stack traceback
        3. Dumping messages
        4. Reboot
      6. Capturing system crash information
      7. What is a program crash in comparison?
    2. 2. My System Is Hung!
      1. What is a system hang?
      2. What conditions cause hangs?
      3. How do you know if your system is hung?
      4. What is a program hang in comparison?
      5. Capturing system hang information
    3. 3. The savecore Program
      1. What is savecore?
      2. How does savecore work?
      3. Disk space requirement & locations
      4. Security issues
      5. Solaris 1: How to set up savecore
        1. Customizing /etc/rc.local
        2. Configuring a special dump device
      6. Solaris 2: How to set up savecore
        1. Customizing /etc/rc2.d/S20sysetup
        2. Configuring a special dump device
        3. Swapless systems
    4. 4. Hey! We Got One!
      1. What to do when your system has crashed
        1. Was the system recently tuned?
        2. Has anything else changed recently?
      2. Is the system still usable?
      3. Turn off savecore? (How many dumps will you need?)
      4. Saving the crash to tape for shipment or archives
    5. 5. Crashing Your Own System
      1. Crash your Solaris 2 system
      2. Crash your Solaris 1 system
    6. 6. Initial Analysis Without adb
      1. Identifying the UNIX release & hardware architecture
      2. The message buffer, msgbuf
        1. Strings and the case of the unknown customer
      3. Process status utilities: ps & pstat
      4. Network status: netstat
      5. NFS status: nfsstat
      6. Address resolution protocol status: arp
      7. Interprocess communication status: ipcs
      8. The crash program
      9. Summary
    7. 7. Introduction to adb
      1. Other debuggers
        1. dbx, dbxtool, & debugger
        2. The kernel resident absolute debugger, kadb
        3. The crash program
      2. adb hardware & software requirements
        1. Architecture & OS mismatches: Some adb error messages
      3. The distribution of adb
      4. The different uses of adb & kadb
        1. The object file
        2. The core file
        3. Using adb on crash dumps
        4. Using adb on live systems
        5. The kernel resident absolute debugger, kadb
      5. adb macros & /usr/lib/adb
      6. General startup syntax
        1. User program debugging
        2. Examining system crash dump postmortem files
        3. Examining a live system: Solaris 1
        4. Examining a live system: Solaris 2
      7. Security issues
      8. Other helpful files
    8. 8. adb: The Gory Details
      1. Basic commands
      2. Displaying data
        1. Addressing
          1. Binary operators
          2. Unary operators
          3. Pointers
          4. Logical negation
          5. Counts
          6. Commands
          7. The ? display command
          8. The / display command
        2. Formats
          1. Displaying data
          2. Formatting the output
      3. Locations & sizes
      4. Miscellaneous commands
        1. Value conversion
        2. $ commands
        3. Pattern searching
        4. Variables
        5. Writing to the object and core files
        6. Address map
        7. Interactive debugging sessions (process control)
      5. Summary
    9. 9. Initial Analysis Using adb
      1. Starting an adb session
      2. System identification
      3. Boot time, crash time, and uptime
      4. Panic strings
      5. The message buffer, msgbuf
      6. Stack tracebacks
      7. Summary
    10. 10. The /usr/include Header Files
      1. What is a header file?
      2. The /usr/include directories
        1. /usr/include
        2. /usr/include/admin
        3. /usr/include/arpa
        4. /usr/include/bsm
        5. /usr/include/des
        6. /usr/include/inet
        7. /usr/include/kerberos
        8. /usr/include/net
        9. /usr/include/netinet
        10. /usr/include/nfs
        11. /usr/include/protocols
        12. /usr/include/rpc
        13. /usr/include/rpcsvc
        14. /usr/include/security
        15. /usr/include/sys
        16. /usr/include/sys/debug
        17. /usr/include/sys/fpu
        18. /usr/include/sys/fs
        19. /usr/include/sys/proc
        20. /usr/include/sys/scsi
        21. /usr/include/sys/scsi/adapters
        22. /usr/include/sys/scsi/conf
        23. /usr/include/sys/scsi/generic
        24. /usr/include/sys/scsi/impl
        25. /usr/include/sys/scsi/targets
        26. /usr/include/vm
      3. /usr/kvm/sys
      4. /usr/share/src/uts
      5. /usr/ucbinclude
      6. Summary
    11. 11. Symbol Tables
      1. Namelists & the nm command
        1. A tiny example using Solaris 2
        2. A tiny example using Solaris 1
      2. Using adb to look at tiny’s variables
      3. A tiny summary
    12. 12. adb Macros: Part One
      1. The macro library
      2. Reading and understanding macros
        1. Invoking adb macros
      3. The utsname macro
      4. The bootobj macro
      5. Summary
    13. 13. adb Macros: Part Two
      1. The msgbuf and msgbuf.wrap macros
        1. Calling another macro
        2. Command count
        3. The msgbuf macro in use
      2. The cpus, cpus.nxt, & cpu macros
      3. Summary
    14. 14. adb Macros: Writing Your Own
      1. Exercise 1: Initial information
        1. Task:
        2. Hint:
        3. An example of output to aim for:
      2. Exercise 2: DNLC, the directory name lookup cache
        1. Task:
        2. Hints:
        3. An example of output to aim for:
      3. Exercise 3: Swap information
        1. Task:
        2. Hints:
        3. An example of output to aim for:
      4. Extra Credit Challenge: Which process on which CPU?
        1. Task:
        2. Givens:
        3. Hints:
        4. An example of output to aim for:
      5. Possible solutions
        1. Solution to exercise 1: Initial information
        2. Solution to exercise 2: DNLC, the directory name lookup cache
        3. Solution to exercise 3: Swap information
        4. Extra Credit Challenge: Which process on which CPU?
  8. 2. Advanced Studies
    1. 15. Introduction to Assembly
      1. High-level vs. low-level languages
      2. Assembly languages
      3. Basic CPU structure (all CPUs are similar)
      4. Instruction execution
        1. Instruction pipelining
        2. Floating-point coprocessors
      5. Instruction types
      6. Instruction formats and addressing modes
      7. Addressing and registers
      8. Data in memory
      9. On to SPARC!
    2. 16. Introduction to SPARC
      1. Basic characteristics of SPARC assembly language
      2. SPARC instructions
      3. SPARC registers
        1. Passing parameters when calling routines
        2. Register windows
        3. Register usage
          1. Global register zero
      4. SPARC instruction types
      5. Delayed Control Transfer Instructions
      6. Looking at instructions in memory
        1. How load and store instructions can go wrong
        2. How branch instructions can go wrong
        3. How other instructions can go wrong
        4. Finding trouble
      7. Want to learn more?
    3. 17. Stacks
      1. A generic stack
      2. The frame structure
        1. Solaris 1 header files
        2. Solaris 2 header files
      3. Instructions that affect windows & frames
        1. Windows diagrammed
        2. The save instruction
        3. The call and jmpl instructions
        4. The restore instruction
        5. Window overflows & underflows
      4. What have we got so far?
    4. 18. Stack Tracebacks
      1. Compiling with optimization
      2. The trouble with more than six arguments
    5. 19. A Kernel Overview
      1. Major sections
      2. Entering the kernel
      3. Scheduling: processes and threads
        1. SunOS 4.x
        2. Solaris 2
      4. File systems
      5. Files, inodes, and processes
      6. Memory Management
      7. This was just a kernel overview
    6. 20. Virtual Memory
      1. The free list
      2. Swap space
      3. Page faults
      4. Working set, or resident set
      5. Keeping track of pages
      6. Keeping track of process space
      7. Anonymous memory
      8. Kernel functions
        1. malloc(): SunOS 4.x
        2. malloc(): Solaris 2
      9. Virtual memory routines
      10. Address spaces
      11. Segments
      12. Pages
      13. The hat layer
    7. 21. Scheduling
      1. SunOS 4.x
      2. Solaris 2
      3. How you change it
    8. 22. File Systems
      1. Basic disk structure
      2. The old original
        1. The BSD file system
        2. Broadening horizons
      3. VFS functions
        1. Vnodes
        2. General VFS & vnode operations
      4. UNIX File System (UFS)
      5. Other file systems
    9. 23. Hardware Devices and Drivers
      1. Drivers and device control
        1. SunOS 4.x drivers
          1. Autoconfiguration
          2. Device switches
          3. Driver code
        2. Solaris 2 drivers
      2. Driver functions
      3. Real hardware
      4. Drivers and crashes
    10. 24. Interprocess Communication
      1. Semaphores
        1. Tunable parameters for semaphores
        2. Internal variables
        3. Internal structures
        4. Functions
      2. Messages
        1. Tunable parameters for messages
        2. Internal variables
        3. Internal structures
        4. Functions
      3. Shared memory
        1. Tunable parameters for shared memory
        2. Internal variables
        3. Internal structures
        4. Functions
      4. Common functions
      5. Why all this?
    11. 25. STREAMS
      1. STREAMS structure
      2. Data structures
      3. Queues
      4. Message structures
      5. Data blocks
      6. Pipes
      7. Basic functions
      8. Support functions
      9. Digging around inside
    12. 26. Trap Handling
      1. Kinds of traps
      2. Trap sequence
      3. Trap frames
      4. Trap types
        1. Returning from traps
    13. 27. Watchdog Resets
      1. What is a watchdog?
        1. sun4d systems
        2. /usr/kvm/prtdiag: A special sun4d command
      2. Can you get a core file?
      3. What do you do next?
        1. An alternative console device
      4. Watchdog analysis
      5. Summary
      6. For further reference
    14. 28. Interrupts
      1. SPARC systems
      2. Priority levels
      3. Serial devices
      4. Vectored interrupts
      5. Polled interrupts
      6. Interrupts in tracebacks
    15. 29. Multiprocessor Kernels
      1. Data protection
      2. SunOS 4.x multiprocessor systems
        1. SunOS 4.x lock code
        2. SunOS 4.x CPU structure
      3. Solaris 2
        1. Atomic instructions
        2. Mutex structure
        3. Other locks: semaphores, reader/writers, condition variables
        4. Blocking & sleeping
        5. Solaris 2 waiters
        6. Mutex locks
  9. 3. Case Histories
    1. 30. Network Troubles
      1. Initial analysis
      2. Check the instruction
      3. Bug check
      4. Resolution
    2. 31. A Stomped-on Module
      1. Strings output
      2. Analysis using adb
      3. Walking the stack by hand
      4. The ipcaccess() routine
      5. Loading the semsys module with modload
      6. Remember to use the same OS!
      7. Is it a hardware failure?
      8. How about a software problem?
      9. Using nm to query symbol values
      10. adb’s search command at work
        1. Masked searches within adb
      11. Conclusion
    3. 32. Hanging Instead of Swapping
      1. Initial information
      2. Process status
      3. Stack tracebacks for every process
      4. Resolution
    4. 33. Panic’ed Pipes
      1. Always get initial information
      2. Walking the stack by hand
      3. Examining assembly code: fifo_rdwr()
      4. Calling parameters
      5. Examining assembly code: vno_rw()
      6. Don’t work too hard! Use SunSolve!
      7. Resolution
    5. 34. A Sleeping Dragon
      1. Initial information
      2. Walking the stack by hand
      3. Using the threadlist macro
      4. Examining mutex locks
      5. Data address not found? Dig deeper!
      6. Additional analysis tools
      7. Which processes were waiting for locks?
      8. Loadable kernel modules
      9. Conclusion
    6. 35. Once Is Not Enough
      1. The first captured crash
        1. Always get initial information
        2. A trap occurred
        3. The sethi instruction
        4. Examining nPC
        5. Something went wrong!
      2. The second captured crash
        1. Initial information
        2. Collecting trap information
        3. The orcc instruction
        4. Examining nPC
      3. The other crashes
      4. The solution
      5. Conclusion
    7. 36. Life Without A Root Directory
      1. Get initial information
      2. Invoking mutex_enter()
      3. What process was involved?
      4. Going back through time
      5. What’s still useful in the stack traceback?
        1. The lookupname() routine
        2. The pn_get() routine
        3. The lookuppn() routine
        4. The magic of %g7 in Solaris 2.3 & 2.4
      6. Why was FrameMaker involved?
    8. 37. Disk Woes in the Wee Hours
      1. Get initial information
      2. Look for patterns
      3. Inodes & vnodes
      4. Crash 11: A closer look
      5. Crash 12: A closer look
      6. Crashes 13 & 14: A closer look
      7. Resolution
  10. A. SPARC: The Gory Details
    1. All CPUs are similar
    2. The SPARC processor
    3. Integer Unit (IU)
      1. General-purpose registers
      2. PC — Program Counter
      3. nPC — next Program Counter
      4. PSR — Processor Status Register
      5. TBR — Trap Base Register
      6. WIM — Window Invalid Mask
      7. Y — Multiply/Divide Register
      8. ASRs — Ancillary Status Registers (optional)
      9. DTQs - Deferred-Trap Queues (optional)
    4. Floating-Point Unit (FPU)
      1. FPU F Registers
      2. Floating-point State Register
      3. Floating-point deferred-trap queue (FQ)
    5. Coprocessor (CP)
    6. Windows & use of SPARC registers
    7. In closing
  11. B. SPARC Instruction Set
    1. Instruction set summary
      1. Operation codes & instruction formats
      2. Assembly language syntax
      3. Instruction syntax
        1. Registers
        2. Special symbol names
        3. Operand values
        4. Register values
        5. Labels
    2. Memory access instructions
      1. Load instructions
      2. Loading from alternate address space
      3. Loading from the floating-point unit & coprocessor
      4. Store instructions
      5. Storing from the floating-point unit & coprocessor
      6. Atomic memory access instructions
      7. Possible traps during memory access instructions
    3. Arithmetic / logical / shift instructions
      1. Integer arithmetic
      2. Logical instructions
      3. Shift instructions
      4. Miscellaneous arithmetic / logical / shift instructions
    4. Control transfer instructions
      1. Branch on integer condition codes instructions
      2. Branch on FPU condition codes instructions
      3. Branch on coprocessor condition codes instructions
      4. The annul bit
      5. Unconditional branches
      6. Trap on integer condition codes instructions
      7. Orderly return from a trap
    5. State register instructions
      1. Miscellaneous state register instructions
    6. Floating-point unit instructions
      1. Floating-point arithmetic instructions
      2. Floating-point value conversions
      3. Floating-point value comparisons
      4. Miscellaneous floating-point instructions
    7. Coprocessor instructions
    8. Synthetic instructions
    9. Always keep an instruction set reference handy!