Self-Service Linux®: Mastering the Art of Problem Determination

Book Description

"This welcome addition to the Linux bookshelf provides real insight into the black-art of debugging. All too often debugging books concentrate solely on the tools but this book avoids that pitfall by concentrating on examples. The authors dissect and discuss each example in detail; in so doing they give invaluable insight into the Linux environment."

Richard J Moore, IBM Advanced Linux Response Team-Linux Technology Centre

"A plethora of Linux books exist but this guide offers a definitive overview of practical hints and tips for Linux users. Written by experts in the field, it will be extremely useful for system administrators and Linux enthusiasts."

Markus Rex, VP and General Manager, SUSE LINUX

  • The indispensable troubleshooting resource for every Linux administrator, developer, support professional, and power user!

  • Systematically resolve errors, crashes, hangs, performance slowdowns, unexpected behavior, and unexpected outputs

  • Master essential Linux troubleshooting tools, including strace, gdb, kdb, SysRq, /proc, and more

  • The indispensable start-to-finish troubleshooting guide for every Linux professional

    Now, there's a systematic, practical guide to Linux troubleshooting for every power user, administrator, and developer. In Self-Service Linux®, two of IBM's leading Linux experts introduce a four-step methodology for identifying and resolving every type of Linux-related system or application problem: errors, crashes, hangs, performance slowdowns, unexpected behavior, and unexpected outputs. You'll learn exactly how to use Linux's key troubleshooting tools to solve problems on your own—and how to make effective use of the Linux community's knowledge.

    If you use Linux professionally, this book can dramatically increase your efficiency, productivity, and marketability. If you're involved with deploying or managing Linux in the enterprise, it can help you significantly reduce operation costs, enhance availability, and improve ROI.

  • Discover proven best practices for diagnosing problems in Linux environments

  • Leverage troubleshooting skills you've developed with other platforms

  • Learn to identify problems with strace—the most frequently used Linux troubleshooting tool

  • Use /proc to uncover crucial information about hardware, kernels, and processes

  • Recompile open source applications with debug information

  • Debug applications with gdb, including C++ and threaded applications

  • Debug kernel crashes and hangs, one step at a time

  • Understand the Executable and Linking Format (ELF), and use that knowledge for more effective debugging

  • Includes a production-ready data collection script that can save you hours or days in debugging mission-critical Linux systems!

  • Series Editor Bruce Perens is an open source evangelist, developer, and consultant whose software is a major component of most commercial embedded Linux offerings. He founded or cofounded Linux Standard Base, Open Source Initiative, and Software in the Public Interest. As Debian GNU/Linux Project Leader, he was instrumental in getting the system on two U.S. space shuttle flights.

    © Copyright Pearson Education. All rights reserved.

    Table of Contents

    1. Copyright
      1. Dedication
    2. Bruce Perens’ Open Source Series
    3. About Prentice Hall Professional Technical Reference
    4. About the Authors
    5. Preface
      1. What Is this Book About?
      2. Who Is This Book For?
      3. Acknowledgments
      4. Other
    6. 1. Best Practices and Initial Investigation
      1. 1.1. Introduction
      2. 1.2. Getting Your System(s) Ready for Effective Problem Determination
      3. 1.3. The Four Phases of Investigation
        1. 1.3.1. Phase #1: Initial Investigation Using Your Own Skills
          1. 1.3.1.1. Did Anything Change Recently?
        2. 1.3.2. Phase #2: Searching the Internet Effectively
          1. 1.3.2.1. Google
          2. 1.3.2.2. USENET
          3. 1.3.2.3. Linux Web Resources
          4. 1.3.2.4. Bugzilla Databases
          5. 1.3.2.5. Mailing Lists
        3. 1.3.3. Phase #3: Begin Deeper Investigation (Good Problem Investigation Practices)
          1. 1.3.3.1. Best Practices for Complex Investigations
            1. 1.3.3.1.1. Collect the Relevant Information When the Problem Occurs
            2. 1.3.3.1.2. Use an Investigation Log
            3. 1.3.3.1.3. Be Detailed (Avoid Qualitative Information)
            4. 1.3.3.1.4. Challenge Assumptions
            5. 1.3.3.1.5. Narrow Down the Scope of the Problem
          2. 1.3.3.2. Create a Reproducible Test Case
          3. 1.3.3.3. Work to Prove and/or Disprove Theories
          4. 1.3.3.4. The Source Code
        4. 1.3.4. Phase #4: Getting Help or New Ideas
          1. 1.3.4.1. Profile of a Linux Guru
          2. 1.3.4.2. Effectively Asking for Help
            1. 1.3.4.2.1. Netiquette
            2. 1.3.4.2.2. Composing an Effective Message
            3. 1.3.4.2.3. Giving Back to the Community
            4. 1.3.4.2.4. USENET
            5. 1.3.4.2.5. Mailing Lists
            6. 1.3.4.2.6. Tips on Opening Bug Reports in Bugzilla
          3. 1.3.4.3. Use Your Distribution’s Support
      4. 1.4. Technical Investigation
        1. 1.4.1. Symptom Versus Cause
          1. 1.4.1.1. Error
          2. 1.4.1.2. Crashes
            1. 1.4.1.2.1. Traps
            2. 1.4.1.2.2. Panics
            3. 1.4.1.2.3. Kernel Crashes
          3. 1.4.1.3. Hangs (or Very Slow Performance)
            1. 1.4.1.3.1. Multi-Process Applications
            2. 1.4.1.3.2. Very Busy Systems
          4. 1.4.1.4. Performance
          5. 1.4.1.5. Unexpected Behavior/Output
      5. 1.5. Troubleshooting Commercial Products
      6. 1.6. Conclusion
    7. 2. strace and System Call Tracing Explained
      1. 2.1. Introduction
      2. 2.2. What Is strace?
        1. 2.2.1. More Information from the Kernel Side
        2. 2.2.2. When To Use It
        3. 2.2.3. Simple Example
        4. 2.2.4. Same Program Built Statically
      3. 2.3. Important strace Options
        1. 2.3.1. Following Child Processes
        2. 2.3.2. Timing System Call Activity
        3. 2.3.3. Verbose Mode
        4. 2.3.4. Tracing a Running Process
      4. 2.4. Effects and Issues of Using strace
        1. 2.4.1. strace and EINTR
      5. 2.5. Real Debugging Examples
        1. 2.5.1. Reducing Start Up Time by Fixing LD_LIBRARY_PATH
        2. 2.5.2. The PATH Environment Variable
        3. 2.5.3. stracing inetd or xinetd (the Super Server)
        4. 2.5.4. Communication Errors
        5. 2.5.5. Investigating a Hang Using strace
        6. 2.5.6. Reverse Engineering (How the strace Tool Itself Works)
      6. 2.6. System Call Tracing Example
        1. 2.6.1. Sample Code
        2. 2.6.2. The System Call Tracing Code Explained
      7. 2.7. Conclusion
    8. 3. The /proc Filesystem
      1. 3.1. Introduction
      2. 3.2. Process Information
        1. 3.2.1. /proc/self
        2. 3.2.2. /proc/<pid> in More Detail
          1. 3.2.2.1. /proc/<pid>/maps
            1. 3.2.2.1.1. Code Segment
            2. 3.2.2.1.2. Data Segment
            3. 3.2.2.1.3. Heap Segment
            4. 3.2.2.1.4. Mapped Base / Shared Libraries
            5. 3.2.2.1.5. Stack Segment
            6. 3.2.2.1.6. The Kernel Segment
            7. 3.2.2.1.7. 64-bit /proc/<pid>/maps Differences
        3. 3.2.3. /proc/<pid>/cmdline
        4. 3.2.4. /proc/<pid>/environ
        5. 3.2.5. /proc/<pid>/mem
        6. 3.2.6. /proc/<pid>/fd
        7. 3.2.7. /proc/<pid>/mapped_base
      3. 3.3. Kernel Information and Manipulation
        1. 3.3.1. /proc/cmdline
        2. 3.3.2. /proc/config.gz or /proc/sys/config.gz
        3. 3.3.3. /proc/cpufreq
        4. 3.3.4. /proc/cpuinfo
        5. 3.3.5. /proc/devices
        6. 3.3.6. /proc/kcore
        7. 3.3.7. /proc/locks
        8. 3.3.8. /proc/meminfo
        9. 3.3.9. /proc/mm
        10. 3.3.10. /proc/modules
        11. 3.3.11. /proc/net
        12. 3.3.12. /proc/partitions
        13. 3.3.13. /proc/pci
        14. 3.3.14. /proc/slabinfo
      4. 3.4. System Information and Manipulation
        1. 3.4.1. /proc/sys/fs
          1. 3.4.1.1. dir-notify-enable
          2. 3.4.1.2. file-nr
          3. 3.4.1.3. file-max
          4. 3.4.1.4. aio-max-nr, aio-max-pinned, aio-max-size, aio-nr, and aio-pinned
          5. 3.4.1.5. overflowgid and overflowuid
        2. 3.4.2. /proc/sys/kernel
          1. 3.4.2.1. core_pattern
          2. 3.4.2.2. msgmax, msgmnb, and msgmni
          3. 3.4.2.3. panic and panic_on_oops
          4. 3.4.2.4. printk
          5. 3.4.2.5. sem
          6. 3.4.2.6. shmall, shmmax, and shmmni
          7. 3.4.2.7. sysrq
            1. 3.4.2.7.1. showPc Output:
            2. 3.4.2.7.2. showMem Output:
            3. 3.4.2.7.3. showTasks Output:
          8. 3.4.2.8. tainted
        3. 3.4.3. /proc/sys/vm
      5. 3.5. Conclusion
    9. 4. Compiling
      1. 4.1. Introduction
      2. 4.2. The GNU Compiler Collection
        1. 4.2.1. A Brief History of GCC
        2. 4.2.2. GCC Version Compatibility
      3. 4.3. Other Compilers
      4. 4.4. Compiling the Linux Kernel
        1. 4.4.1. Obtaining the Kernel Source
        2. 4.4.2. Architecture Specific Source
        3. 4.4.3. Working with Kernel Source Compile Errors
          1. 4.4.3.1. A Real Kernel Compile Error Example
        4. 4.4.4. General Compilation Problems
          1. 4.4.4.1. Environment/Setup Errors or Differences
          2. 4.4.4.2. Compiler Version Differences or Bugs
          3. 4.4.4.3. User Error
          4. 4.4.4.4. Code Error
      5. 4.5. Assembly Listings
        1. 4.5.1. Purpose of Assembly Listings
        2. 4.5.2. Generating Assembly Listings
        3. 4.5.3. Reading and Understanding an Assembly Listing
      6. 4.6. Compiler Optimizations
      7. 4.7. Conclusion
    10. 5. The Stack
      1. 5.1. Introduction
      2. 5.2. A Real-World Analogy
      3. 5.3. Stacks in x86 and x86-64 Architectures
      4. 5.4. What Is a Stack Frame?
      5. 5.5. How Does the Stack Work?
        1. 5.5.1. The BP and SP Registers
          1. 5.5.1.1. Special Case: gcc’s -fomit-frame-pointer Compile Option
        2. 5.5.2. Function Calling Conventions
          1. 5.5.2.1. x86 Architecture
            1. 5.5.2.1.1. Return Value
          2. 5.5.2.2. x86-64 Architecture
            1. 5.5.2.2.1. Return Value
      6. 5.6. Referencing and Modifying Data on the Stack
      7. 5.7. Viewing the Raw Stack in a Debugger
      8. 5.8. Examining the Raw Stack in Detail
        1. 5.8.1. Homegrown Stack Traceback Function
          1. 5.8.1.1. Using GLIBC’s backtrace()
            1. 5.8.1.1.1. The -rdynamic Switch
          2. 5.8.1.2. Manually “Walking the Stack”
            1. 5.8.1.2.1. Modifying for x86-64
          3. 5.8.1.3. Stack Corruption
          4. 5.8.1.4. SIGILL Signals
            1. 5.8.1.4.1. Signals and the Stack
      9. 5.9. Conclusion
    11. 6. The GNU Debugger (GDB)
      1. 6.1. Introduction
      2. 6.2. When To Use a Debugger
      3. 6.3. Command Line Editing
      4. 6.4. Controlling a Process with GDB
        1. 6.4.1. Running a Program Off the Command Line with GDB
        2. 6.4.2. Attaching to a Running Process
        3. 6.4.3. Use a Core File
          1. 6.4.3.1. Changing Core File Name and Location
          2. 6.4.3.2. Saving the State of a GDB Session
      5. 6.5. Examining Data, Memory, and Registers
        1. 6.5.1. Memory Map
        2. 6.5.2. Stack
          1. 6.5.2.1. Navigating Stack Frames
          2. 6.5.2.2. Obtaining and Understanding Frame Information
        3. 6.5.3. Examining Memory and Variables
          1. 6.5.3.1. Variables and Scope and Type
          2. 6.5.3.2. Print Formatting
          3. 6.5.3.3. Determining the Type of Variable
          4. 6.5.3.4. Viewing Data in Memory
          5. 6.5.3.5. Formatting Values in Memory
          6. 6.5.3.6. Changing Variables
        4. 6.5.4. Register Dump
      6. 6.6. Execution
        1. 6.6.1. The Basic Commands
          1. 6.6.1.1. Notes on stepi
        2. 6.6.2. Settings for Execution Control Commands
          1. 6.6.2.1. Step-mode
          2. 6.6.2.2. Following fork Calls
          3. 6.6.2.3. Handling Signals
        3. 6.6.3. Breakpoints
        4. 6.6.4. Watchpoints
        5. 6.6.5. Display Expression on Stop
        6. 6.6.6. Working with Shared Libraries
          1. 6.6.6.1. Debugging Functions in Shared Libraries
      7. 6.7. Source Code
      8. 6.8. Assembly Language
      9. 6.9. Tips and Tricks
        1. 6.9.1. Attaching to a Process—Revisited
          1. 6.9.1.1. The pause() method
          2. 6.9.1.2. The “for” or “while” Loop Method
          3. 6.9.1.3. The xterm method
        2. 6.9.2. Finding the Address of Variables and Functions
        3. 6.9.3. Viewing Structures in Executables without Debug Symbols
        4. 6.9.4. Understanding and Dealing with Endian-ness
      10. 6.10. Working with C++
        1. 6.10.1. Global Constructors and Destructors
        2. 6.10.2. Inline Functions
        3. 6.10.3. Exceptions
      11. 6.11. Threads
        1. 6.11.1. Running Out of Stack Space
      12. 6.12. Data Display Debugger (DDD)
        1. 6.12.1. The Data Display Window
          1. 6.12.1.1. Viewing the Raw Stack
          2. 6.12.1.2. View Complex Data Structures
        2. 6.12.2. Source Code Window
        3. 6.12.3. Machine Language Window
        4. 6.12.4. GDB Console Window
      13. 6.13. Conclusion
    12. 7. Linux System Crashes and Hangs
      1. 7.1. Introduction
      2. 7.2. Gathering Information
        1. 7.2.1. Syslog Explained
        2. 7.2.2. Setting up a Serial Console
        3. 7.2.3. Connecting the Serial Null-Modem Cable
        4. 7.2.4. Enabling the Serial Console at Startup
        5. 7.2.5. Using SysRq Kernel Magic
        6. 7.2.6. Oops Reports
        7. 7.2.7. Adding a Manual Kernel Trap
        8. 7.2.8. Examining an Oops Report
          1. 7.2.8.1. 2.6.x Kernel Oops Dumps
        9. 7.2.9. Determining the Failing Line of Code
          1. 7.2.9.1. 2.4.x Kernel Oops Dumps
        10. 7.2.10. Kernel Oopses and Hardware
        11. 7.2.11. Setting up cscope to Index Kernel Sources
      3. 7.3. Conclusion
    13. 8. Kernel Debugging with KDB
      1. 8.1. Introduction
      2. 8.2. Enabling KDB
      3. 8.3. Using KDB
        1. 8.3.1. Activating KDB
        2. 8.3.2. Resuming Normal Execution
        3. 8.3.3. Basic Commands
      4. 8.4. Conclusion
    14. 9. ELF: Executable and Linking Format
      1. 9.1. Introduction
      2. 9.2. Concepts and Definitions
        1. 9.2.1. Symbol
          1. 9.2.1.1. Symbol Names and C Versus C++
        2. 9.2.2. Object Files, Shared Libraries, Executables, and Core Files
          1. 9.2.2.1. Object Files
          2. 9.2.2.2. Shared Libraries
          3. 9.2.2.3. Executables
          4. 9.2.2.4. Core Files
          5. 9.2.2.5. Static Libraries
        3. 9.2.3. Linking
          1. 9.2.3.1. Linking with Static Libraries
        4. 9.2.4. Run Time Linking
        5. 9.2.5. Program Interpreter / Run Time Linker
      3. 9.3. ELF Header
      4. 9.4. Overview of Segments and Sections
      5. 9.5. Segments and the Program Header Table
        1. 9.5.1. Text and Data Segments
      6. 9.6. Sections and the Section Header Table
        1. 9.6.1. String Table Format
        2. 9.6.2. Symbol Table Format
        3. 9.6.3. Section Names and Types
          1. 9.6.3.1. .bss
          2. 9.6.3.2. .data
          3. 9.6.3.3. .dynamic
          4. 9.6.3.4. .dynsym (symbol table)
          5. 9.6.3.5. .dynstr (string table)
          6. 9.6.3.6. .fini
          7. 9.6.3.7. .got (Global Offset Table)
          8. 9.6.3.8. .hash
          9. 9.6.3.9. .init
          10. 9.6.3.10. .interp
          11. 9.6.3.11. .plt (Procedure Linkage Table)
          12. 9.6.3.12. .rodata
          13. 9.6.3.13. .shstrtab
          14. 9.6.3.14. .strtab (string table)
          15. 9.6.3.15. .symtab (symbol table)
          16. 9.6.3.16. .text
          17. 9.6.3.17. .rel
      7. 9.7. Relocation and Position Independent Code (PIC)
        1. 9.7.1. PIC vs. non-PIC
        2. 9.7.2. Relocation and Position Independent Code
        3. 9.7.3. Relocation and Linking
      8. 9.8. Stripping an ELF Object
      9. 9.9. Program Interpreter
        1. 9.9.1. Link Map
      10. 9.10. Symbol Resolution
      11. 9.11. Use of Weak Symbols for Problem Investigations
      12. 9.12. Advanced Interception Using Global Offset Table
      13. 9.13. Source Files
      14. 9.14. ELF APIs
      15. 9.15. Other Information
      16. 9.16. Conclusion
    15. A. The Toolbox
      1. A.1. Introduction
      2. A.2. Process Information and Debugging
        1. A.2.1. Tool: GDB
        2. A.2.2. Tool: ps
        3. A.2.3. Tool: strace (system call tracer)
        4. A.2.4. Tool: /proc filesystem
        5. A.2.5. Tool: DDD (Data Display Debugger)
        6. A.2.6. Tool: lsof (List Open Files)
        7. A.2.7. Tool: ltrace (library call tracer)
        8. A.2.8. Tool: time
        9. A.2.9. Tool: top
        10. A.2.10. Tool: pstree
      3. A.3. Network
        1. A.3.1. Tool: traceroute
        2. A.3.2. File: /etc/hosts
        3. A.3.3. File: /etc/services
        4. A.3.4. Tool: netstat
        5. A.3.5. Tool: ping
        6. A.3.6. Tool: telnet
        7. A.3.7. Tool: host/nslookup
        8. A.3.8. Tool: ethtool
        9. A.3.9. Tool: ethereal
        10. A.3.10. File: /etc/nsswitch.conf
        11. A.3.11. File: /etc/resolv.conf
      4. A.4. System Information
        1. A.4.1. Tool: vmstat
        2. A.4.2. Tool: iostat
        3. A.4.3. Tool: nfsstat
        4. A.4.4. Tool: sar
        5. A.4.5. Tool: syslogd
        6. A.4.6. Tool: dmesg
        7. A.4.7. Tool: mpstat
        8. A.4.8. Tool: procinfo
        9. A.4.9. Tool: xosview
      5. A.5. Files and Object Files
        1. A.5.1. Tool: file
        2. A.5.2. Tool: ldd
        3. A.5.3. Tool: nm
        4. A.5.4. Tool: objdump
        5. A.5.5. Tool: od
        6. A.5.6. Tool: stat
        7. A.5.7. Tool: readelf
        8. A.5.8. Tool: strings
      6. A.6. Kernel
        1. A.6.1. Tool: KDB
        2. A.6.2. Tool: KGDB
        3. A.6.3. Tool: ksymoops
      7. A.7. Miscellaneous
        1. A.7.1. Tool: VMware Workstation
        2. A.7.2. Tool: VNC Server
        3. A.7.3. Tool: VNC Viewer
    16. B. Data Collection Script
      1. B.1. Overview
        1. B.1.1. -thorough
        2. B.1.2. -perf, -hang <pid>, -trap, -error <cmd>
      2. B.2. Running the Script
      3. B.3. The Script Source
      4. B.4. Disclaimer