Red Hat Enterprise Linux Troubleshooting Guide

Book Description

Identify, capture and resolve common issues faced by Red Hat Enterprise Linux administrators using best practices and advanced troubleshooting techniques

About This Book

  • Develop a strong understanding of the base tools available within Red Hat Enterprise Linux (RHEL) and how to utilize these tools to troubleshoot and resolve real-world issues

  • Pick up lesser-known tips and techniques to help you quickly detect the cause of poor network or storage performance

  • Troubleshoot your RHEL systems and isolate problems using this example-oriented guide full of real-world solutions

Who This Book Is For

    If you have a basic knowledge of Linux from administration or consultant experience and wish to add to your Red Hat Enterprise Linux troubleshooting skills, then this book is ideal for you. The ability to navigate and use basic Linux commands is expected.

    What You Will Learn

  • Distinguish issues that need rapid resolution from those that warrant long-term root cause analysis

  • Discover commands for testing network connectivity such as telnet, netstat, ping, ip and curl

  • Spot performance issues with commands such as top, ps, free, iostat, and vmstat

  • Use tcpdump for traffic analysis

  • Repair a degraded file system and rebuild a software RAID

  • Identify and troubleshoot hardware issues using dmesg

  • Troubleshoot custom applications with strace and knowledge of Linux resource limitations

In Detail

    Red Hat Enterprise Linux is an operating system that allows you to modernize your infrastructure, boost efficiency through virtualization, and finally prepare your data center for an open, hybrid cloud IT architecture. It provides the stability to take on today's challenges and the flexibility to adapt to tomorrow's demands.

    In this book, you begin with simple troubleshooting best practices and get an overview of the Linux commands used for troubleshooting. The book then covers troubleshooting methods for web applications and services such as Apache and MySQL. Next, you will learn to identify system performance bottlenecks and troubleshoot network issues, all while learning vital troubleshooting steps such as understanding the problem statement, establishing a hypothesis, working through trial and error, and documentation. The book also shows you how to capture and analyze network traffic, use advanced system troubleshooting tools such as strace, tcpdump, and dmesg, and discover common issues with system defaults.

    Finally, the book will take you through a detailed root cause analysis of an unexpected reboot where you will learn to recover a downed system.

    Style and approach

    This is an easy-to-follow guide packed with examples of real-world core Linux concepts. All the topics are presented in detail while you're performing the actual troubleshooting steps.

    Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your Packt account. If you purchased this book elsewhere, you can register with the publisher to obtain the code files.

    Table of Contents

    1. Red Hat Enterprise Linux Troubleshooting Guide
      1. Table of Contents
      2. Red Hat Enterprise Linux Troubleshooting Guide
      3. Credits
      4. About the Author
      5. About the Reviewers
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Troubleshooting Best Practices
        1. Styles of troubleshooting
          1. The Data Collector
          2. The Educated Guesser
          3. The Adaptor
          4. Choosing the appropriate style
        2. Troubleshooting steps
          1. Understanding the problem statement
            1. Asking questions
              1. Tickets
              2. Humans
            2. Attempting to duplicate the issue
            3. Running investigatory commands
          2. Establishing a hypothesis
            1. Putting together patterns
            2. Is this something that I've encountered before?
          3. Trial and error
            1. Start by creating a backup
          4. Getting help
            1. Books
            2. Team Wikis or Runbooks
            3. Google
            4. Man pages
              1. Reading a man page
                1. Name
                2. Synopsis
                3. Description
                4. Examples
                5. Additional sections
              2. Info documentation
              3. Referencing more than commands
              4. Installing man pages
            5. Red Hat kernel docs
            6. People
              1. Following up
          5. Documentation
        3. Root cause analysis
          1. The anatomy of a good RCA
            1. The problem as it was reported
            2. The actual root cause of the problem
          2. A timeline of events and actions taken
            1. Any key data points to validate the root cause
          3. A plan of action to prevent the incident from reoccurring
          4. Establishing a root cause
            1. Sometimes you must sacrifice a root cause analysis
        4. Understanding your environment
        5. Summary
      9. 2. Troubleshooting Commands and Sources of Useful Information
        1. Finding useful information
          1. Log files
            1. The default location
            2. Common log files
            3. Finding logs that are not in the default location
              1. Checking syslog configuration
              2. Checking the application's configuration
                1. Other examples
              3. Using the find command
          2. Configuration files
            1. Default system configuration directory
            2. Finding configuration files
              1. Using the rpm command
              2. Using the find command
          3. The proc filesystem
        2. Troubleshooting commands
          1. Command-line basics
            1. Command flags
            2. The piping command output
          2. Gathering general information
            1. w – show who is logged on and what they are doing
            2. rpm – RPM package manager
              1. Listing all packages installed
              2. Listing all files deployed by a package
              3. Using package verification
            3. df – report file system space usage
              1. Showing available inodes
            4. free – display memory utilization
              1. What is free, is not always free
              2. The /proc/meminfo file
            5. ps – report a snapshot of current running processes
              1. Printing every process in long format
              2. Printing a specific user's processes
              3. Printing a process by process ID
              4. Printing processes with performance information
          3. Networking
            1. ip – show and manipulate network settings
              1. Show IP address configuration for a specific device
              2. Show routing configuration
              3. Show network statistics for a specified device
            2. netstat – network statistics
              1. Printing network connections
              2. Printing all ports listening for tcp connections
              3. Delay
          4. Performance
            1. iotop – a simple top-like I/O monitor
            2. iostat – report I/O and CPU statistics
              1. Manipulating the output
            3. vmstat – report virtual memory statistics
            4. sar – collect, report, or save system activity information
              1. Using the sar command
        3. Summary
      10. 3. Troubleshooting a Web Application
        1. A small back story
        2. The reported issue
        3. Data gathering
          1. Asking questions
          2. Duplicating the issue
          3. Understanding the environment
            1. Where is this blog hosted?
              1. Lookup IPs with nslookup
              2. What about ping, dig, or other tools?
            2. Ok, it's within our environment; now what?
            3. What services are installed and running?
              1. Validate the web server
              2. Validating the database service
              3. Validating PHP
                1. A summary of installed and running services
          4. Looking for error messages
            1. Apache logs
              1. Finding the location of Apache's logs
              2. Reviewing the logs
                1. Using curl to call our web application
                2. Requesting a non-PHP page
                3. Reviewing generated log entries
              3. What we learned from httpd logs
          5. Verifying the database
            1. Verifying the WordPress database
              1. Finding the installation path for WordPress
                1. Checking the default configuration
              2. Finding the database credentials
                1. Connecting as the WordPress user
                2. Validating the database structure
              3. What we learned from the database validation
        4. Establishing a hypothesis
        5. Resolving the issue
          1. Understanding database data files
          2. Finding the MariaDB data folder
          3. Resolving data file issues
            1. Validating
          4. Final validation
        6. Summary
      11. 4. Troubleshooting Performance Issues
        1. Performance issues
          1. It's slow
        2. Performance
          1. Application
          2. CPU
            1. Top – a single command to look at everything
              1. What does this output tell us about our issue?
              2. Individual processes from top
            2. Determining the number of CPUs available
              1. Threads and Cores
              2. lscpu – Another way to look at CPU info
            3. ps – Drill down deeper on individual processes with ps
              1. Using ps to determine process CPU utilization
          3. Putting it all together
            1. A quick look with top
              1. Digging deeper with ps
          4. Memory
            1. free – Looking at free and used memory
              1. Linux memory buffers and caches
              2. Swapped memory
              3. What free tells us about our system
            2. Checking for oomkill
            3. ps – Checking individual processes' memory utilization
            4. vmstat – Monitoring memory allocation and swapping
            5. Putting it all together
              1. Taking a look at the system's memory utilization with free
              2. Watch what is happening with vmstat
              3. Finding the processes that utilize the most memory with ps
          5. Disk
            1. iostat – CPU and device input/output statistics
              1. CPU details
              2. Reviewing I/O statistics
              3. Identifying devices
            2. Who is writing to these devices?
              1. ps – Using ps to identify processes utilizing I/O
            3. iotop – A top-like command for disk I/O
            4. Putting it all together
              1. Using iostat to determine whether there is an I/O bandwidth problem
              2. Using iotop to determine which processes are consuming disk bandwidth
              3. Using ps to understand more about processes
          6. Network
            1. ifstat – Review interface statistics
          7. Quick review of what we have identified
        3. Comparing historical metrics
          1. sar – System activity report
            1. CPU
            2. Memory
            3. Disk
            4. Network
          2. Review what we learned by comparing historical statistics
        4. Summary
      12. 5. Network Troubleshooting
        1. Database connectivity issues
        2. Data collection
          1. Duplicating the issue
          2. Finding the database server
          3. Testing connectivity
            1. Telnet from
            2. Telnet from our laptop
          4. Ping
          5. Troubleshooting DNS
            1. Checking DNS with dig
            2. Looking up DNS with nslookup
            3. What did dig and nslookup tell us?
              1. A bit about /etc/hosts
            4. DNS summary
          6. Pinging from another location
          7. Testing port connectivity with cURL
          8. Showing current network connections with netstat
            1. Using netstat to watch for new connections
            2. Breakdown of netstat states
          9. Capturing network traffic with tcpdump
            1. Taking a look at the server's network interfaces
              1. What is a network interface?
              2. Viewing device configuration
            2. Specifying the interface with tcpdump
            3. Reading the captured data
            4. A quick primer on TCP
              1. Types of TCP packet
          10. Reviewing collected data
          11. Taking a look on the other side
            1. Identifying the network configuration
            2. Testing connectivity from
            3. Looking for connections with netstat
            4. Tracing network connections with tcpdump
          12. Routing
            1. Viewing the routing table
              1. The default route
            2. Utilizing IP to show the routing table
            3. Looking for routing misconfigurations
              1. More specific routes win
        3. Hypothesis
        4. Trial and error
          1. Removing the invalid route
          2. Configuration files
        5. Summary
      13. 6. Diagnosing and Correcting Firewall Issues
        1. Diagnosing firewalls
        2. Déjà vu
        3. Troubleshooting from historic issues
        4. Basic troubleshooting
          1. Validating the MariaDB service
          2. Troubleshooting with tcpdump
          3. Understanding ICMP
            1. Understanding connection rejections
        5. A quick summary of what you have learned so far
        6. Managing the Linux firewall with iptables
          1. Verify that iptables is running
          2. Show iptables rules being enforced
          3. Understanding iptables rules
            1. Ordering matters
            2. Default policies
            3. Breaking down the iptables rules
            4. Putting the rules together
            5. Viewing iptables counters
            6. Correcting the iptables rule ordering
              1. How iptables rules are applied
              2. Modifying iptables rules
              3. Testing our changes
        7. Summary
      14. 7. Filesystem Errors and Recovery
        1. Diagnosing filesystem errors
          1. Read-only filesystems
          2. Using the mount command to list mounted filesystems
            1. A mounted filesystem
            2. Using fdisk to list available partitions
            3. Back to troubleshooting
        2. NFS – Network Filesystem
          1. NFS and network connectivity
          2. Using the showmount command
          3. NFS server configuration
            1. Exploring /etc/exports
            2. Identifying the current exports
            3. Testing NFS from another client
        3. Making mounts permanent
          1. Unmounting the /mnt filesystem
        4. Troubleshooting the NFS server, again
          1. Finding the NFS log messages
          2. Reading /var/log/messages
          3. Read-only filesystems
            1. Identifying disk issues
        5. Recovering the filesystem
          1. Unmounting the filesystem
          2. Filesystem checks with fsck
            1. The fsck and xfs filesystems
            2. How do these tools repair a filesystem?
          3. Mounting the filesystem
          4. Repairing the other filesystems
            1. Recovering the / (root) filesystem
        6. Validation
        7. Summary
      15. 8. Hardware Troubleshooting
        1. Starting with a log entry
        2. What is a RAID?
          1. RAID 0 – striping
          2. RAID 1 – mirroring
          3. RAID 5 – striping with distributed parity
          4. RAID 6 – striping with double distributed parity
          5. RAID 10 – mirrored and striped
        3. Back to troubleshooting our RAID
          1. How RAID recovery works
          2. Checking the current RAID status
            1. Summarizing the key information
          3. Looking at md status with /proc/mdstat
            1. Using both /proc/mdstat and mdadm
        4. Identifying a bigger issue
        5. Understanding /dev
          1. More than just disk drives
        6. Device messages with dmesg
          1. Summarizing what dmesg has provided
        7. Using mdadm to examine the superblock
          1. Checking /dev/sdb2
        8. What we have learned so far
        9. Re-adding the drives to the arrays
          1. Adding a new disk device
          2. When disks are not added cleanly
          3. Another way to watch the rebuild status
        10. Summary
      16. 9. Using System Tools to Troubleshoot Applications
        1. Open source versus home-grown applications
        2. When the application won't start
          1. Exit codes
          2. Is the script failing, or the application?
          3. A wealth of information in the configuration file
            1. Watching log files during startup
        3. Checking whether the application is already running
          1. Checking open files
            1. Understanding file descriptors
          2. Getting back to the lsof output
          3. Using lsof to check whether we have a previously running process
        4. Finding out more about the application
          1. Tracing an application with strace
            1. What is a system call?
          2. Using strace to identify why the application will not start
        5. Resolving the conflict
        6. Summary
      17. 10. Understanding Linux User and Kernel Limits
        1. A reported issue
        2. Why is the job failing?
          1. Background questions
          2. Is the cron job even running?
          3. User crontabs
        3. Understanding user limits
          1. The file size limit
          2. The max user processes limit
          3. The open files limit
        4. Changing user limits
          1. The limits.conf file
            1. Future proofing the scheduled job
          2. Running the job again
        5. Kernel tunables
          1. Finding the kernel parameter for open files
          2. Changing kernel tunables
            1. Permanently changing a tunable
            2. Temporarily changing a tunable
          3. Running the job one last time
        6. A look back
          1. Too many open files
          2. A bit of clean up
        7. Summary
      18. 11. Recovering from Common Failures
        1. The reported problem
          1. Is Apache really down?
          2. Why is it down?
          3. What else was happening at that time?
            1. Searching the messages log
            2. Breaking down this useful one-liner
              1. The cut command
              2. The sort command
            3. The uniq command
            4. Tying it all together
          4. What happens when a Linux system runs out of memory?
            1. Minimum free memory
              1. A quick recap
            2. How oom-kill works
              1. Adjusting the oom score
          5. Determining whether our process was killed by oom-kill
          6. Why did the system run out of memory?
        2. Resolving the issue in the long-term and short-term
          1. Long-term resolution
          2. Short-term resolution
        3. Summary
      19. 12. Root Cause Analysis of an Unexpected Reboot
        1. A late night alert
        2. Identifying the issue
          1. Did someone reboot this server?
          2. What do the logs tell us?
          3. Learning about new processes and services
        3. What caused the high load average?
          1. What are the run queue and load average?
            1. Load average
        4. Investigating the filesystem being full
          1. The du command
          2. Why wasn't the queue directory processed?
          3. A checkpoint on what you learned
            1. Sometimes you cannot prove everything
        5. Preventing reoccurrence
          1. Immediate action
          2. Long-term actions
        6. A sample Root Cause Analysis
          1. Problem summary
          2. Problem details
          3. Root cause
          4. Action plan
            1. Further actions to be taken
        7. Summary
      20. Index