Cover image for Understanding the Linux Kernel, Second Edition

Book description

To thoroughly understand what makes Linux tick and why it's so efficient, you need to delve deep into the heart of the operating system--into the Linux kernel itself. The kernel is Linux--in the case of the Linux operating system, it's the only bit of software to which the term "Linux" applies. The kernel handles all the requests or completed I/O operations and determines which programs will share its processing time, and in what order. Responsible for the sophisticated memory management of the whole system, the Linux kernel is the force behind the legendary Linux efficiency. The new edition of Understanding the Linux Kernel takes you on a guided tour through the most significant data structures, many algorithms, and programming tricks used in the kernel. Probing beyond the superficial features, the authors offer valuable insights to people who want to know how things really work inside their machine. Relevant segments of code are dissected and discussed line by line. The book covers more than just the functioning of the code, it explains the theoretical underpinnings for why Linux does things the way it does. The new edition of the book has been updated to cover version 2.4 of the kernel, which is quite different from version 2.2: the virtual memory system is entirely new, support for multiprocessor systems is improved, and whole new classes of hardware devices have been added. The authors explore each new feature in detail. Other topics in the book include:

  • Memory management including file buffering, process swapping, and Direct memory Access (DMA)

  • The Virtual Filesystem and the Second Extended Filesystem

  • Process creation and scheduling

  • Signals, interrupts, and the essential interfaces to device drivers

  • Timing

  • Synchronization in the kernel

  • Interprocess Communication (IPC)

  • Program execution

Understanding the Linux Kernel, Second Edition will acquaint you with all the inner workings of Linux, but is more than just an academic exercise. You'll learn what conditions bring out Linux's best performance, and you'll see how it meets the challenge of providing good system response during process scheduling, file access, and memory management in a wide variety of environments. If knowledge is power, then this book will help you make the most of your Linux system.

Table of Contents

  1. Understanding the Linux Kernel, 2nd Edition
    1. Preface
      1. The Audience for This Book
      2. Organization of the Material
        1. Level of Description
      3. Overview of the Book
      4. Background Information
      5. Conventions in This Book
      6. How to Contact Us
      7. Acknowledgments
    2. 1. Introduction
      1. Linux Versus Other Unix-Like Kernels
      2. Hardware Dependency
      3. Linux Versions
      4. Basic Operating System Concepts
        1. Multiuser Systems
        2. Users and Groups
        3. Processes
        4. Kernel Architecture
      5. An Overview of the Unix Filesystem
        1. Files
        2. Hard and Soft Links
        3. File Types
        4. File Descriptor and Inode
        5. Access Rights and File Mode
        6. File-Handling System Calls
          1. Opening a file
          2. Accessing an opened file
          3. Closing a file
          4. Renaming and deleting a file
      6. An Overview of Unix Kernels
        1. The Process/Kernel Model
        2. Process Implementation
        3. Reentrant Kernels
        4. Process Address Space
        5. Synchronization and Critical Regions
          1. Nonpreemptive kernels
          2. Interrupt disabling
          3. Semaphores
          4. Spin locks
          5. Avoiding deadlocks
        6. Signals and Interprocess Communication
        7. Process Management
          1. Zombie processes
          2. Process groups and login sessions
        8. Memory Management
          1. Virtual memory
          2. Random access memory usage
          3. Kernel Memory Allocator
          4. Process virtual address space handling
          5. Swapping and caching
        9. Device Drivers
    3. 2. Memory Addressing
      1. Memory Addresses
      2. Segmentation in Hardware
        1. Segmentation Registers
        2. Segment Descriptors
        3. Fast Access to Segment Descriptors
        4. Segmentation Unit
      3. Segmentation in Linux
      4. Paging in Hardware
        1. Regular Paging
        2. Extended Paging
        3. Hardware Protection Scheme
        4. An Example of Regular Paging
        5. Three-Level Paging
        6. The Physical Address Extension (PAE) Paging Mechanism
        7. Hardware Cache
        8. Translation Lookaside Buffers (TLB)
      5. Paging in Linux
        1. The Linear Address Fields
        2. Page Table Handling
        3. Reserved Page Frames
        4. Process Page Tables
        5. Kernel Page Tables
          1. Provisional kernel Page Tables
          2. Final kernel Page Table when RAM size is less than 896 MB
          3. Final kernel Page Table when RAM size is between 896 MB and 4096 MB
          4. Final kernel Page Table when RAM size is more than 4096 MB
        6. Fix-Mapped Linear Addresses
        7. Handling the Hardware Cache and the TLB
          1. Handling the hardware cache
          2. Handling the TLB
    4. 3. Processes
      1. Processes, Lightweight Processes, and Threads
      2. Process Descriptor
        1. Process State
        2. Identifying a Process
          1. Processor descriptors handling
          2. The current macro
          3. The process list
          4. Doubly linked lists
          5. The list of TASK_RUNNING processes
          6. The pidhash table and chained lists
        3. Parenthood Relationships Among Processes
        4. How Processes Are Organized
          1. Wait queues
          2. Handling wait queues
        5. Process Resource Limits
      3. Process Switch
        1. Hardware Context
        2. Task State Segment
          1. The thread field
        3. Performing the Process Switch
        4. Saving the FPU, MMX, and XMM Registers
      4. Creating Processes
        1. The clone( ), fork( ), and vfork( ) System Calls
        2. Kernel Threads
          1. Creating a kernel thread
          2. Process 0
          3. Process 1
          4. Other kernel threads
      5. Destroying Processes
        1. Process Termination
        2. Process Removal
    5. 4. Interrupts and Exceptions
      1. The Role of Interrupt Signals
      2. Interrupts and Exceptions
        1. IRQs and Interrupts
          1. The Advanced Programmable Interrupt Controller (APIC)
        2. Exceptions
        3. Interrupt Descriptor Table
        4. Hardware Handling of Interrupts and Exceptions
      3. Nested Execution of Exception and Interrupt Handlers
      4. Initializing the Interrupt Descriptor Table
        1. Interrupt, Trap, and System Gates
        2. Preliminary Initialization of the IDT
      5. Exception Handling
        1. Saving the Registers for the Exception Handler
        2. Entering and Leaving the Exception Handler
      6. Interrupt Handling
        1. I/O Interrupt Handling
          1. Interrupt vectors
          2. IRQ data structures
          3. IRQ distribution in multiprocessor systems
          4. Saving the registers for the interrupt handler
          5. The do_IRQ( ) function
          6. Reviving a lost interrupt
          7. Interrupt service routines
          8. Dynamic allocation of IRQ lines
        2. Interprocessor Interrupt Handling
      7. Softirqs, Tasklets, and Bottom Halves
        1. Softirqs
          1. The softirq kernel threads
        2. Tasklets
        3. Bottom Halves
          1. Extending a bottom half
      8. Returning from Interrupts and Exceptions
        1. The ret_ from_exception( ) Function
        2. The ret_ from_intr( ) Function
        3. The ret_ from_sys_call( ) Function
        4. The ret_ from_ fork( ) Function
    6. 5. Kernel Synchronization
      1. Kernel Control Paths
      2. When Synchronization Is Not Necessary
      3. Synchronization Primitives
        1. Atomic Operations
        2. Memory Barriers
        3. Spin Locks
        4. Read/Write Spin Locks
          1. Getting and releasing a lock for reading
          2. Getting and releasing a lock for writing
        5. The Big Reader Lock
        6. Semaphores
          1. Getting and releasing semaphores
        7. Read/Write Semaphores
        8. Completions
        9. Local Interrupt Disabling
        10. Global Interrupt Disabling
        11. Disabling Deferrable Functions
      4. Synchronizing Accesses to Kernel Data Structures
        1. Choosing Among Spin Locks, Semaphores, and Interrupt Disabling
          1. Protecting a data structure accessed by exceptions
          2. Protecting a data structure accessed by interrupts
          3. Protecting a data structure accessed by deferrable functions
          4. Protecting a data structure accessed by exceptions and interrupts
          5. Protecting a data structure accessed by exceptions and deferrable functions
          6. Protecting a data structure accessed by interrupts and deferrable functions
          7. Protecting a data structure accessed by exceptions, interrupts, and deferrable functions
      5. Examples of Race Condition Prevention
        1. Reference Counters
        2. The Global Kernel Lock
        3. Memory Descriptor Read/Write Semaphore
        4. Slab Cache List Semaphore
        5. Inode Semaphore
    7. 6. Timing Measurements
      1. Hardware Clocks
        1. Real Time Clock
        2. Time Stamp Counter
        3. Programmable Interval Timer
        4. CPU Local Timers
      2. The Linux Timekeeping Architecture
        1. Timekeeping Architecture in Uniprocessor Systems
          1. PIT’s interrupt service routine
          2. The TIMER_BH bottom half
        2. Timekeeping Architecture in Multiprocessor Systems
          1. Initialization of the timekeeping architecture
          2. The local timer interrupt handler
      3. CPU’s Time Sharing
      4. Updating the Time and Date
      5. Updating System Statistics
        1. Checking the Current Process CPU Resource Limit
        2. Keeping Track of System Load
        3. Profiling the Kernel Code
        4. Checking the NMI Watchdogs
      6. Software Timers
        1. Dynamic Timers
          1. Dynamic timers and race conditions
          2. Dynamic timers handling
        2. An Application of Dynamic Timers
      7. System Calls Related to Timing Measurements
        1. The time( ), ftime( ), and gettimeofday( ) System Calls
        2. The adjtimex( ) System Call
        3. The setitimer( ) and alarm( ) System Calls
    8. 7. Memory Management
      1. Page Frame Management
        1. Page Descriptors
        2. Memory Zones
        3. Non-Uniform Memory Access (NUMA)
        4. Initialization of the Memory Handling Data Structures
        5. Requesting and Releasing Page Frames
        6. Kernel Mappings of High-Memory Page Frames
          1. Permanent kernel mappings
          2. Temporary kernel mappings
        7. The Buddy System Algorithm
          1. Data structures
          2. Allocating a block
          3. Freeing a block
      2. Memory Area Management
        1. The Slab Allocator
        2. Cache Descriptor
        3. Slab Descriptor
        4. General and Specific Caches
        5. Interfacing the Slab Allocator with the Buddy System
        6. Allocating a Slab to a Cache
        7. Releasing a Slab from a Cache
        8. Object Descriptor
        9. Aligning Objects in Memory
        10. Slab Coloring
        11. Local Array of Objects in Multiprocessor Systems
        12. Allocating an Object in a Cache
          1. The uniprocessor case
          2. The multiprocessor case
        13. Releasing an Object from a Cache
          1. The uniprocessor case
          2. The multiprocessor case
        14. General Purpose Objects
      3. Noncontiguous Memory Area Management
        1. Linear Addresses of Noncontiguous Memory Areas
        2. Descriptors of Noncontiguous Memory Areas
        3. Allocating a Noncontiguous Memory Area
        4. Releasing a Noncontiguous Memory Area
    9. 8. Process Address Space
      1. The Process’s Address Space
      2. The Memory Descriptor
        1. Memory Descriptor of Kernel Threads
      3. Memory Regions
        1. Memory Region Data Structures
        2. Memory Region Access Rights
        3. Memory Region Handling
          1. Finding the closest region to a given address: find_vma( )
          2. Finding a region that overlaps a given interval: find_vma_intersection( )
          3. Finding a free interval: arch_get_unmapped_area( )
          4. Inserting a region in the memory descriptor list: insert_vm_struct( )
        4. Allocating a Linear Address Interval
        5. Releasing a Linear Address Interval
          1. First phase: scanning the memory regions
          2. Second phase: updating the Page Tables
      4. Page Fault Exception Handler
        1. Handling a Faulty Address Outside the Address Space
        2. Handling a Faulty Address Inside the Address Space
        3. Demand Paging
        4. Copy On Write
        5. Handling Noncontiguous Memory Area Accesses
      5. Creating and Deleting a Process Address Space
        1. Creating a Process Address Space
        2. Deleting a Process Address Space
      6. Managing the Heap
    10. 9. System Calls
      1. POSIX APIs and System Calls
      2. System Call Handler and Service Routines
        1. Initializing System Calls
        2. The system_call( ) Function
        3. Parameter Passing
        4. Verifying the Parameters
        5. Accessing the Process Address Space
        6. Dynamic Address Checking: The Fixup Code
          1. The exception tables
          2. Generating the exception tables and the fixup code
      3. Kernel Wrapper Routines
    11. 10. Signals
      1. The Role of Signals
        1. Actions Performed upon Delivering a Signal
        2. Data Structures Associated with Signals
        3. Operations on Signal Data Structures
      2. Generating a Signal
        1. The send_sig_info( ) and send_sig( ) Functions
        2. The force_sig_info( ) and force_sig( ) Functions
      3. Delivering a Signal
        1. Ignoring the Signal
        2. Executing the Default Action for the Signal
        3. Catching the Signal
          1. Setting up the frame
          2. Evaluating the signal flags
          3. Starting the signal handler
          4. Terminating the signal handler
        4. Reexecution of System Calls
      4. System Calls Related to Signal Handling
        1. The kill( ) System Call
        2. Changing a Signal Action
        3. Examining the Pending Blocked Signals
        4. Modifying the Set of Blocked Signals
        5. Suspending the Process
        6. System Calls for Real-Time Signals
    12. 11. Process Scheduling
      1. Scheduling Policy
        1. Process Preemption
        2. How Long Must a Quantum Last?
      2. The Scheduling Algorithm
        1. Data Structures Used by the Scheduler
          1. Process descriptor
          2. CPU’s data structures
        2. The schedule( ) Function
          1. Direct invocation
          2. Lazy invocation
          3. Actions performed by schedule( ) before a process switch
          4. Actions performed by schedule( ) after a process switch
          5. How good is a runnable process?
          6. Scheduling on multiprocessor systems
        3. Performance of the Scheduling Algorithm
          1. The algorithm does not scale well
          2. The predefined quantum is too large for high system loads
          3. I/O-bound process boosting strategy is not optimal
          4. Support for real-time applications is weak
      3. System Calls Related to Scheduling
        1. The nice( ) System Call
        2. The getpriority( ) and setpriority( ) System Calls
        3. System Calls Related to Real-Time Processes
          1. The sched_getscheduler( ) and sched_setscheduler( ) system calls
          2. The sched_ getparam( ) and sched_setparam( ) system calls
          3. The sched_ yield( ) system call
          4. The sched_ get_priority_min( ) and sched_ get_priority_max( ) system calls
          5. The sched_rr_ get_interval( ) system call
    13. 12. The Virtual Filesystem
      1. The Role of the Virtual Filesystem (VFS)
        1. The Common File Model
        2. System Calls Handled by the VFS
      2. VFS Data Structures
        1. Superblock Objects
        2. Inode Objects
        3. File Objects
        4. dentry Objects
        5. The dentry Cache
        6. Files Associated with a Process
      3. Filesystem Types
        1. Special Filesystems
        2. Filesystem Type Registration
      4. Filesystem Mounting
        1. Mounting the Root Filesystem
        2. Mounting a Generic Filesystem
        3. Unmounting a Filesystem
      5. Pathname Lookup
        1. Standard Pathname Lookup
        2. Parent Pathname Lookup
        3. Lookup of Symbolic Links
      6. Implementations of VFS System Calls
        1. The open( ) System Call
        2. The read( ) and write( ) System Calls
        3. The close( ) System Call
      7. File Locking
        1. Linux File Locking
        2. File-Locking Data Structures
        3. FL_FLOCK Locks
        4. FL_POSIX Locks
    14. 13. Managing I/O Devices
      1. I/O Architecture
        1. I/O Ports
          1. Accessing I/O ports
        2. I/O Interfaces
          1. Custom I/O interfaces
          2. General-purpose I/O interfaces
        3. Device Controllers
          1. Mapping addresses of I/O shared memory
          2. Accessing the I/O shared memory
        4. Direct Memory Access (DMA)
          1. Putting DMA to work
      2. Device Files
        1. Old-Style Device Files
        2. Devfs Device Files
        3. VFS Handling of Device Files
      3. Device Drivers
        1. Levels of Kernel Support
        2. Buffering Strategies of Device Drivers
        3. Registering a Device Driver
        4. Initializing a Device Driver
        5. Monitoring I/O Operations
          1. Polling mode
          2. Interrupt mode
      4. Block Device Drivers
        1. Keeping Track of Block Device Drivers
        2. Initializing a Block Device Driver
        3. Sectors, Blocks, and Buffers
        4. Buffer Heads
        5. An Overview of Block Device Driver Architecture
          1. Request descriptors
          2. Request queue descriptors
          3. Block device low-level driver descriptor
        6. The ll_rw_block( ) Function
          1. Scheduling the activation of the strategy routine
          2. Extending the request queue
        7. Low-Level Request Handling
        8. Block and Page I/O Operations
          1. Block I/O operations
          2. Page I/O operations
      5. Character Device Drivers
    15. 14. Disk Caches
      1. The Page Cache
        1. The address_space Object
        2. Page Cache Data Structures
          1. The page hash table
          2. The lists of page descriptors in the address_space object
          3. Page descriptor fields related to the page cache
        3. Page Cache Handling Functions
      2. The Buffer Cache
        1. Buffer Head Data Structures
          1. The list of unused buffer heads
          2. Lists of buffer heads for cached buffers
          3. The hash table of cached buffer heads
          4. Buffer usage counter
        2. Buffer Pages
          1. Allocating buffer pages
        3. The getblk( ) Function
        4. Writing Dirty Buffers to Disk
          1. The bdflush kernel thread
          2. The kupdate kernel thread
          3. The sync( ), fsync( ), and fdatasync( ) system calls
    16. 15. Accessing Files
      1. Reading and Writing a File
        1. Reading from a File
          1. The readpage method for regular files
          2. The readpage method for block device files
        2. Read-Ahead of Files
          1. The accessed page is locked (synchronous read-ahead)
          2. The accessed page is unlocked (asynchronous read-ahead)
        3. Writing to a File
          1. The prepare_write and commit_write methods for regular files
          2. The prepare_write and commit_write methods for block device files
      2. Memory Mapping
        1. Memory Mapping Data Structures
        2. Creating a Memory Mapping
        3. Destroying a Memory Mapping
        4. Demand Paging for Memory Mapping
        5. Flushing Dirty Memory Mapping Pages to Disk
      3. Direct I/O Transfers
    17. 16. Swapping: Methods for Freeing Memory
      1. What Is Swapping?
        1. Which Kind of Page to Swap Out
        2. How to Distribute Pages in the Swap Areas
        3. How to Select the Page to Be Swapped Out
        4. When to Perform Page Swap Out
      2. Swap Area
        1. Swap Area Descriptor
        2. Swapped-Out Page Identifier
        3. Activating and Deactivating a Swap Area
          1. The sys_swapon( ) service routine
          2. The sys_swapoff( ) service routine
          3. The try_to_unuse( ) function
        4. Allocating and Releasing a Page Slot
          1. The scan_swap_map( ) function
          2. The get_swap_page( ) function
          3. The swap_free( ) function
      3. The Swap Cache
        1. Swap Cache Helper Functions
      4. Transferring Swap Pages
        1. The rw_swap_ page( ) Function
        2. The read_swap_cache_async( ) Function
        3. The rw_swap_ page_nolock( ) Function
      5. Swapping Out Pages
        1. The try_to_swap_out( ) Function
      6. Swapping in Pages
        1. The do_swap_page( ) Function
      7. Reclaiming Page Frame
        1. Outline of the Page Frame Reclaiming Algorithm
        2. The Least Recently Used (LRU) Lists
          1. Moving pages across the LRU lists
        3. The try_to_ free_ pages( ) Function
        4. The shrink_caches( ) Function
        5. The shrink_cache( ) Function
        6. Reclaiming Page Frames from the Dentry and Inode Caches
          1. Reclaiming page frames from the dentry cache
          2. Reclaiming page frames from the inode cache
        7. The kswapd Kernel Thread
    18. 17. The Ext2 and Ext3 Filesystems
      1. General Characteristics of Ext2
      2. Ext2 Disk Data Structures
        1. Superblock
        2. Group Descriptor and Bitmap
        3. Inode Table
        4. How Various File Types Use Disk Blocks
          1. Regular file
          2. Directory
          3. Symbolic link
          4. Device file, pipe, and socket
      3. Ext2 Memory Data Structures
        1. The ext2_sb_info and ext2_inode_info Structures
        2. Bitmap Caches
      4. Creating the Ext2 Filesystem
      5. Ext2 Methods
        1. Ext2 Superblock Operations
        2. Ext2 Inode Operations
        3. Ext2 File Operations
      6. Managing Ext2 Disk Space
        1. Creating Inodes
        2. Deleting Inodes
        3. Data Blocks Addressing
        4. File Holes
        5. Allocating a Data Block
        6. Releasing a Data Block
      7. The Ext3 Filesystem
        1. Journaling Filesystems
        2. The Ext3 Journaling Filesystem
        3. The Journaling Block Device Layer
          1. Log records
          2. Atomic operation handles
          3. Transactions
        4. How Journaling Works
    19. 18. Networking
      1. Main Networking Data Structures
        1. Network Architectures
        2. Network Interface Cards
        3. BSD Sockets
        4. INET Sockets
        5. The Destination Cache
        6. Routing Data Structures
          1. The Forwarding Information Base (FIB)
          2. The routing cache
          3. The neighbor cache
        7. The Socket Buffer
      2. System Calls Related to Networking
        1. The socket( ) System Call
          1. Socket initialization
          2. Socket’s files
        2. The bind( ) System Call
        3. The connect( ) System Call
        4. Writing Packets to a Socket
          1. Transport layer: the udp_sendmsg( ) function
          2. Transport and network layers: the ip_build_xmit( ) function
          3. Data link layer: composing the hardware header
          4. Data link layer: enqueueing the socket buffer for transmission
      3. Sending Packets to the Network Card
      4. Receiving Packets from the Network Card
    20. 19. Process Communication
      1. Pipes
        1. Using a Pipe
        2. Pipe Data Structures
          1. The pipefs special filesystem
        3. Creating and Destroying a Pipe
        4. Reading from a Pipe
        5. Writing into a Pipe
      2. FIFOs
        1. Creating and Opening a FIFO
      3. System V IPC
        1. Using an IPC Resource
        2. The ipc( ) System Call
        3. IPC Semaphores
          1. Undoable semaphore operations
          2. The queue of pending requests
        4. IPC Messages
        5. IPC Shared Memory
          1. Swapping out pages of IPC shared memory regions
          2. Demand paging for IPC shared memory regions
    21. 20. Program Execution
      1. Executable Files
        1. Process Credentials and Capabilities
          1. Process capabilities
        2. Command-Line Arguments and Shell Environment
        3. Libraries
        4. Program Segments and Process Memory Regions
        5. Execution Tracing
      2. Executable Formats
      3. Execution Domains
      4. The exec Functions
    22. A. System Startup
      1. Prehistoric Age: The BIOS
      2. Ancient Age: The Boot Loader
        1. Booting Linux from Floppy Disk
        2. Booting Linux from Hard Disk
      3. Middle Ages: The setup( ) Function
      4. Renaissance: The startup_32( ) Functions
      5. Modern Age: The start_kernel( ) Function
    23. B. Modules
      1. To Be (a Module) or Not to Be?
      2. Module Implementation
        1. Module Usage Counter
        2. Exporting Symbols
        3. Module Dependency
      3. Linking and Unlinking Modules
      4. Linking Modules on Demand
        1. The modprobe Program
        2. The request_module( ) Function
    24. C. Source Code Structure
    25. 21. Bibliography
      1. Books on Unix Kernels
      2. Books on the Linux Kernel
      3. Books on PC Architecture and Technical Manuals on Intel Microprocessors
      4. Other Online Documentation Sources
    26. Index
    27. Colophon