Cover image for Understanding the Linux Kernel, 3rd Edition

Book description

In order to thoroughly understand what makes Linux tick and why it works so well on a wide variety of systems, you need to delve deep into the heart of the kernel. The kernel handles all interactions between the CPU and the external world, and determines which programs will share processor time, in what order. It manages limited memory so well that hundreds of processes can share the system efficiently, and expertly organizes data transfers so that the CPU isn't kept waiting any longer than necessary for the relatively slow disks.

The third edition of Understanding the Linux Kernel takes you on a guided tour of the most significant data structures, algorithms, and programming tricks used in the kernel. Probing beyond superficial features, the authors offer valuable insights to people who want to know how things really work inside their machine. Important Intel-specific features are discussed. Relevant segments of code are dissected line by line. But the book covers more than just the functioning of the code; it explains the theoretical underpinnings of why Linux does things the way it does.

This edition of the book covers Version 2.6, which has seen significant changes to nearly every kernel subsystem, particularly in the areas of memory management and block devices. The book focuses on the following topics:

  • Memory management, including file buffering, process swapping, and Direct memory Access (DMA)

  • The Virtual Filesystem layer and the Second and Third Extended Filesystems

  • Process creation and scheduling

  • Signals, interrupts, and the essential interfaces to device drivers

  • Timing

  • Synchronization within the kernel

  • Interprocess Communication (IPC)

  • Program execution

Understanding the Linux Kernel will acquaint you with all the inner workings of Linux, but it's more than just an academic exercise. You'll learn what conditions bring out Linux's best performance, and you'll see how it meets the challenge of providing good system response during process scheduling, file access, and memory management in a wide variety of environments. This book will help you make the most of your Linux system.

Table of Contents

  1. Understanding the Linux Kernel, 3rd Edition
  2. Preface
    1. The Audience for This Book
    2. Organization of the Material
    3. Level of Description
    4. Overview of the Book
    5. Background Information
    6. Conventions in This Book
    7. How to Contact Us
    8. Safari® Enabled
    9. Acknowledgments
  3. 1. Introduction
    1. 1.1. Linux Versus Other Unix-Like Kernels
    2. 1.2. Hardware Dependency
    3. 1.3. Linux Versions
    4. 1.4. Basic Operating System Concepts
      1. 1.4.1. Multiuser Systems
      2. 1.4.2. Users and Groups
      3. 1.4.3. Processes
      4. 1.4.4. Kernel Architecture
    5. 1.5. An Overview of the Unix Filesystem
      1. 1.5.1. Files
      2. 1.5.2. Hard and Soft Links
      3. 1.5.3. File Types
      4. 1.5.4. File Descriptor and Inode
      5. 1.5.5. Access Rights and File Mode
      6. 1.5.6. File-Handling System Calls
        1. 1.5.6.1. Opening a file
        2. 1.5.6.2. Accessing an opened file
        3. 1.5.6.3. Closing a file
        4. 1.5.6.4. Renaming and deleting a file
    6. 1.6. An Overview of Unix Kernels
      1. 1.6.1. The Process/Kernel Model
      2. 1.6.2. Process Implementation
      3. 1.6.3. Reentrant Kernels
      4. 1.6.4. Process Address Space
      5. 1.6.5. Synchronization and Critical Regions
        1. 1.6.5.1. Kernel preemption disabling
        2. 1.6.5.2. Interrupt disabling
        3. 1.6.5.3. Semaphores
        4. 1.6.5.4. Spin locks
        5. 1.6.5.5. Avoiding deadlocks
      6. 1.6.6. Signals and Interprocess Communication
      7. 1.6.7. Process Management
        1. 1.6.7.1. Zombie processes
        2. 1.6.7.2. Process groups and login sessions
      8. 1.6.8. Memory Management
        1. 1.6.8.1. Virtual memory
        2. 1.6.8.2. Random access memory usage
        3. 1.6.8.3. Kernel Memory Allocator
        4. 1.6.8.4. Process virtual address space handling
        5. 1.6.8.5. Caching
      9. 1.6.9. Device Drivers
  4. 2. Memory Addressing
    1. 2.1. Memory Addresses
    2. 2.2. Segmentation in Hardware
      1. 2.2.1. Segment Selectors and Segmentation Registers
      2. 2.2.2. Segment Descriptors
      3. 2.2.3. Fast Access to Segment Descriptors
      4. 2.2.4. Segmentation Unit
    3. 2.3. Segmentation in Linux
      1. 2.3.1. The Linux GDT
      2. 2.3.2. The Linux LDTs
    4. 2.4. Paging in Hardware
      1. 2.4.1. Regular Paging
      2. 2.4.2. Extended Paging
      3. 2.4.3. Hardware Protection Scheme
      4. 2.4.4. An Example of Regular Paging
      5. 2.4.5. The Physical Address Extension (PAE) Paging Mechanism
      6. 2.4.6. Paging for 64-bit Architectures
      7. 2.4.7. Hardware Cache
      8. 2.4.8. Translation Lookaside Buffers (TLB)
    5. 2.5. Paging in Linux
      1. 2.5.1. The Linear Address Fields
      2. 2.5.2. Page Table Handling
      3. 2.5.3. Physical Memory Layout
      4. 2.5.4. Process Page Tables
      5. 2.5.5. Kernel Page Tables
        1. 2.5.5.1. Provisional kernel Page Tables
        2. 2.5.5.2. Final kernel Page Table when RAM size is less than 896 MB
        3. 2.5.5.3. Final kernel Page Table when RAM size is between 896 MB and 4096 MB
        4. 2.5.5.4. Final kernel Page Table when RAM size is more than 4096 MB
      6. 2.5.6. Fix-Mapped Linear Addresses
      7. 2.5.7. Handling the Hardware Cache and the TLB
        1. 2.5.7.1. Handling the hardware cache
        2. 2.5.7.2. Handling the TLB
  5. 3. Processes
    1. 3.1. Processes, Lightweight Processes, and Threads
    2. 3.2. Process Descriptor
      1. 3.2.1. Process State
      2. 3.2.2. Identifying a Process
        1. 3.2.2.1. Process descriptors handling
        2. 3.2.2.2. Identifying the current process
        3. 3.2.2.3. Doubly linked lists
        4. 3.2.2.4. The process list
        5. 3.2.2.5. The lists of TASK_RUNNING processes
      3. 3.2.3. Relationships Among Processes
        1. 3.2.3.1. The pidhash table and chained lists
      4. 3.2.4. How Processes Are Organized
        1. 3.2.4.1. Wait queues
        2. 3.2.4.2. Handling wait queues
      5. 3.2.5. Process Resource Limits
    3. 3.3. Process Switch
      1. 3.3.1. Hardware Context
      2. 3.3.2. Task State Segment
        1. 3.3.2.1. The thread field
      3. 3.3.3. Performing the Process Switch
        1. 3.3.3.1. The switch_to macro
        2. 3.3.3.2. The _ _switch_to ( ) function
      4. 3.3.4. Saving and Loading the FPU, MMX, and XMM Registers
        1. 3.3.4.1. Saving the FPU registers
        2. 3.3.4.2. Loading the FPU registers
        3. 3.3.4.3. Using the FPU, MMX, and SSE/SSE2 units in Kernel Mode
    4. 3.4. Creating Processes
      1. 3.4.1. The clone( ), fork( ), and vfork( ) System Calls
        1. 3.4.1.1. The do_fork( ) function
        2. 3.4.1.2. The copy_process( ) function
      2. 3.4.2. Kernel Threads
        1. 3.4.2.1. Creating a kernel thread
        2. 3.4.2.2. Process 0
        3. 3.4.2.3. Process 1
        4. 3.4.2.4. Other kernel threads
    5. 3.5. Destroying Processes
      1. 3.5.1. Process Termination
        1. 3.5.1.1. The do_group_exit( ) function
        2. 3.5.1.2. The do_exit( ) function
      2. 3.5.2. Process Removal
  6. 4. Interrupts and Exceptions
    1. 4.1. The Role of Interrupt Signals
    2. 4.2. Interrupts and Exceptions
      1. 4.2.1. IRQs and Interrupts
        1. 4.2.1.1. The Advanced Programmable Interrupt Controller (APIC)
      2. 4.2.2. Exceptions
      3. 4.2.3. Interrupt Descriptor Table
      4. 4.2.4. Hardware Handling of Interrupts and Exceptions
    3. 4.3. Nested Execution of Exception and Interrupt Handlers
    4. 4.4. Initializing the Interrupt Descriptor Table
      1. 4.4.1. Interrupt, Trap, and System Gates
      2. 4.4.2. Preliminary Initialization of the IDT
    5. 4.5. Exception Handling
      1. 4.5.1. Saving the Registers for the Exception Handler
      2. 4.5.2. Entering and Leaving the Exception Handler
    6. 4.6. Interrupt Handling
      1. 4.6.1. I/O Interrupt Handling
        1. 4.6.1.1. Interrupt vectors
        2. 4.6.1.2. IRQ data structures
        3. 4.6.1.3. IRQ distribution in multiprocessor systems
        4. 4.6.1.4. Multiple Kernel Mode stacks
        5. 4.6.1.5. Saving the registers for the interrupt handler
        6. 4.6.1.6. The do_IRQ( ) function
        7. 4.6.1.7. The _ _do_IRQ( ) function
        8. 4.6.1.8. Reviving a lost interrupt
        9. 4.6.1.9. Interrupt service routines
        10. 4.6.1.10. Dynamic allocation of IRQ lines
      2. 4.6.2. Interprocessor Interrupt Handling
    7. 4.7. Softirqs and Tasklets
      1. 4.7.1. Softirqs
        1. 4.7.1.1. Data structures used for softirqs
        2. 4.7.1.2. Handling softirqs
        3. 4.7.1.3. The do_softirq( ) function
        4. 4.7.1.4. The _ _do_softirq( ) function
        5. 4.7.1.5. The ksoftirqd kernel threads
      2. 4.7.2. Tasklets
    8. 4.8. Work Queues
      1. 4.8.1.
        1. 4.8.1.1. Work queue data structures
        2. 4.8.1.2. Work queue functions
        3. 4.8.1.3. The predefined work queue
    9. 4.9. Returning from Interrupts and Exceptions
      1. 4.9.1.
        1. 4.9.1.1. The entry points
        2. 4.9.1.2. Resuming a kernel control path
        3. 4.9.1.3. Checking for kernel preemption
        4. 4.9.1.4. Resuming a User Mode program
        5. 4.9.1.5. Checking for rescheduling
        6. 4.9.1.6. Handling pending signals, virtual-8086 mode, and single stepping
  7. 5. Kernel Synchronization
    1. 5.1. How the Kernel Services Requests
      1. 5.1.1. Kernel Preemption
      2. 5.1.2. When Synchronization Is Necessary
      3. 5.1.3. When Synchronization Is Not Necessary
    2. 5.2. Synchronization Primitives
      1. 5.2.1. Per-CPU Variables
      2. 5.2.2. Atomic Operations
      3. 5.2.3. Optimization and Memory Barriers
      4. 5.2.4. Spin Locks
        1. 5.2.4.1. The spin_lock macro with kernel preemption
        2. 5.2.4.2. The spin_lock macro without kernel preemption
        3. 5.2.4.3. The spin_unlock macro
      5. 5.2.5. Read/Write Spin Locks
        1. 5.2.5.1. Getting and releasing a lock for reading
        2. 5.2.5.2. Getting and releasing a lock for writing
      6. 5.2.6. Seqlocks
      7. 5.2.7. Read-Copy Update (RCU)
      8. 5.2.8. Semaphores
        1. 5.2.8.1. Getting and releasing semaphores
      9. 5.2.9. Read/Write Semaphores
      10. 5.2.10. Completions
      11. 5.2.11. Local Interrupt Disabling
      12. 5.2.12. Disabling and Enabling Deferrable Functions
    3. 5.3. Synchronizing Accesses to Kernel Data Structures
      1. 5.3.1. Choosing Among Spin Locks, Semaphores, and Interrupt Disabling
        1. 5.3.1.1. Protecting a data structure accessed by exceptions
        2. 5.3.1.2. Protecting a data structure accessed by interrupts
        3. 5.3.1.3. Protecting a data structure accessed by deferrable functions
        4. 5.3.1.4. Protecting a data structure accessed by exceptions and interrupts
        5. 5.3.1.5. Protecting a data structure accessed by exceptions and deferrable functions
        6. 5.3.1.6. Protecting a data structure accessed by interrupts and deferrable functions
        7. 5.3.1.7. Protecting a data structure accessed by exceptions, interrupts, and deferrable functions
    4. 5.4. Examples of Race Condition Prevention
      1. 5.4.1. Reference Counters
      2. 5.4.2. The Big Kernel Lock
      3. 5.4.3. Memory Descriptor Read/Write Semaphore
      4. 5.4.4. Slab Cache List Semaphore
      5. 5.4.5. Inode Semaphore
  8. 6. Timing Measurements
    1. 6.1. Clock and Timer Circuits
      1. 6.1.1. Real Time Clock (RTC)
      2. 6.1.2. Time Stamp Counter (TSC)
      3. 6.1.3. Programmable Interval Timer (PIT)
      4. 6.1.4. CPU Local Timer
      5. 6.1.5. High Precision Event Timer (HPET)
      6. 6.1.6. ACPI Power Management Timer
    2. 6.2. The Linux Timekeeping Architecture
      1. 6.2.1. Data Structures of the Timekeeping Architecture
        1. 6.2.1.1. The timer object
        2. 6.2.1.2. The jiffies variable
        3. 6.2.1.3. The xtime variable
      2. 6.2.2. Timekeeping Architecture in Uniprocessor Systems
        1. 6.2.2.1. Initialization phase
        2. 6.2.2.2. The timer interrupt handler
      3. 6.2.3. Timekeeping Architecture in Multiprocessor Systems
        1. 6.2.3.1. Initialization phase
        2. 6.2.3.2. The global timer interrupt handler
        3. 6.2.3.3. The local timer interrupt handler
    3. 6.3. Updating the Time and Date
    4. 6.4. Updating System Statistics
      1. 6.4.1. Updating Local CPU Statistics
      2. 6.4.2. Keeping Track of System Load
      3. 6.4.3. Profiling the Kernel Code
      4. 6.4.4. Checking the NMI Watchdogs
    5. 6.5. Software Timers and Delay Functions
      1. 6.5.1. Dynamic Timers
        1. 6.5.1.1. Dynamic timers and race conditions
        2. 6.5.1.2. Data structures for dynamic timers
        3. 6.5.1.3. Dynamic timer handling
      2. 6.5.2. An Application of Dynamic Timers: the nanosleep( ) System Call
      3. 6.5.3. Delay Functions
    6. 6.6. System Calls Related to Timing Measurements
      1. 6.6.1. The time( ) and gettimeofday( ) System Calls
      2. 6.6.2. The adjtimex( ) System Call
      3. 6.6.3. The setitimer( ) and alarm( ) System Calls
      4. 6.6.4. System Calls for POSIX Timers
  9. 7. Process Scheduling
    1. 7.1. Scheduling Policy
      1. 7.1.1. Process Preemption
      2. 7.1.2. How Long Must a Quantum Last?
    2. 7.2. The Scheduling Algorithm
      1. 7.2.1. Scheduling of Conventional Processes
        1. 7.2.1.1. Base time quantum
        2. 7.2.1.2. Dynamic priority and average sleep time
        3. 7.2.1.3. Active and expired processes
      2. 7.2.2. Scheduling of Real-Time Processes
    3. 7.3. Data Structures Used by the Scheduler
      1. 7.3.1. The runqueue Data Structure
      2. 7.3.2. Process Descriptor
    4. 7.4. Functions Used by the Scheduler
      1. 7.4.1. The scheduler_tick( ) Function
        1. 7.4.1.1. Updating the time slice of a real-time process
        2. 7.4.1.2. Updating the time slice of a conventional process
      2. 7.4.2. The try_to_wake_up( ) Function
      3. 7.4.3. The recalc_task_prio( ) Function
      4. 7.4.4. The schedule( ) Function
        1. 7.4.4.1. Direct invocation
        2. 7.4.4.2. Lazy invocation
        3. 7.4.4.3. Actions performed by schedule( ) before a process switch
        4. 7.4.4.4. Actions performed by schedule( ) to make the process switch
        5. 7.4.4.5. Actions performed by schedule( ) after a process switch
    5. 7.5. Runqueue Balancing in Multiprocessor Systems
      1. 7.5.1. Scheduling Domains
      2. 7.5.2. The rebalance_tick( ) Function
      3. 7.5.3. The load_balance( ) Function
      4. 7.5.4. The move_tasks( ) Function
    6. 7.6. System Calls Related to Scheduling
      1. 7.6.1. The nice( ) System Call
      2. 7.6.2. The getpriority( ) and setpriority( ) System Calls
      3. 7.6.3. The sched_getaffinity( ) and sched_setaffinity( ) System Calls
      4. 7.6.4. System Calls Related to Real-Time Processes
        1. 7.6.4.1. The sched_getscheduler( ) and sched_setscheduler( ) system calls
        2. 7.6.4.2. The sched_ getparam( ) and sched_setparam( ) system calls
        3. 7.6.4.3. The sched_ yield( ) system call
        4. 7.6.4.4. The sched_ get_priority_min( ) and sched_ get_priority_max( ) system calls
        5. 7.6.4.5. The sched_rr_ get_interval( ) system call
  10. 8. Memory Management
    1. 8.1. Page Frame Management
      1. 8.1.1. Page Descriptors
      2. 8.1.2. Non-Uniform Memory Access (NUMA)
      3. 8.1.3. Memory Zones
      4. 8.1.4. The Pool of Reserved Page Frames
      5. 8.1.5. The Zoned Page Frame Allocator
        1. 8.1.5.1. Requesting and releasing page frames
      6. 8.1.6. Kernel Mappings of High-Memory Page Frames
        1. 8.1.6.1. Permanent kernel mappings
        2. 8.1.6.2. Temporary kernel mappings
      7. 8.1.7. The Buddy System Algorithm
        1. 8.1.7.1. Data structures
        2. 8.1.7.2. Allocating a block
        3. 8.1.7.3. Freeing a block
      8. 8.1.8. The Per-CPU Page Frame Cache
        1. 8.1.8.1. Allocating page frames through the per-CPU page frame caches
        2. 8.1.8.2. Releasing page frames to the per-CPU page frame caches
      9. 8.1.9. The Zone Allocator
        1. 8.1.9.1. Releasing a group of page frames
    2. 8.2. Memory Area Management
      1. 8.2.1. The Slab Allocator
      2. 8.2.2. Cache Descriptor
      3. 8.2.3. Slab Descriptor
      4. 8.2.4. General and Specific Caches
      5. 8.2.5. Interfacing the Slab Allocator with the Zoned Page Frame Allocator
      6. 8.2.6. Allocating a Slab to a Cache
      7. 8.2.7. Releasing a Slab from a Cache
      8. 8.2.8. Object Descriptor
      9. 8.2.9. Aligning Objects in Memory
      10. 8.2.10. Slab Coloring
      11. 8.2.11. Local Caches of Free Slab Objects
      12. 8.2.12. Allocating a Slab Object
      13. 8.2.13. Freeing a Slab Object
      14. 8.2.14. General Purpose Objects
      15. 8.2.15. Memory Pools
    3. 8.3. Noncontiguous Memory Area Management
      1. 8.3.1. Linear Addresses of Noncontiguous Memory Areas
      2. 8.3.2. Descriptors of Noncontiguous Memory Areas
      3. 8.3.3. Allocating a Noncontiguous Memory Area
      4. 8.3.4. Releasing a Noncontiguous Memory Area
  11. 9. Process Address Space
    1. 9.1. The Process's Address Space
    2. 9.2. The Memory Descriptor
      1. 9.2.1. Memory Descriptor of Kernel Threads
    3. 9.3. Memory Regions
      1. 9.3.1. Memory Region Data Structures
      2. 9.3.2. Memory Region Access Rights
      3. 9.3.3. Memory Region Handling
        1. 9.3.3.1. Finding the closest region to a given address: find_vma( )
        2. 9.3.3.2. Finding a region that overlaps a given interval: find_vma_intersection( )
        3. 9.3.3.3. Finding a free interval: get_unmapped_area( )
        4. 9.3.3.4. Inserting a region in the memory descriptor list: insert_vm_struct( )
      4. 9.3.4. Allocating a Linear Address Interval
      5. 9.3.5. Releasing a Linear Address Interval
        1. 9.3.5.1. The do_munmap( ) function
        2. 9.3.5.2. The split_vma( ) function
        3. 9.3.5.3. The unmap_region( ) function
    4. 9.4. Page Fault Exception Handler
      1. 9.4.1. Handling a Faulty Address Outside the Address Space
      2. 9.4.2. Handling a Faulty Address Inside the Address Space
      3. 9.4.3. Demand Paging
      4. 9.4.4. Copy On Write
      5. 9.4.5. Handling Noncontiguous Memory Area Accesses
    5. 9.5. Creating and Deleting a Process Address Space
      1. 9.5.1. Creating a Process Address Space
      2. 9.5.2. Deleting a Process Address Space
    6. 9.6. Managing the Heap
  12. 10. System Calls
    1. 10.1. POSIX APIs and System Calls
    2. 10.2. System Call Handler and Service Routines
    3. 10.3. Entering and Exiting a System Call
      1. 10.3.1. Issuing a System Call via the int $0x80 Instruction
        1. 10.3.1.1. The system_call( ) function
        2. 10.3.1.2. Exiting from the system call
      2. 10.3.2. Issuing a System Call via the sysenter Instruction
        1. 10.3.2.1. The sysenter instruction
        2. 10.3.2.2. The vsyscall page
        3. 10.3.2.3. Entering the system call
        4. 10.3.2.4. Exiting from the system call
        5. 10.3.2.5. The sysexit instruction
        6. 10.3.2.6. The SYSENTER_RETURN code
    4. 10.4. Parameter Passing
      1. 10.4.1. Verifying the Parameters
      2. 10.4.2. Accessing the Process Address Space
      3. 10.4.3. Dynamic Address Checking: The Fix-up Code
      4. 10.4.4. The Exception Tables
      5. 10.4.5. Generating the Exception Tables and the Fixup Code
    5. 10.5. Kernel Wrapper Routines
  13. 11. Signals
    1. 11.1. The Role of Signals
      1. 11.1.1. Actions Performed upon Delivering a Signal
      2. 11.1.2. POSIX Signals and Multithreaded Applications
      3. 11.1.3. Data Structures Associated with Signals
        1. 11.1.3.1. The signal descriptor and the signal handler descriptor
        2. 11.1.3.2. The sigaction data structure
        3. 11.1.3.3. The pending signal queues
      4. 11.1.4. Operations on Signal Data Structures
    2. 11.2. Generating a Signal
      1. 11.2.1. The specific_send_sig_info( ) Function
      2. 11.2.2. The send_signal( ) Function
      3. 11.2.3. The group_send_sig_info( ) Function
    3. 11.3. Delivering a Signal
      1. 11.3.1. Executing the Default Action for the Signal
      2. 11.3.2. Catching the Signal
        1. 11.3.2.1. Setting up the frame
        2. 11.3.2.2. Evaluating the signal flags
        3. 11.3.2.3. Starting the signal handler
        4. 11.3.2.4. Terminating the signal handler
      3. 11.3.3. Reexecution of System Calls
        1. 11.3.3.1. Restarting a system call interrupted by a non-caught signal
        2. 11.3.3.2. Restarting a system call for a caught signal
    4. 11.4. System Calls Related to Signal Handling
      1. 11.4.1. The kill( ) System Call
      2. 11.4.2. The tkill( ) and tgkill( ) System Calls
      3. 11.4.3. Changing a Signal Action
      4. 11.4.4. Examining the Pending Blocked Signals
      5. 11.4.5. Modifying the Set of Blocked Signals
      6. 11.4.6. Suspending the Process
      7. 11.4.7. System Calls for Real-Time Signals
  14. 12. The Virtual Filesystem
    1. 12.1. The Role of the Virtual Filesystem (VFS)
      1. 12.1.1. The Common File Model
      2. 12.1.2. System Calls Handled by the VFS
    2. 12.2. VFS Data Structures
      1. 12.2.1. Superblock Objects
      2. 12.2.2. Inode Objects
      3. 12.2.3. File Objects
      4. 12.2.4. dentry Objects
      5. 12.2.5. The dentry Cache
      6. 12.2.6. Files Associated with a Process
    3. 12.3. Filesystem Types
      1. 12.3.1. Special Filesystems
      2. 12.3.2. Filesystem Type Registration
    4. 12.4. Filesystem Handling
      1. 12.4.1. Namespaces
      2. 12.4.2. Filesystem Mounting
      3. 12.4.3. Mounting a Generic Filesystem
        1. 12.4.3.1. The do_kern_mount( ) function
        2. 12.4.3.2. Allocating a superblock object
      4. 12.4.4. Mounting the Root Filesystem
        1. 12.4.4.1. Phase 1: Mounting the rootfs filesystem
        2. 12.4.4.2. Phase 2: Mounting the real root filesystem
      5. 12.4.5. Unmounting a Filesystem
    5. 12.5. Pathname Lookup
      1. 12.5.1. Standard Pathname Lookup
      2. 12.5.2. Parent Pathname Lookup
      3. 12.5.3. Lookup of Symbolic Links
    6. 12.6. Implementations of VFS System Calls
      1. 12.6.1. The open( ) System Call
      2. 12.6.2. The read( ) and write( ) System Calls
      3. 12.6.3. The close( ) System Call
    7. 12.7. File Locking
      1. 12.7.1. Linux File Locking
      2. 12.7.2. File-Locking Data Structures
      3. 12.7.3. FL_FLOCK Locks
      4. 12.7.4. FL_POSIX Locks
  15. 13. I/O Architecture and Device Drivers
    1. 13.1. I/O Architecture
      1. 13.1.1. I/O Ports
        1. 13.1.1.1. Accessing I/O ports
      2. 13.1.2. I/O Interfaces
        1. 13.1.2.1. Custom I/O interfaces
        2. 13.1.2.2. General-purpose I/O interfaces
      3. 13.1.3. Device Controllers
    2. 13.2. The Device Driver Model
      1. 13.2.1. The sysfs Filesystem
      2. 13.2.2. Kobjects
        1. 13.2.2.1. Kobjects, ksets, and subsystems
        2. 13.2.2.2. Registering kobjects, ksets, and subsystems
      3. 13.2.3. Components of the Device Driver Model
        1. 13.2.3.1. Devices
        2. 13.2.3.2. Drivers
        3. 13.2.3.3. Buses
        4. 13.2.3.4. Classes
    3. 13.3. Device Files
      1. 13.3.1. User Mode Handling of Device Files
        1. 13.3.1.1. Dynamic device number assignment
        2. 13.3.1.2. Dynamic device file creation
      2. 13.3.2. VFS Handling of Device Files
    4. 13.4. Device Drivers
      1. 13.4.1. Device Driver Registration
      2. 13.4.2. Device Driver Initialization
      3. 13.4.3. Monitoring I/O Operations
        1. 13.4.3.1. Polling mode
        2. 13.4.3.2. Interrupt mode
      4. 13.4.4. Accessing the I/O Shared Memory
      5. 13.4.5. Direct Memory Access (DMA)
        1. 13.4.5.1. Synchronous and asynchronous DMA
        2. 13.4.5.2. Helper functions for DMA transfers
        3. 13.4.5.3. Bus addresses
        4. 13.4.5.4. Cache coherency
        5. 13.4.5.5. Helper functions for coherent DMA mappings
        6. 13.4.5.6. Helper functions for streaming DMA mappings
      6. 13.4.6. Levels of Kernel Support
    5. 13.5. Character Device Drivers
      1. 13.5.1. Assigning Device Numbers
        1. 13.5.1.1. The register_chrdev_region( ) and alloc_chrdev_region( ) functions
        2. 13.5.1.2. The register_chrdev( ) function
      2. 13.5.2. Accessing a Character Device Driver
      3. 13.5.3. Buffering Strategies for Character Devices
  16. 14. Block Device Drivers
    1. 14.1. Block Devices Handling
      1. 14.1.1. Sectors
      2. 14.1.2. Blocks
      3. 14.1.3. Segments
    2. 14.2. The Generic Block Layer
      1. 14.2.1. The Bio Structure
      2. 14.2.2. Representing Disks and Disk Partitions
      3. 14.2.3. Submitting a Request
    3. 14.3. The I/O Scheduler
      1. 14.3.1. Request Queue Descriptors
      2. 14.3.2. Request Descriptors
        1. 14.3.2.1. Managing the allocation of request descriptors
        2. 14.3.2.2. Avoiding request queue congestion
      3. 14.3.3. Activating the Block Device Driver
      4. 14.3.4. I/O Scheduling Algorithms
        1. 14.3.4.1. The "Noop" elevator
        2. 14.3.4.2. The "CFQ" elevator
        3. 14.3.4.3. The "Deadline" elevator
        4. 14.3.4.4. The "Anticipatory" elevator
      5. 14.3.5. Issuing a Request to the I/O Scheduler
        1. 14.3.5.1. The blk_queue_bounce( ) function
    4. 14.4. Block Device Drivers
      1. 14.4.1. Block Devices
        1. 14.4.1.1. Accessing a block device
      2. 14.4.2. Device Driver Registration and Initialization
        1. 14.4.2.1. Defining a custom driver descriptor
        2. 14.4.2.2. Initializing the custom descriptor
        3. 14.4.2.3. Initializing the gendisk descriptor
        4. 14.4.2.4. Initializing the table of block device methods
        5. 14.4.2.5. Allocating and initializing a request queue
        6. 14.4.2.6. Setting up the interrupt handler
        7. 14.4.2.7. Registering the disk
      3. 14.4.3. The Strategy Routine
      4. 14.4.4. The Interrupt Handler
    5. 14.5. Opening a Block Device File
  17. 15. The Page Cache
    1. 15.1. The Page Cache
      1. 15.1.1. The address_space Object
      2. 15.1.2. The Radix Tree
      3. 15.1.3. Page Cache Handling Functions
        1. 15.1.3.1. Finding a page
        2. 15.1.3.2. Adding a page
        3. 15.1.3.3. Removing a page
        4. 15.1.3.4. Updating a page
      4. 15.1.4. The Tags of the Radix Tree
    2. 15.2. Storing Blocks in the Page Cache
      1. 15.2.1. Block Buffers and Buffer Heads
      2. 15.2.2. Managing the Buffer Heads
      3. 15.2.3. Buffer Pages
      4. 15.2.4. Allocating Block Device Buffer Pages
      5. 15.2.5. Releasing Block Device Buffer Pages
      6. 15.2.6. Searching Blocks in the Page Cache
        1. 15.2.6.1. The _ _find_get_block( ) function
        2. 15.2.6.2. The _ _getblk( ) function
        3. 15.2.6.3. The _ _bread( ) function
      7. 15.2.7. Submitting Buffer Heads to the Generic Block Layer
        1. 15.2.7.1. The submit_bh( ) function
        2. 15.2.7.2. The ll_rw_block( ) function
    3. 15.3. Writing Dirty Pages to Disk
      1. 15.3.1. The pdflush Kernel Threads
      2. 15.3.2. Looking for Dirty Pages To Be Flushed
      3. 15.3.3. Retrieving Old Dirty Pages
    4. 15.4. The sync( ), fsync( ), and fdatasync( ) System Calls
      1. 15.4.1. The sync ( ) System Call
      2. 15.4.2. The fsync ( ) and fdatasync ( ) System Calls
  18. 16. Accessing Files
    1. 16.1. Reading and Writing a File
      1. 16.1.1. Reading from a File
        1. 16.1.1.1. The readpage method for regular files
        2. 16.1.1.2. The readpage method for block device files
      2. 16.1.2. Read-Ahead of Files
        1. 16.1.2.1. The page_cache_readahead( ) function
        2. 16.1.2.2. The handle_ra_miss( ) function
      3. 16.1.3. Writing to a File
        1. 16.1.3.1. The prepare_write and commit_write methods for regular files
        2. 16.1.3.2. The prepare_write and commit_write methods for block device files
      4. 16.1.4. Writing Dirty Pages to Disk
    2. 16.2. Memory Mapping
      1. 16.2.1. Memory Mapping Data Structures
      2. 16.2.2. Creating a Memory Mapping
      3. 16.2.3. Destroying a Memory Mapping
      4. 16.2.4. Demand Paging for Memory Mapping
      5. 16.2.5. Flushing Dirty Memory Mapping Pages to Disk
      6. 16.2.6. Non-Linear Memory Mappings
    3. 16.3. Direct I/O Transfers
    4. 16.4. Asynchronous I/O
      1. 16.4.1. Asynchronous I/O in Linux 2.6
        1. 16.4.1.1. The asynchronous I/O context
        2. 16.4.1.2. Submitting the asynchronous I/O operations
  19. 17. Page Frame Reclaiming
    1. 17.1. The Page Frame Reclaiming Algorithm
      1. 17.1.1. Selecting a Target Page
      2. 17.1.2. Design of the PFRA
    2. 17.2. Reverse Mapping
      1. 17.2.1. Reverse Mapping for Anonymous Pages
        1. 17.2.1.1. The try_to_unmap_anon( ) function
        2. 17.2.1.2. The try_to_unmap_one( ) function
      2. 17.2.2. Reverse Mapping for Mapped Pages
        1. 17.2.2.1. The priority search tree
        2. 17.2.2.2. The try_to_unmap_file( ) function
    3. 17.3. Implementing the PFRA
      1. 17.3.1. The Least Recently Used (LRU) Lists
        1. 17.3.1.1. Moving pages across the LRU lists
        2. 17.3.1.2. The mark_page_accessed( ) function
        3. 17.3.1.3. The page_referenced( ) function
        4. 17.3.1.4. The refill_inactive_zone( ) function
      2. 17.3.2. Low On Memory Reclaiming
        1. 17.3.2.1. The free_more_memory( ) function
        2. 17.3.2.2. The try_to_free_pages( ) function
        3. 17.3.2.3. The shrink_caches( ) function
        4. 17.3.2.4. The shrink_zone( ) function
        5. 17.3.2.5. The shrink_cache( ) function
        6. 17.3.2.6. The shrink_list( ) function
        7. 17.3.2.7. The pageout( ) function
      3. 17.3.3. Reclaiming Pages of Shrinkable Disk Caches
        1. 17.3.3.1. Reclaiming page frames from the dentry cache
        2. 17.3.3.2. Reclaiming page frames from the inode cache
      4. 17.3.4. Periodic Reclaiming
        1. 17.3.4.1. The kswapd kernel threads
        2. 17.3.4.2. The cache_reap( ) function
      5. 17.3.5. The Out of Memory Killer
      6. 17.3.6. The Swap Token
    4. 17.4. Swapping
      1. 17.4.1. Swap Area
        1. 17.4.1.1. Creating and activating a swap area
        2. 17.4.1.2. How to distribute pages in the swap areas
      2. 17.4.2. Swap Area Descriptor
      3. 17.4.3. Swapped-Out Page Identifier
      4. 17.4.4. Activating and Deactivating a Swap Area
        1. 17.4.4.1. The sys_swapon( ) service routine
        2. 17.4.4.2. The sys_swapoff( ) service routine
        3. 17.4.4.3. The try_to_unuse( ) function
      5. 17.4.5. Allocating and Releasing a Page Slot
        1. 17.4.5.1. The scan_swap_map( ) function
        2. 17.4.5.2. The get_swap_page( ) function
        3. 17.4.5.3. The swap_free( ) function
      6. 17.4.6. The Swap Cache
        1. 17.4.6.1. Swap cache implementation
        2. 17.4.6.2. Swap cache helper functions
      7. 17.4.7. Swapping Out Pages
        1. 17.4.7.1. Inserting the page frame in the swap cache
        2. 17.4.7.2. Updating the Page Table entries
        3. 17.4.7.3. Writing the page into the swap area
        4. 17.4.7.4. Removing the page frame from the swap cache
      8. 17.4.8. Swapping in Pages
        1. 17.4.8.1. The do_swap_page( ) function
        2. 17.4.8.2. The read_swap_cache_async( ) function
  20. 18. The Ext2 and Ext3 Filesystems
    1. 18.1. General Characteristics of Ext2
    2. 18.2. Ext2 Disk Data Structures
      1. 18.2.1. Superblock
      2. 18.2.2. Group Descriptor and Bitmap
      3. 18.2.3. Inode Table
      4. 18.2.4. Extended Attributes of an Inode
      5. 18.2.5. Access Control Lists
      6. 18.2.6. How Various File Types Use Disk Blocks
        1. 18.2.6.1. Regular file
        2. 18.2.6.2. Directory
        3. 18.2.6.3. Symbolic link
        4. 18.2.6.4. Device file, pipe, and socket
    3. 18.3. Ext2 Memory Data Structures
      1. 18.3.1. The Ext2 Superblock Object
      2. 18.3.2. The Ext2 inode Object
    4. 18.4. Creating the Ext2 Filesystem
    5. 18.5. Ext2 Methods
      1. 18.5.1. Ext2 Superblock Operations
      2. 18.5.2. Ext2 inode Operations
      3. 18.5.3. Ext2 File Operations
    6. 18.6. Managing Ext2 Disk Space
      1. 18.6.1. Creating inodes
      2. 18.6.2. Deleting inodes
      3. 18.6.3. Data Blocks Addressing
      4. 18.6.4. File Holes
      5. 18.6.5. Allocating a Data Block
      6. 18.6.6. Releasing a Data Block
    7. 18.7. The Ext3 Filesystem
      1. 18.7.1. Journaling Filesystems
      2. 18.7.2. The Ext3 Journaling Filesystem
      3. 18.7.3. The Journaling Block Device Layer
        1. 18.7.3.1. Log records
        2. 18.7.3.2. Atomic operation handles
        3. 18.7.3.3. Transactions
      4. 18.7.4. How Journaling Works
  21. 19. Process Communication
    1. 19.1. Pipes
      1. 19.1.1. Using a Pipe
      2. 19.1.2. Pipe Data Structures
        1. 19.1.2.1. The pipefs special filesystem
      3. 19.1.3. Creating and Destroying a Pipe
      4. 19.1.4. Reading from a Pipe
      5. 19.1.5. Writing into a Pipe
    2. 19.2. FIFOs
      1. 19.2.1. Creating and Opening a FIFO
    3. 19.3. System V IPC
      1. 19.3.1. Using an IPC Resource
      2. 19.3.2. The ipc( ) System Call
      3. 19.3.3. IPC Semaphores
        1. 19.3.3.1. Undoable semaphore operations
        2. 19.3.3.2. The queue of pending requests
      4. 19.3.4. IPC Messages
      5. 19.3.5. IPC Shared Memory
        1. 19.3.5.1. Swapping out pages of IPC shared memory regions
        2. 19.3.5.2. Demand paging for IPC shared memory regions
    4. 19.4. POSIX Message Queues
  22. 20. Program ExZecution
    1. 20.1. Executable Files
      1. 20.1.1. Process Credentials and Capabilities
        1. 20.1.1.1. Process capabilities
        2. 20.1.1.2. The Linux Security Modules framework
      2. 20.1.2. Command-Line Arguments and Shell Environment
      3. 20.1.3. Libraries
      4. 20.1.4. Program Segments and Process Memory Regions
        1. 20.1.4.1. Flexible memory region layout
      5. 20.1.5. Execution Tracing
    2. 20.2. Executable Formats
    3. 20.3. Execution Domains
    4. 20.4. The exec Functions
  23. A. System Startup
    1. A.1. Prehistoric Age: the BIOS
    2. A.2. Ancient Age: the Boot Loader
      1. A.2.1. Booting Linux from a Disk
    3. A.3. Middle Ages: the setup( ) Function
    4. A.4. Renaissance: the startup_32( ) Functions
    5. A.5. Modern Age: the start_kernel( ) Function
  24. B. Modules
    1. B.1. To Be (a Module) or Not to Be?
      1. B.1.1. Module Licenses
    2. B.2. Module Implementation
      1. B.2.1. Module Usage Counters
      2. B.2.2. Exporting Symbols
      3. B.2.3. Module Dependency
    3. B.3. Linking and Unlinking Modules
    4. B.4. Linking Modules on Demand
      1. B.4.1. The modprobe Program
      2. B.4.2. The request_module( ) Function
  25. C. Bibliography
    1. Books on Unix Kernels
    2. Books on the Linux Kernel
    3. Books on PC Architecture and Technical Manuals on Intel Microprocessors
    4. Other Online Documentation Sources
    5. Research Papers Related to Linux Development
  26. About the Authors
  27. Colophon
  28. Copyright