The Design and Implementation of the FreeBSD® Operating System, Second Edition

Book Description

The most complete, authoritative technical guide to the FreeBSD kernel’s internal structure has now been extensively updated to cover all major improvements between Versions 5 and 11. Approximately one-third of this edition’s content is completely new, and another one-third has been extensively rewritten.

Three long-time FreeBSD project leaders begin with a concise overview of the FreeBSD kernel’s current design and implementation. Next, they cover the FreeBSD kernel from the system-call level down, from the interface to the kernel to the hardware. Explaining key design decisions, they detail the concepts, data structures, and algorithms used in implementing each significant system facility, including process management, security, virtual memory, the I/O system, filesystems, socket IPC, and networking.

This Second Edition

• Explains highly scalable and lightweight virtualization using FreeBSD jails, and virtual-machine acceleration with Xen and Virtio device paravirtualization

• Describes new security features such as Capsicum sandboxing and GELI cryptographic disk protection

• Fully covers NFSv4 and OpenSolaris ZFS support

• Introduces FreeBSD’s enhanced volume management and new journaled soft updates

• Explains DTrace’s fine-grained process debugging/profiling

• Reflects major improvements to networking, wireless, and USB support

Readers can use this guide as both a working reference and an in-depth study of a leading contemporary, portable, open source operating system. Technical and sales support professionals will discover both FreeBSD’s capabilities and its limitations. Applications developers will learn how to effectively and efficiently interface with it; system administrators will learn how to maintain, tune, and configure it; and systems programmers will learn how to extend, enhance, and interface with it.

Marshall Kirk McKusick writes, consults, and teaches classes on UNIX- and BSD-related subjects. While at the University of California, Berkeley, he implemented the 4.2BSD fast filesystem. He was a research computer scientist at the Berkeley Computer Systems Research Group (CSRG), overseeing development and release of 4.3BSD and 4.4BSD. He is a FreeBSD Foundation board member and a long-time FreeBSD committer. Twice president of the Usenix Association, he is also a member of ACM, IEEE, and AAAS.

George V. Neville-Neil hacks, writes, teaches, and consults on security, networking, and operating systems. A FreeBSD Foundation board member, he served on the FreeBSD Core Team for four years. Since 2004, he has written the “Kode Vicious” column for Queue and Communications of the ACM. He is vice chair of ACM’s Practitioner Board and a member of Usenix Association, ACM, IEEE, and AAAS.

Robert N.M. Watson is a University Lecturer in systems, security, and architecture in the Security Research Group at the University of Cambridge Computer Laboratory. He supervises advanced research in computer architecture, compilers, program analysis, operating systems, networking, and security. A FreeBSD Foundation board member, he served on the Core Team for ten years and has been a committer for fifteen years. He is a member of Usenix Association and ACM.

Table of Contents

  1. About This eBook
  2. Title Page
  3. Copyright Page
  4. Dedication
  5. Contents
  6. Preface
    1. UNIX-like Systems
    2. Berkeley Software Distributions
    3. Material Covered in this Book
    4. Use by Computer Professionals
    5. Use in Courses on Operating Systems
    6. Organization
    7. Getting BSD
    8. Acknowledgments
    9. References
  7. About the Authors
  8. Part I: Overview
    1. Chapter 1. History and Goals
      1. 1.1 History of the UNIX System
        1. Origins
        2. Research UNIX
        3. AT&T UNIX System III and System V
        4. Berkeley Software Distributions
        5. UNIX in the World
      2. 1.2 BSD and Other Systems
        1. The Influence of the User Community
      3. 1.3 The Transition of BSD to Open Source
        1. Networking Release 2
        2. The Lawsuit
        3. 4.4BSD
        4. 4.4BSD-Lite Release 2
      4. 1.4 The FreeBSD Development Model
        1. References
    2. Chapter 2. Design Overview of FreeBSD
      1. 2.1 FreeBSD Facilities and the Kernel
        1. The Kernel
      2. 2.2 Kernel Organization
      3. 2.3 Kernel Services
      4. 2.4 Process Management
        1. Signals
        2. Process Groups and Sessions
      5. 2.5 Security
        1. Process Credentials
        2. Privilege Model
        3. Discretionary Access Control
        4. Capability Model
        5. Jail Lightweight Virtualization
        6. Mandatory Access Control
        7. Event Auditing
        8. Cryptography and Random-Number Generators
      6. 2.6 Memory Management
        1. BSD Memory-Management Design Decisions
        2. Memory Management Inside the Kernel
      7. 2.7 I/O System Overview
        1. Descriptors and I/O
        2. Descriptor Management
        3. Devices
        4. Socket IPC
        5. Scatter-Gather I/O
        6. Multiple Filesystem Support
      8. 2.8 Devices
      9. 2.9 The Fast Filesystem
        1. Filestores
      10. 2.10 The Zettabyte Filesystem
      11. 2.11 The Network Filesystem
      12. 2.12 Interprocess Communication
      13. 2.13 Network-Layer Protocols
      14. 2.14 Transport-Layer Protocols
      15. 2.15 System Startup and Shutdown
        1. Exercises
        2. References
    3. Chapter 3. Kernel Services
      1. 3.1 Kernel Organization
        1. System Processes
        2. System Entry
        3. Run-Time Organization
        4. Entry to the Kernel
        5. Return from the Kernel
      2. 3.2 System Calls
        1. Result Handling
        2. Returning from a System Call
      3. 3.3 Traps and Interrupts
        1. I/O Device Interrupts
        2. Software Interrupts
      4. 3.4 Clock Interrupts
        1. Statistics and Process Scheduling
        2. Timeouts
      5. 3.5 Memory-Management Services
      6. 3.6 Timing Services
        1. Real Time
        2. External Representation
        3. Adjustment of the Time
        4. Interval Time
      7. 3.7 Resource Services
        1. Process Priorities
        2. Resource Utilization
        3. Resource Limits
        4. Filesystem Quotas
      8. 3.8 Kernel Tracing Facilities
        1. System-Call Tracing
        2. DTrace
        3. Kernel Tracing
        4. Exercises
        5. References
  9. Part II: Processes
    1. Chapter 4. Process Management
      1. 4.1 Introduction to Process Management
        1. Multiprogramming
        2. Scheduling
      2. 4.2 Process State
        1. The Process Structure
        2. The Thread Structure
      3. 4.3 Context Switching
        1. Thread State
        2. Low-Level Context Switching
        3. Voluntary Context Switching
        4. Synchronization
        5. Mutex Synchronization
        6. Mutex Interface
        7. Lock Synchronization
        8. Deadlock Prevention
      4. 4.4 Thread Scheduling
        1. The Low-Level Scheduler
        2. Thread Run Queues and Context Switching
        3. Timeshare Thread Scheduling
        4. Multiprocessor Scheduling
        5. Adaptive Idle
        6. Traditional Timeshare Thread Scheduling
      5. 4.5 Process Creation
      6. 4.6 Process Termination
      7. 4.7 Signals
        1. Posting of a Signal
        2. Delivering a Signal
      8. 4.8 Process Groups and Sessions
        1. Process Groups
        2. Sessions
        3. Job Control
      9. 4.9 Process Debugging
        1. Exercises
        2. References
    2. Chapter 5. Security
      1. 5.1 Operating-System Security
      2. 5.2 Security Model
        1. Process Model
        2. Discretionary and Mandatory Access Control
        3. Trusted Computing Base (TCB)
        4. Other Kernel-Security Features
      3. 5.3 Process Credentials
        1. The Credential Structure
        2. Credential Memory Model
        3. Access-Control Checks
      4. 5.4 Users and Groups
        1. Setuid and Setgid Binaries
      5. 5.5 Privilege Model
        1. Implicit Privilege
        2. Explicit Privilege
      6. 5.6 Interprocess Access Control
        1. Visibility
        2. Signals
        3. Scheduling Control
        4. Waiting on Process Termination
        5. Debugging
      7. 5.7 Discretionary Access Control
        1. The Virtual-Filesystem Interface and DAC
        2. Object Owners and Groups
        3. UNIX Permissions
        4. Access Control Lists (ACLs)
        5. POSIX.1e Access Control Lists
        6. NFSv4 Access Control Lists
      8. 5.8 Capsicum Capability Model
        1. Capsicum Application Structure
        2. Capability Systems
        3. Capabilities
        4. Capability Mode
      9. 5.9 Jails
      10. 5.10 Mandatory Access-Control Framework
        1. Mandatory Policies
        2. Guiding Design Principles
        3. Architecture of the MAC Framework
        4. Framework Startup
        5. Policy Registration
        6. Framework Entry-Point Design Considerations
        7. Policy Entry-Point Considerations
        8. Kernel Service Entry-Point Invocation
        9. Policy Composition
        10. Object Labelling
        11. Label Life Cycle and Memory Management
        12. Label Synchronization
        13. Policy-Agnostic Label Management from Userspace
      11. 5.11 Security Event Auditing
        1. Audit Events and Records
        2. BSM Audit Records and Audit Trails
        3. Kernel-Audit Implementation
      12. 5.12 Cryptographic Services
        1. Cryptographic Framework
        2. Random-Number Generator
      13. 5.13 GELI Full-Disk Encryption
        1. Confidentiality and Integrity Protection
        2. Key Management
        3. Starting GELI
        4. Cryptographic Block Protection
        5. I/O Model
        6. Limitations
        7. Exercises
        8. References
    3. Chapter 6. Memory Management
      1. 6.1 Terminology
        1. Processes and Memory
        2. Paging
        3. Replacement Algorithms
        4. Working-Set Model
        5. Swapping
        6. Advantages of Virtual Memory
        7. Hardware Requirements for Virtual Memory
      2. 6.2 Overview of the FreeBSD Virtual-Memory System
        1. User Address-Space Management
      3. 6.3 Kernel Memory Management
        1. Kernel Maps and Submaps
        2. Kernel Address-Space Allocation
        3. The Slab Allocator
        4. The Keg Allocator
        5. The Zone Allocator
        6. Kernel Malloc
        7. Kernel Zone Allocator
      4. 6.4 Per-Process Resources
        1. FreeBSD Process Virtual-Address Space
        2. Page-Fault Dispatch
        3. Mapping to Vm_objects
        4. Vm_objects
        5. Vm_objects to Pages
      5. 6.5 Shared Memory
        1. Mmap Model
        2. Shared Mapping
        3. Private Mapping
        4. Collapsing of Shadow Chains
        5. Private Snapshots
      6. 6.6 Creation of a New Process
        1. Reserving Kernel Resources
        2. Duplication of the User Address Space
        3. Creation of a New Process Without Copying
      7. 6.7 Execution of a File
      8. 6.8 Process Manipulation of Its Address Space
        1. Change of Process Size
        2. File Mapping
        3. Change of Protection
      9. 6.9 Termination of a Process
      10. 6.10 The Pager Interface
        1. Vnode Pager
        2. Device Pager
        3. Physical-Memory Pager
        4. Swap Pager
      11. 6.11 Paging
        1. Hardware-Cache Design
        2. Hardware Memory Management
        3. Superpages
      12. 6.12 Page Replacement
        1. Paging Parameters
        2. The Pageout Daemon
        3. Swapping
        4. The Swap-In Process
      13. 6.13 Portability
        1. The Role of the pmap Module
        2. Initialization and Startup
        3. Mapping Allocation and Deallocation
        4. Change of Access and Wiring Attributes for Mappings
        5. Maintenance of Physical Page-Usage Information
        6. Initialization of Physical Pages
        7. Management of Internal Data Structures
        8. Exercises
        9. References
  10. Part III: I/O System
    1. Chapter 7. I/O System Overview
      1. 7.1 Descriptor Management and Services
        1. Open File Entries
        2. Management of Descriptors
        3. Asynchronous I/O
        4. File-Descriptor Locking
        5. Multiplexing I/O on Descriptors
        6. Implementation of Select
        7. Kqueues and Kevents
        8. Movement of Data Inside the Kernel
      2. 7.2 Local Interprocess Communication
        1. Semaphores
        2. Message Queues
        3. Shared Memory
      3. 7.3 The Virtual-Filesystem Interface
        1. Contents of a Vnode
        2. Vnode Operations
        3. Pathname Translation
        4. Exported Filesystem Services
      4. 7.4 Filesystem-Independent Services
        1. The Name Cache
        2. Buffer Management
        3. Implementation of Buffer Management
      5. 7.5 Stackable Filesystems
        1. Simple Filesystem Layers
        2. The Union Filesystem
        3. Other Filesystems
        4. Exercises
        5. References
    2. Chapter 8. Devices
      1. 8.1 Device Overview
        1. The PC I/O Architecture
        2. The Structure of the FreeBSD Mass Storage I/O Subsystem
        3. Device Naming and Access
      2. 8.2 I/O Mapping from User to Device
        1. Device Drivers
        2. I/O Queueing
        3. Interrupt Handling
      3. 8.3 Character Devices
        1. Raw Devices and Physical I/O
        2. Character-Oriented Devices
        3. Entry Points for Character Device Drivers
      4. 8.4 Disk Devices
        1. Entry Points for Disk Device Drivers
        2. Sorting of Disk I/O Requests
        3. Disk Labels
      5. 8.5 Network Devices
        1. Entry Points for Network Drivers
        2. Configuration and Control
        3. Packet Reception
        4. Packet Transmission
      6. 8.6 Terminal Handling
        1. Terminal-Processing Modes
        2. User Interface
        3. Process Groups, Sessions, and Terminal Control
        4. Terminal Operations
        5. Terminal Output (Upper Half)
        6. Terminal Output (Lower Half)
        7. Terminal Input
        8. Closing of Terminal Devices
      7. 8.7 The GEOM Layer
        1. Terminology and Topology Rules
        2. Changing Topology
        3. Operation
        4. Topological Flexibility
      8. 8.8 The CAM Layer
        1. The Path of a SCSI I/O Request Through the CAM Subsystem
        2. ATA Disks
      9. 8.9 Device Configuration
        1. Device Identification
        2. Autoconfiguration Data Structures
        3. Resource Management
      10. 8.10 Device Virtualization
        1. Interaction with the Hypervisor
        2. Virtio
        3. Xen
        4. Device Pass-Through
        5. Exercises
        6. References
    3. Chapter 9. The Fast Filesystem
      1. 9.1 Hierarchical Filesystem Management
      2. 9.2 Structure of an Inode
        1. Changes to the Inode Format
        2. Extended Attributes
        3. New Filesystem Capabilities
        4. File Flags
        5. Dynamic Inodes
        6. Inode Management
      3. 9.3 Naming
        1. Directories
        2. Finding of Names in Directories
        3. Pathname Translation
        4. Links
      4. 9.4 Quotas
      5. 9.5 File Locking
      6. 9.6 Soft Updates
        1. Update Dependencies in the Filesystem
        2. Dependency Structures
        3. Bitmap Dependency Tracking
        4. Inode Dependency Tracking
        5. Direct-Block Dependency Tracking
        6. Indirect-Block Dependency Tracking
        7. Dependency Tracking for New Indirect Blocks
        8. New Directory-Entry Dependency Tracking
        9. New Directory Dependency Tracking
        10. Directory-Entry Removal-Dependency Tracking
        11. File Truncation
        12. File and Directory Inode Reclamation
        13. Directory-Entry Renaming Dependency Tracking
        14. Fsync Requirements for Soft Updates
        15. File-Removal Requirements for Soft Updates
        16. Soft-Updates Requirements for fsck
      7. 9.7 Filesystem Snapshots
        1. Creating a Filesystem Snapshot
        2. Maintaining a Filesystem Snapshot
        3. Large Filesystem Snapshots
        4. Background fsck
        5. User-Visible Snapshots
        6. Live Dumps
      8. 9.8 Journaled Soft Updates
        1. Background and Introduction
        2. Compatibility with Other Implementations
        3. Journal Format
        4. Modifications That Require Journaling
        5. Additional Requirements of Journaling
        6. The Recovery Process
        7. Performance
        8. Future Work
        9. Tracking File-Removal Dependencies
      9. 9.9 The Local Filestore
        1. Overview of the Filestore
        2. User I/O to a File
      10. 9.10 The Berkeley Fast Filesystem
        1. Organization of the Berkeley Fast Filesystem
        2. Boot Blocks
        3. Optimization of Storage Utilization
        4. Reading and Writing to a File
        5. Layout Policies
        6. Allocation Mechanisms
        7. Block Clustering
        8. Extent-Based Allocation
        9. Exercises
        10. References
    4. Chapter 10. The Zettabyte Filesystem
      1. 10.1 Introduction
      2. 10.2 ZFS Organization
        1. ZFS Dnode
        2. ZFS Block Pointers
        3. ZFS objset Structure
      3. 10.3 ZFS Structure
        1. The MOS Layer
        2. The Object-Set Layer
      4. 10.4 ZFS Operation
        1. Writing New Data to Disk
        2. Logging
        3. RAIDZ
        4. Snapshots
        5. ZFS Block Allocation
        6. Freeing Blocks
        7. Deduplication
        8. Remote Replication
      5. 10.5 ZFS Design Tradeoffs
        1. Exercises
        2. References
    5. Chapter 11. The Network Filesystem
      1. 11.1 Overview
      2. 11.2 Structure and Operation
        1. The FreeBSD NFS Implementation
        2. Client–Server Interactions
        3. Security Issues
        4. Techniques for Improving Performance
      3. 11.3 NFS Evolution
        1. Namespace
        2. Attributes
        3. Access Control Lists
        4. Caching, Delegation, and Callbacks
        5. Locking
        6. Security
        7. Crash Recovery
        8. Exercises
        9. References
  11. Part IV: Interprocess Communication
    1. Chapter 12. Interprocess Communication
      1. 12.1 Interprocess-Communication Model
        1. Use of Sockets
      2. 12.2 Implementation Structure and Overview
      3. 12.3 Memory Management
        1. Mbufs
        2. Storage-Management Algorithms
        3. Mbuf Utility Routines
      4. 12.4 IPC Data Structures
        1. Socket Addresses
        2. Locks
      5. 12.5 Connection Setup
      6. 12.6 Data Transfer
        1. Transmitting Data
        2. Receiving Data
      7. 12.7 Socket Shutdown
      8. 12.8 Network-Communication Protocol Internal Structure
        1. Data Flow
        2. Communication Protocols
      9. 12.9 Socket-to-Protocol Interface
        1. Protocol User-Request Routines
        2. Protocol Control-Output Routine
      10. 12.10 Protocol-to-Protocol Interface
        1. pr_output
        2. pr_input
        3. pr_ctlinput
      11. 12.11 Protocol-to-Network Interface
        1. Network Interfaces and Link-Layer Protocols
        2. Packet Transmission
        3. Packet Reception
      12. 12.12 Buffering and Flow Control
        1. Protocol Buffering Policies
        2. Queue Limiting
      13. 12.13 Network Virtualization
        1. Exercises
        2. References
    2. Chapter 13. Network-Layer Protocols
      1. 13.1 Internet Protocol Version 4
        1. IPv4 Addresses
        2. Broadcast Addresses
        3. Internet Multicast
        4. Link-Layer Address Resolution
      2. 13.2 Internet Control Message Protocols (ICMP)
      3. 13.3 Internet Protocol Version 6
        1. IPv6 Addresses
        2. IPv6 Packet Formats
        3. Changes to the Socket API
        4. Autoconfiguration
      4. 13.4 Internet Protocols Code Structure
        1. Output
        2. Input
        3. Forwarding
      5. 13.5 Routing
        1. Kernel Routing Tables
        2. Routing Lookup
        3. Routing Redirects
        4. Routing-Table Interface
        5. User-Level Routing Policies
        6. User-Level Routing Interface: Routing Socket
      6. 13.6 Raw Sockets
        1. Control Blocks
        2. Input Processing
        3. Output Processing
      7. 13.7 Security
        1. IPSec Overview
        2. Security Protocols
        3. Key Management
        4. IPSec Implementation
      8. 13.8 Packet-Processing Frameworks
        1. Berkeley Packet Filter
        2. IP Firewalls
        3. IPFW and Dummynet
        4. Packet Filter (PF)
        5. Netgraph
        6. Netmap
        7. Exercises
        8. References
    3. Chapter 14. Transport-Layer Protocols
      1. 14.1 Internet Ports and Associations
        1. Protocol Control Blocks
      2. 14.2 User Datagram Protocol (UDP)
        1. Initialization
        2. Output
        3. Input
        4. Control Operations
      3. 14.3 Transmission Control Protocol (TCP)
        1. TCP Connection States
        2. Sequence Variables
      4. 14.4 TCP Algorithms
        1. Timers
        2. Estimation of Round-Trip Time
        3. Connection Establishment
        4. SYN Cache
        5. SYN Cookies
        6. Connection Shutdown
      5. 14.5 TCP Input Processing
      6. 14.6 TCP Output Processing
        1. Sending Data
        2. Avoidance of the Silly-Window Syndrome
        3. Avoidance of Small Packets
        4. Delayed Acknowledgments and Window Updates
        5. Selective Acknowledgment
        6. Retransmit State
        7. Slow Start
        8. Buffer and Window Sizing
        9. Avoidance of Congestion with Slow Start
        10. Fast Retransmission
        11. Modular Congestion Control
        12. The Vegas Algorithm
        13. The Cubic Algorithm
      7. 14.7 Stream Control Transmission Protocol (SCTP)
        1. Chunks
        2. Association Setup
        3. Data Transfer
        4. Association Shutdown
        5. Multihoming and Heartbeats
        6. Exercises
        7. References
  12. Part V: System Operation
    1. Chapter 15. System Startup and Shutdown
      1. 15.1 Firmware and BIOSes
      2. 15.2 Boot Loaders
        1. Master Boot Record and Globally Unique Identifier Partition Table
        2. The Second-Stage Boot Loader: gptboot
        3. The Final-Stage Boot Loader: /boot/loader
        4. Boot Loading on Embedded Platforms
      3. 15.3 Kernel Boot
        1. Assembly-Language Startup
        2. Platform-Specific C-Language Startup
        3. Modular Kernel Design
        4. Module Initialization
        5. Basic Kernel Services
        6. Kernel-Thread Initialization
        7. Device-Module Initialization
        8. Loadable Kernel Modules
      4. 15.4 User-Level Initialization
        1. /sbin/init
        2. System Startup Scripts
        3. /usr/libexec/getty
        4. /usr/bin/login
      5. 15.5 System Operation
        1. Kernel Configuration
        2. System Shutdown and Autoreboot
        3. System Debugging
        4. Passage of Information To and From the Kernel
        5. Exercises
        6. References
  13. Glossary
  14. Index
  15. FreeBSD Kernel Internals on Video
  16. Advanced FreeBSD Course on Video
  17. FreeBSD Networking from the Bottom Up on Video
  18. CSRG Archive CD-ROMs
  19. History of UNIX at Berkeley
  20. Teaching a Course Using This Book
  21. Code Snippets