Understanding Linux Network Internals

Book Description

If you've ever wondered how Linux carries out the complicated tasks assigned to it by the IP protocols -- or if you just want to learn about modern networking through real-life examples -- Understanding Linux Network Internals is for you.

Like the popular O'Reilly book, Understanding the Linux Kernel, this book clearly explains the underlying concepts and teaches you how to follow the actual C code that implements them. Although some background in the TCP/IP protocols is helpful, you can learn a great deal from this text about the protocols themselves and their uses. And if you already have a basic knowledge of C, you can use the book's code walkthroughs to figure out exactly what this sophisticated part of the Linux kernel is doing.

Part of the difficulty in understanding networks -- and implementing them -- is that the tasks are broken up and performed at many different times by different pieces of code. One of this book's strengths is that it integrates the pieces and reveals the relationships between far-flung functions and data structures. Understanding Linux Network Internals is both a big-picture discussion and a no-nonsense guide to the details of Linux networking. Topics include:

  • Key problems with networking

  • Network interface card (NIC) device drivers

  • System initialization

  • Layer 2 (link-layer) tasks and implementation

  • Layer 3 (IPv4) tasks and implementation

  • Neighbor infrastructure and protocols (ARP)

  • Bridging

  • Routing

  • ICMP

Author Christian Benvenuti, an operating system designer specializing in networking, explains much more than how Linux code works. He shows the purposes of major networking features and the trade-offs involved in choosing one solution over another. Numerous flowcharts and other diagrams make the material easier to follow.

Table of Contents

  1. Understanding Linux Network Internals
  2. Preface
    1. The Audience for This Book
    2. Background Information
    3. Organization of the Material
      1. What Is Not Covered
    4. Conventions Used in This Book
    5. Using Code Examples
    6. We'd Like to Hear from You
    7. Safari Enabled
    8. Acknowledgments
  3. I. General Background
    1. 1. Introduction
      1. 1.1. Basic Terminology
      2. 1.2. Common Coding Patterns
        1. 1.2.1. Memory Caches
        2. 1.2.2. Caching and Hash Tables
        3. 1.2.3. Reference Counts
        4. 1.2.4. Garbage Collection
        5. 1.2.5. Function Pointers and Virtual Function Tables (VFTs)
        6. 1.2.6. goto Statements
        7. 1.2.7. Vector Definitions
        8. 1.2.8. Conditional Directives (#ifdef and family)
        9. 1.2.9. Compile-Time Optimization for Condition Checks
        10. 1.2.10. Mutual Exclusion
        11. 1.2.11. Conversions Between Host and Network Order
        12. 1.2.12. Catching Bugs
        13. 1.2.13. Statistics
        14. 1.2.14. Measuring Time
      3. 1.3. User-Space Tools
      4. 1.4. Browsing the Source Code
        1. 1.4.1. Dead Code
      5. 1.5. When a Feature Is Offered as a Patch
    2. 2. Critical Data Structures
      1. 2.1. The Socket Buffer: sk_buff Structure
        1. 2.1.1. Networking Options and Kernel Structures
        2. 2.1.2. Layout Fields
        3. 2.1.3. General Fields
        4. 2.1.4. Feature-Specific Fields
        5. 2.1.5. Management Functions
          1. 2.1.5.1. Allocating memory: alloc_skb and dev_alloc_skb
          2. 2.1.5.2. Freeing memory: kfree_skb and dev_kfree_skb
          3. 2.1.5.3. Data reservation and alignment: skb_reserve, skb_put, skb_push, and skb_pull
          4. 2.1.5.4. The skb_shared_info structure and the skb_shinfo function
          5. 2.1.5.5. Cloning and copying buffers
          6. 2.1.5.6. List management functions
      2. 2.2. net_device Structure
        1. 2.2.1. Identifiers
        2. 2.2.2. Configuration
          1. 2.2.2.1. Interface types and ports
          2. 2.2.2.2. Promiscuous mode
        3. 2.2.3. Statistics
        4. 2.2.4. Device Status
        5. 2.2.5. List Management
        6. 2.2.6. Link Layer Multicast
        7. 2.2.7. Traffic Management
        8. 2.2.8. Feature Specific
        9. 2.2.9. Generic
        10. 2.2.10. Function Pointers
      3. 2.3. Files Mentioned in This Chapter
    3. 3. User-Space-to-Kernel Interface
      1. 3.1. Overview
      2. 3.2. procfs Versus sysctl
        1. 3.2.1. procfs
        2. 3.2.2. sysctl: Directory /proc/sys
          1. 3.2.2.1. Examples of ctl_table initialization
          2. 3.2.2.2. Registering a file in /proc/sys
          3. 3.2.2.3. Core networking files and directories
      3. 3.3. ioctl
      4. 3.4. Netlink
      5. 3.5. Serializing Configuration Changes
  4. II. System Initialization
    1. 4. Notification Chains
      1. 4.1. Reasons for Notification Chains
      2. 4.2. Overview
      3. 4.3. Defining a Chain
      4. 4.4. Registering with a Chain
      5. 4.5. Notifying Events on a Chain
      6. 4.6. Notification Chains for the Networking Subsystems
        1. 4.6.1. Wrappers
        2. 4.6.2. Examples
      7. 4.7. Tuning via /proc Filesystem
      8. 4.8. Functions and Variables Featured in This Chapter
      9. 4.9. Files and Directories Featured in This Chapter
    2. 5. Network Device Initialization
      1. 5.1. System Initialization Overview
      2. 5.2. Device Registration and Initialization
      3. 5.3. Basic Goals of NIC Initialization
      4. 5.4. Interaction Between Devices and Kernel
        1. 5.4.1. Hardware Interrupts
          1. 5.4.1.1. Interrupt types
          2. 5.4.1.2. Interrupt sharing
          3. 5.4.1.3. Organization of IRQs to handler mappings
      5. 5.5. Initialization Options
      6. 5.6. Module Options
      7. 5.7. Initializing the Device Handling Layer: net_dev_init
        1. 5.7.1. Legacy Code
      8. 5.8. User-Space Helpers
        1. 5.8.1. kmod
        2. 5.8.2. Hotplug
          1. 5.8.2.1. /sbin/hotplug
      9. 5.9. Virtual Devices
        1. 5.9.1. Examples of Virtual Devices
        2. 5.9.2. Interaction with the Kernel Network Stack
      10. 5.10. Tuning via /proc Filesystem
      11. 5.11. Functions and Variables Featured in This Chapter
      12. 5.12. Files and Directories Featured in This Chapter
    3. 6. The PCI Layer and Network Interface Cards
      1. 6.1. Data Structures Featured in This Chapter
      2. 6.2. Registering a PCI NIC Device Driver
      3. 6.3. Power Management and Wake-on-LAN
      4. 6.4. Example of PCI NIC Driver Registration
      5. 6.5. The Big Picture
      6. 6.6. Tuning via /proc Filesystem
      7. 6.7. Functions and Variables Featured in This Chapter
      8. 6.8. Files and Directories Featured in This Chapter
    4. 7. Kernel Infrastructure for Component Initialization
      1. 7.1. Boot-Time Kernel Options
        1. 7.1.1. Registering a Keyword
        2. 7.1.2. Two-Pass Parsing
        3. 7.1.3. .init.setup Memory Section
        4. 7.1.4. Use of Boot Options to Configure Network Devices
      2. 7.2. Module Initialization Code
        1. 7.2.1. Old Model: Conditional Code
        2. 7.2.2. New Model: Macro-Based Tagging
      3. 7.3. Optimized Macro-Based Tagging
        1. 7.3.1. Initialization Macros for Device Initialization Routines
      4. 7.4. Boot-Time Initialization Routines
        1. 7.4.1. xxx_initcall Macros
          1. 7.4.1.1. Example of __initcall and __exitcall routines: modules
          2. 7.4.1.2. Example of dependency between initialization routines
          3. 7.4.1.3. Legacy code
      5. 7.5. Memory Optimizations
        1. 7.5.1. __init and __exit Macros
        2. 7.5.2. xxx_initcall and __exitcall Sections
        3. 7.5.3. Other Optimizations
        4. 7.5.4. Dynamic Macros' Definition
      6. 7.6. Tuning via /proc Filesystem
      7. 7.7. Functions and Variables Featured in This Chapter
      8. 7.8. Files and Directories Featured in This Chapter
    5. 8. Device Registration and Initialization
      1. 8.1. When a Device Is Registered
      2. 8.2. When a Device Is Unregistered
      3. 8.3. Allocating net_device Structures
      4. 8.4. Skeleton of NIC Registration and Unregistration
      5. 8.5. Device Initialization
        1. 8.5.1. Device Driver Initializations
        2. 8.5.2. Device Type Initialization: xxx_setup Functions
        3. 8.5.3. Optional Initializations and Special Cases
      6. 8.6. Organization of net_device Structures
        1. 8.6.1. Lookups
      7. 8.7. Device State
        1. 8.7.1. Queuing Discipline State
        2. 8.7.2. Registration State
      8. 8.8. Registering and Unregistering Devices
        1. 8.8.1. Split Operations: netdev_run_todo
        2. 8.8.2. Device Registration Status Notification
          1. 8.8.2.1. netdev_chain notification chain
          2. 8.8.2.2. RTnetlink link notifications
      9. 8.9. Device Registration
        1. 8.9.1. register_netdevice Function
      10. 8.10. Device Unregistration
        1. 8.10.1. unregister_netdevice Function
        2. 8.10.2. Reference Counts
          1. 8.10.2.1. Function netdev_wait_allrefs
      11. 8.11. Enabling and Disabling a Network Device
      12. 8.12. Updating the Device Queuing Discipline State
        1. 8.12.1. Interactions with Power Management
          1. 8.12.1.1. Suspending a device
          2. 8.12.1.2. Resuming a device
        2. 8.12.2. Link State Change Detection
          1. 8.12.2.1. Scheduling and processing link state change events
          2. 8.12.2.2. Linkwatch flags
      13. 8.13. Configuring Device-Related Information from User Space
        1. 8.13.1. Ethtool
          1. 8.13.1.1. Drivers that do not support ethtool
        2. 8.13.2. Media Independent Interface (MII)
      14. 8.14. Virtual Devices
      15. 8.15. Locking
      16. 8.16. Tuning via /proc Filesystem
      17. 8.17. Functions and Variables Featured in This Chapter
      18. 8.18. Files and Directories Featured in This Chapter
  5. III. Transmission and Reception
    1. 9. Interrupts and Network Drivers
      1. 9.1. Decisions and Traffic Direction
      2. 9.2. Notifying Drivers When Frames Are Received
        1. 9.2.1. Polling
        2. 9.2.2. Interrupts
        3. 9.2.3. Processing Multiple Frames During an Interrupt
        4. 9.2.4. Timer-Driven Interrupts
        5. 9.2.5. Combinations
        6. 9.2.6. Example
      3. 9.3. Interrupt Handlers
        1. 9.3.1. Reasons for Bottom Half Handlers
        2. 9.3.2. Bottom Halves Solutions
        3. 9.3.3. Concurrency and Locking
        4. 9.3.4. Preemption
        5. 9.3.5. Bottom-Half Handlers
          1. 9.3.5.1. Bottom-half handlers in kernel 2.2
          2. 9.3.5.2. Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq
        6. 9.3.6. Tasklets
        7. 9.3.7. Softirq Initialization
        8. 9.3.8. Pending softirq Handling
          1. 9.3.8.1. __do_softirq function
        9. 9.3.9. Per-Architecture Processing of softirq
        10. 9.3.10. ksoftirqd Kernel Threads
          1. 9.3.10.1. Starting the threads
        11. 9.3.11. Tasklet Processing
        12. 9.3.12. How the Networking Code Uses softirqs
      4. 9.4. softnet_data Structure
        1. 9.4.1. Fields of softnet_data
        2. 9.4.2. Initialization of softnet_data
    2. 10. Frame Reception
      1. 10.1. Interactions with Other Features
      2. 10.2. Enabling and Disabling a Device
      3. 10.3. Queues
      4. 10.4. Notifying the Kernel of Frame Reception: NAPI and netif_rx
        1. 10.4.1. Introduction to the New API (NAPI)
        2. 10.4.2. net_device Fields Used by NAPI
        3. 10.4.3. net_rx_action and NAPI
        4. 10.4.4. Old Versus New Driver Interfaces
        5. 10.4.5. Manipulating poll_list
      5. 10.5. Old Interface Between Device Drivers and Kernel: First Part of netif_rx
        1. 10.5.1. Initial Tasks of netif_rx
        2. 10.5.2. Managing Queues and Scheduling the Bottom Half
      6. 10.6. Congestion Management
        1. 10.6.1. Congestion Management in netif_rx
        2. 10.6.2. Average Queue Length and Congestion-Level Computation
      7. 10.7. Processing the NET_RX_SOFTIRQ: net_rx_action
        1. 10.7.1. Backlog Processing: The process_backlog Poll Virtual Function
        2. 10.7.2. Ingress Frame Processing
          1. 10.7.2.1. Handling special features
    3. 11. Frame Transmission
      1. 11.1. Enabling and Disabling Transmissions
        1. 11.1.1. Scheduling a Device for Transmission
        2. 11.1.2. Queuing Discipline Interface
          1. 11.1.2.1. qdisc_restart function
        3. 11.1.3. dev_queue_xmit Function
          1. 11.1.3.1. Queueful devices
          2. 11.1.3.2. Queueless devices
        4. 11.1.4. Processing the NET_TX_SOFTIRQ: net_tx_action
          1. 11.1.4.1. Watchdog timer
    4. 12. General and Reference Material About Interrupts
      1. 12.1. Statistics
      2. 12.2. Tuning via /proc and sysfs Filesystems
      3. 12.3. Functions and Variables Featured in This Part of the Book
      4. 12.4. Files and Directories Featured in This Part of the Book
    5. 13. Protocol Handlers
      1. 13.1. Overview of Network Stack
        1. 13.1.1. The Big Picture
        2. 13.1.2. Link Layer Choices for Ethernet (LLC and SNAP)
        3. 13.1.3. How the Network Stack Operates
      2. 13.2. Executing the Right Protocol Handler
        1. 13.2.1. Special Media Encapsulation
      3. 13.3. Protocol Handler Organization
      4. 13.4. Protocol Handler Registration
      5. 13.5. Ethernet Versus IEEE 802.3 Frames
        1. 13.5.1. Setting the Packet Type
        2. 13.5.2. Setting the Ethernet Protocol and Length
        3. 13.5.3. Logical Link Control (LLC)
          1. 13.5.3.1. The IPX case
          2. 13.5.3.2. Linux's LLC implementation
          3. 13.5.3.3. Processing ingress LLC frames
        4. 13.5.4. Subnetwork Access Protocol (SNAP)
      6. 13.6. Tuning via /proc Filesystem
      7. 13.7. Functions and Variables Featured in This Chapter
      8. 13.8. Files and Directories Featured in This Chapter
  6. IV. Bridging
    1. 14. Bridging: Concepts
      1. 14.1. Repeaters, Bridges, and Routers
      2. 14.2. Bridges Versus Switches
      3. 14.3. Hosts
      4. 14.4. Merging LANs with Bridges
      5. 14.5. Bridging Different LAN Technologies
      6. 14.6. Address Learning
        1. 14.6.1. Broadcast and Multicast Addresses
        2. 14.6.2. Aging
      7. 14.7. Multiple Bridges
        1. 14.7.1. Bridging Loops
        2. 14.7.2. Loop-Free Topologies
        3. 14.7.3. Defining a Loop-Free Topology
    2. 15. Bridging: The Spanning Tree Protocol
      1. 15.1. Basic Terminology
      2. 15.2. Example of Hierarchical Switched L2 Topology
      3. 15.3. Basic Elements of the Spanning Tree Protocol
        1. 15.3.1. Root Bridge
        2. 15.3.2. Designated Bridges
        3. 15.3.3. Spanning Tree Ports
          1. 15.3.3.1. Port states
          2. 15.3.3.2. Port roles
      4. 15.4. Bridge and Port IDs
      5. 15.5. Bridge Protocol Data Units (BPDUs)
        1. 15.5.1. Configuration BPDU
        2. 15.5.2. Priority Vector
        3. 15.5.3. When to Transmit Configuration BPDUs
        4. 15.5.4. BPDU Aging
      6. 15.6. Defining the Active Topology
        1. 15.6.1. Root Bridge Selection
        2. 15.6.2. Root Port Selection
        3. 15.6.3. Designated Port Selection
        4. 15.6.4. Examples of STP in Action
      7. 15.7. Timers
        1. 15.7.1. Avoiding Temporary Loops
      8. 15.8. Topology Changes
        1. 15.8.1. Short Aging Timer
        2. 15.8.2. Letting All Bridges Know About a Topology Change
        3. 15.8.3. Example of a Topology Change
      9. 15.9. BPDU Encapsulation
      10. 15.10. Transmitting Configuration BPDUs
      11. 15.11. Processing Ingress Frames
        1. 15.11.1. Ingress BPDUs
        2. 15.11.2. Ingress Configuration BPDUs
      12. 15.12. Convergence Time
      13. 15.13. Overview of Newer Spanning Tree Protocols
        1. 15.13.1. Rapid Spanning Tree Protocol (RSTP)
        2. 15.13.2. Multiple Spanning Tree Protocol (MSTP)
    3. 16. Bridging: Linux Implementation
      1. 16.1. Bridge Device Abstraction
      2. 16.2. Important Data Structures
      3. 16.3. Initialization of Bridging Code
      4. 16.4. Creating Bridge Devices and Bridge Ports
      5. 16.5. Creating a New Bridge Device
      6. 16.6. Bridge Device Setup Routine
      7. 16.7. Deleting a Bridge
      8. 16.8. Adding Ports to a Bridge
        1. 16.8.1. Deleting a Bridge Port
      9. 16.9. Enabling and Disabling a Bridge Device
      10. 16.10. Enabling and Disabling a Bridge Port
      11. 16.11. Changing State on a Bridge Port
      12. 16.12. The Big Picture
      13. 16.13. Forwarding Database
        1. 16.13.1. Lookups
        2. 16.13.2. Reference Counts
        3. 16.13.3. Adding, Updating, and Removing Entries
        4. 16.13.4. Aging
      14. 16.14. Handling Ingress Traffic
        1. 16.14.1. Data Frames Versus BPDUs
        2. 16.14.2. Processing Data Frames
      15. 16.15. Transmitting on a Bridge Device
      16. 16.16. Spanning Tree Protocol (STP)
        1. 16.16.1. Key Spanning Tree Routines
        2. 16.16.2. Bridge IDs and Port IDs
        3. 16.16.3. Enabling the Spanning Tree Protocol on a Bridge Device
        4. 16.16.4. Processing Ingress BPDUs
        5. 16.16.5. Transmitting BPDUs
        6. 16.16.6. Configuration Updates
        7. 16.16.7. Root Bridge Selection
          1. 16.16.7.1. Becoming the root bridge
          2. 16.16.7.2. Giving up the root bridge role
        8. 16.16.8. Timers
        9. 16.16.9. Handling Topology Changes
      17. 16.17. netdevice Notification Chain
    4. 17. Bridging: Miscellaneous Topics
      1. 17.1. User-Space Configuration Tools
        1. 17.1.1. Handling Configuration Changes
        2. 17.1.2. Old Interface Versus New Interface
        3. 17.1.3. Creating Bridge Devices and Bridge Ports
        4. 17.1.4. Configuring Bridge Devices and Ports
      2. 17.2. Tuning via /proc Filesystem
      3. 17.3. Tuning via /sys Filesystem
      4. 17.4. Statistics
      5. 17.5. Data Structures Featured in This Part of the Book
        1. 17.5.1. bridge_id Structure
        2. 17.5.2. net_bridge_fdb_entry Structure
        3. 17.5.3. net_bridge_port Structure
        4. 17.5.4. net_bridge Structure
      6. 17.6. Functions and Variables Featured in This Part of the Book
      7. 17.7. Files and Directories Featured in This Part of the Book
  7. V. Internet Protocol Version 4 (IPv4)
    1. 18. Internet Protocol Version 4 (IPv4): Concepts
      1. 18.1. IP Protocol: The Big Picture
      2. 18.2. IP Header
      3. 18.3. IP Options
        1. 18.3.1. "End of Option List" and "No Operation" Options
        2. 18.3.2. Source Route Option
        3. 18.3.3. Record Route Option
        4. 18.3.4. Timestamp Option
        5. 18.3.5. Router Alert Option
      4. 18.4. Packet Fragmentation/Defragmentation
        1. 18.4.1. Effect of Fragmentation on Higher Layers
        2. 18.4.2. IP Header Fields Used by Fragmentation/Defragmentation
        3. 18.4.3. Examples of Problems with Fragmentation/Defragmentation
          1. 18.4.3.1. Retransmissions
          2. 18.4.3.2. Associating fragments with their IP packets
          3. 18.4.3.3. Example of IP ID generation
          4. 18.4.3.4. Example of unsolvable defragmentation problem: NAT
        4. 18.4.4. Path MTU Discovery
      5. 18.5. Checksums
        1. 18.5.1. APIs for Checksum Computation
        2. 18.5.2. Changes to the L4 Checksum
    2. 19. Internet Protocol Version 4 (IPv4): Linux Foundations and Features
      1. 19.1. Main IPv4 Data Structures
        1. 19.1.1. Checksum-Related Fields from sk_buff and net_device Structures
          1. 19.1.1.1. net_device structure
          2. 19.1.1.2. sk_buff structure
      2. 19.2. General Packet Handling
        1. 19.2.1. Protocol Initialization
        2. 19.2.2. Interaction with Netfilter
        3. 19.2.3. Interaction with the Routing Subsystem
        4. 19.2.4. Processing Input IP Packets
        5. 19.2.5. The ip_rcv_finish Function
      3. 19.3. IP Options
        1. 19.3.1. Option Processing
        2. 19.3.2. Option Parsing
          1. 19.3.2.1. Option: strict and loose Source Routing
          2. 19.3.2.2. Option: Record Route
          3. 19.3.2.3. Option: Timestamp
          4. 19.3.2.4. Option: Router Alert
          5. 19.3.2.5. Handling parsing errors
    3. 20. Internet Protocol Version 4 (IPv4): Forwarding and Local Delivery
      1. 20.1. Forwarding
        1. 20.1.1. ICMP Redirect
        2. 20.1.2. ip_forward Function
        3. 20.1.3. ip_forward_finish Function
        4. 20.1.4. dst_output Function
      2. 20.2. Local Delivery
    4. 21. Internet Protocol Version 4 (IPv4): Transmission
      1. 21.1. Key Functions That Perform Transmission
        1. 21.1.1. Multicast Traffic
        2. 21.1.2. Relevant Socket Data Structures for Local Traffic
        3. 21.1.3. The ip_queue_xmit Function
          1. 21.1.3.1. Setting the route
          2. 21.1.3.2. Building the IP header
        4. 21.1.4. The ip_append_data Function
          1. 21.1.4.1. Basic memory allocation and buffer organization for ip_append_data
          2. 21.1.4.2. Memory allocation and buffer organization for ip_append_data with Scatter Gather I/O
          3. 21.1.4.3. Key routines for handling fragmented buffers
          4. 21.1.4.4. Further handling of the buffers
          5. 21.1.4.5. Setting the context
          6. 21.1.4.6. Getting ready for fragment generation
          7. 21.1.4.7. Copying data into the fragments: getfrag
          8. 21.1.4.8. Buffer allocation
          9. 21.1.4.9. Main loop
          10. 21.1.4.10. L4 checksum
        5. 21.1.5. The ip_append_page Function
        6. 21.1.6. The ip_push_pending_frames Function
        7. 21.1.7. Putting Together the Transmission Functions
        8. 21.1.8. Raw Sockets
      2. 21.2. Interface to the Neighboring Subsystem
    5. 22. Internet Protocol Version 4 (IPv4): Handling Fragmentation
      1. 22.1. IP Fragmentation
        1. 22.1.1. Functions Involved with IP Fragmentation
        2. 22.1.2. The ip_fragment Function
        3. 22.1.3. Slow Fragmentation
        4. 22.1.4. Fast Fragmentation
      2. 22.2. IP Defragmentation
        1. 22.2.1. Organization of the IP Fragments Hash Table
        2. 22.2.2. Key Issues in Defragmentation
        3. 22.2.3. Functions Involved with Defragmentation
        4. 22.2.4. New ipq Instance Initialization
        5. 22.2.5. The ip_defrag Function
        6. 22.2.6. The ip_frag_queue Function
          1. 22.2.6.1. Handling overlaps
          2. 22.2.6.2. L4 checksum
        7. 22.2.7. Garbage Collection
        8. 22.2.8. Hash Table Reorganization
    6. 23. Internet Protocol Version 4 (IPv4): Miscellaneous Topics
      1. 23.1. Long-Living IP Peer Information
        1. 23.1.1. Initialization
        2. 23.1.2. Lookups
        3. 23.1.3. How the IP Layer Uses inet_peer Structures
        4. 23.1.4. Garbage Collection
      2. 23.2. Selecting the IP Header's ID Field
      3. 23.3. IP Statistics
      4. 23.4. IP Configuration
        1. 23.4.1. Main Functions That Manipulate IP Addresses and Configuration
        2. 23.4.2. Change Notification: rtmsg_ifa
        3. 23.4.3. inetaddr_chain Notification Chain
        4. 23.4.4. IP Configuration via ip
        5. 23.4.5. IP Configuration via ifconfig
      5. 23.5. IP-over-IP
      6. 23.6. IPv4: What's Wrong with It?
      7. 23.7. Tuning via /proc Filesystem
      8. 23.8. Data Structures Featured in This Part of the Book
        1. 23.8.1. iphdr Structure
        2. 23.8.2. ip_options Structure
        3. 23.8.3. ipcm_cookie Structure
        4. 23.8.4. ipq Structure
        5. 23.8.5. inet_peer Structure
        6. 23.8.6. ipstats_mib Structure
        7. 23.8.7. in_device Structure
        8. 23.8.8. in_ifaddr Structure
        9. 23.8.9. ipv4_devconf Structure
        10. 23.8.10. ipv4_config Structure
        11. 23.8.11. cork Structure
        12. 23.8.12. skb_frag_t Structure
      9. 23.9. Functions and Variables Featured in This Part of the Book
      10. 23.10. Files and Directories Featured in This Part of the Book
    7. 24. Layer Four Protocol and Raw IP Handling
      1. 24.1. Available L4 Protocols
      2. 24.2. L4 Protocol Registration
        1. 24.2.1. Registration: inet_add_protocol and inet_del_protocol
      3. 24.3. L3 to L4 Delivery: ip_local_deliver_finish
        1. 24.3.1. Raw Sockets and Raw IP
        2. 24.3.2. Delivering Raw Input Datagrams to the Recipient Application
        3. 24.3.3. IPsec
      4. 24.4. IPv4 Versus IPv6
      5. 24.5. Tuning via /proc Filesystem
      6. 24.6. Functions and Variables Featured in This Chapter
      7. 24.7. Files and Directories Featured in This Chapter
    8. 25. Internet Control Message Protocol (ICMPv4)
      1. 25.1. ICMP Header
      2. 25.2. ICMP Payload
      3. 25.3. ICMP Types
        1. 25.3.1. ICMP_ECHO and ICMP_ECHOREPLY
        2. 25.3.2. ICMP_DEST_UNREACH
        3. 25.3.3. ICMP_SOURCE_QUENCH
        4. 25.3.4. ICMP_REDIRECT
        5. 25.3.5. ICMP_TIME_EXCEEDED
        6. 25.3.6. ICMP_PARAMETERPROB
        7. 25.3.7. ICMP_TIMESTAMP and ICMP_TIMESTAMPREPLY
        8. 25.3.8. ICMP_INFO_REQUEST and ICMP_INFO_REPLY
        9. 25.3.9. ICMP_ADDRESS and ICMP_ADDRESSREPLY
      4. 25.4. Applications of the ICMP Protocol
        1. 25.4.1. ping
        2. 25.4.2. traceroute
      5. 25.5. The Big Picture
      6. 25.6. Protocol Initialization
      7. 25.7. Data Structures Featured in This Chapter
        1. 25.7.1. icmphdr Structure
        2. 25.7.2. icmp_control Structure
        3. 25.7.3. icmp_bxm Structure
      8. 25.8. Transmitting ICMP Messages
        1. 25.8.1. Transmitting ICMP Error Messages
        2. 25.8.2. Replying to Ingress ICMP Messages
        3. 25.8.3. Rate Limiting
        4. 25.8.4. Implementation of Rate Limiting
        5. 25.8.5. Receiving ICMP Messages
        6. 25.8.6. Processing ICMP_ECHO and ICMP_ECHOREPLY Messages
        7. 25.8.7. Processing the Common ICMP Messages
        8. 25.8.8. Processing ICMP_REDIRECT Messages
        9. 25.8.9. Processing ICMP_TIMESTAMP and ICMP_TIMESTAMPREPLY Messages
        10. 25.8.10. Processing ICMP_ADDRESS and ICMP_ADDRESSREPLY Messages
      9. 25.9. ICMP Statistics
      10. 25.10. Passing Error Notifications to the Transport Layer
      11. 25.11. Tuning via /proc Filesystem
      12. 25.12. Functions and Variables Featured in This Chapter
      13. 25.13. Files and Directories Featured in This Chapter
  8. VI. Neighboring Subsystem
    1. 26. Neighboring Subsystem: Concepts
      1. 26.1. What Is a Neighbor?
      2. 26.2. Reasons That Neighboring Protocols Are Needed
        1. 26.2.1. When L3 Addresses Need to Be Translated to L2 Addresses
        2. 26.2.2. Shared Medium
        3. 26.2.3. Why Static Assignment of Addresses Is Not Sufficient
        4. 26.2.4. Special Cases
        5. 26.2.5. Solicitation Requests and Replies
      3. 26.3. Linux Implementation
        1. 26.3.1. Neighboring Protocols
      4. 26.4. Proxying the Neighboring Protocol
        1. 26.4.1. Conditions Required by the Proxy
      5. 26.5. When Solicitation Requests Are Transmitted and Processed
      6. 26.6. Neighbor States and Network Unreachability Detection (NUD)
        1. 26.6.1. Reachability
        2. 26.6.2. Transitions Between NUD States
          1. 26.6.2.1. Basic states
          2. 26.6.2.2. Derived states
          3. 26.6.2.3. Initial state
        3. 26.6.3. Reachability Confirmation
    2. 27. Neighboring Subsystem: Infrastructure
      1. 27.1. Main Data Structures
      2. 27.2. Common Interface Between L3 Protocols and Neighboring Protocols
        1. 27.2.1. Initialization of neigh->ops
        2. 27.2.2. Initialization of neigh->output and neigh->nud_state
          1. 27.2.2.1. Common state changes: neigh_connect and neigh_suspect
          2. 27.2.2.2. Routines used for neigh->output
        3. 27.2.3. Updating a Neighbor's Information: neigh_update
          1. 27.2.3.1. neigh_update optimization
          2. 27.2.3.2. Initial neigh_update operations
          3. 27.2.3.3. Changes of link layer address
          4. 27.2.3.4. Notifications to arpd
      3. 27.3. General Tasks of the Neighboring Infrastructure
        1. 27.3.1. Caching
        2. 27.3.2. Timers
      4. 27.4. Reference Counts on neighbour Structures
      5. 27.5. Creating a neighbour Entry
        1. 27.5.1. The neigh_create Function's Parameters
        2. 27.5.2. Neighbor Initialization
      6. 27.6. Neighbor Deletion
        1. 27.6.1. Garbage Collection
          1. 27.6.1.1. Synchronous cleanup: the neigh_forced_gc function
          2. 27.6.1.2. Asynchronous cleanup: the neigh_periodic_timer function
      7. 27.7. Acting As a Proxy
        1. 27.7.1. Delayed Processing of Solicitation Requests
        2. 27.7.2. Per-Device Proxying and Per-Destination Proxying
      8. 27.8. L2 Header Caching
        1. 27.8.1. Methods Provided by the Device Driver
        2. 27.8.2. Link Between Routing and L2 Header Caching
        3. 27.8.3. Cache Invalidation and Updating
      9. 27.9. Protocol Initialization and Cleanup
      10. 27.10. Interaction with Other Subsystems
        1. 27.10.1. Events Generated by the Neighboring Layer
        2. 27.10.2. Events Received by the Neighboring Layer
          1. 27.10.2.1. Updates via neigh_ifdown
          2. 27.10.2.2. Updates via neigh_changeaddr (netdevice notification chain)
      11. 27.11. Interaction Between Neighboring Protocols and L3 Transmission Functions
      12. 27.12. Queuing
        1. 27.12.1. Ingress Queuing
        2. 27.12.2. Egress Queuing
    3. 28. Neighboring Subsystem: Address Resolution Protocol (ARP)
      1. 28.1. ARP Packet Format
        1. 28.1.1. Destination Address Types for ARP Packets
      2. 28.2. Example of an ARP Transaction
      3. 28.3. Gratuitous ARP
        1. 28.3.1. Change of L2 Address
        2. 28.3.2. Duplicate Address Detection
        3. 28.3.3. Virtual IP
      4. 28.4. Responding from Multiple Interfaces
      5. 28.5. Tunable ARP Options
        1. 28.5.1. Compile-Time Options
        2. 28.5.2. /proc Options
          1. 28.5.2.1. ARP_ANNOUNCE
          2. 28.5.2.2. ARP_IGNORE
          3. 28.5.2.3. ARP_FILTER
          4. 28.5.2.4. Medium ID
      6. 28.6. ARP Protocol Initialization
        1. 28.6.1. The arp_tbl Table
      7. 28.7. Initialization of a neighbour Structure
        1. 28.7.1. Basic Initialization Sequence
        2. 28.7.2. Virtual Functions in the ops Field
        3. 28.7.3. Start of the arp_constructor Function
        4. 28.7.4. Devices That Do Not Need ARP
        5. 28.7.5. Devices That Need ARP
      8. 28.8. Transmitting and Receiving ARP Packets
        1. 28.8.1. Transmitting ARP Packets: Introduction to arp_send
        2. 28.8.2. Solicitations
          1. 28.8.2.1. ARP_ANNOUNCE and selection of source IP address
      9. 28.9. Processing Ingress ARP Packets
        1. 28.9.1. Initial Common Processing
        2. 28.9.2. Processing ARPOP_REQUEST Packets
          1. 28.9.2.1. Passive learning and ARP optimization
          2. 28.9.2.2. Requests with zero addresses
        3. 28.9.3. Processing ARPOP_REPLY Packets
        4. 28.9.4. Final Common Processing
      10. 28.10. Proxy ARP
        1. 28.10.1. Destination NAT (DNAT)
        2. 28.10.2. Proxy ARP Server as Router
      11. 28.11. Examples
      12. 28.12. External Events
        1. 28.12.1. Received Events
        2. 28.12.2. Generated Events
        3. 28.12.3. Wake-on-LAN Events
      13. 28.13. ARPD
        1. 28.13.1. Kernel Side
        2. 28.13.2. User-Space Side
      14. 28.14. Reverse Address Resolution Protocol (RARP)
      15. 28.15. Improvements in ND (IPv6) over ARP (IPv4)
    4. 29. Neighboring Subsystem: Miscellaneous Topics
      1. 29.1. System Administration of Neighbors
        1. 29.1.1. Common Routines
        2. 29.1.2. New-Generation Tool: IPROUTE2's ip Command
        3. 29.1.3. Old-Generation Tool: net-tools's arp Command
      2. 29.2. Tuning via /proc Filesystem
        1. 29.2.1. The /proc/sys/net/ipv4/neigh Directory
          1. 29.2.1.1. Initialization of global and per-device directories
          2. 29.2.1.2. Directory creation
        2. 29.2.2. The /proc/sys/net/ipv4/conf Directory
      3. 29.3. Data Structures Featured in This Part of the Book
        1. 29.3.1. neighbour Structure
        2. 29.3.2. neigh_table Structure
        3. 29.3.3. neigh_parms Structure
        4. 29.3.4. neigh_ops Structure
        5. 29.3.5. hh_cache Structure
        6. 29.3.6. neigh_statistics Structure
        7. 29.3.7. Data Structures Featured in This Part of the Book
      4. 29.4. Files and Directories Featured in This Part of the Book
  9. VII. Routing
    1. 30. Routing: Concepts
      1. 30.1. Routers, Routes, and Routing Tables
        1. 30.1.1. Nonrouting Multihomed Hosts
        2. 30.1.2. Varieties of Routing Configurations
        3. 30.1.3. Questions Answered in This Part of the Book
      2. 30.2. Essential Elements of Routing
        1. 30.2.1. Scope
          1. 30.2.1.1. Use of the scope
        2. 30.2.2. Default Gateway
        3. 30.2.3. Directed Broadcasts
        4. 30.2.4. Primary and Secondary Addresses
          1. 30.2.4.1. Old-generation configuration: aliasing interfaces
          2. 30.2.4.2. Relationship between aliasing devices and primary/secondary status
      3. 30.3. Routing Table
        1. 30.3.1. Special Routes
        2. 30.3.2. Route Types and Actions
        3. 30.3.3. Routing Cache
        4. 30.3.4. Routing Table Versus Routing Cache
        5. 30.3.5. Routing Cache Garbage Collection
          1. 30.3.5.1. Examples of events that can expire cache entries
          2. 30.3.5.2. Examples of eligible cache victims
      4. 30.4. Lookups
        1. 30.4.1. Longest Prefix Match
      5. 30.5. Packet Reception Versus Packet Transmission
    2. 31. Routing: Advanced
      1. 31.1. Concepts Behind Policy Routing
        1. 31.1.1. Lookup with Policy Routing
        2. 31.1.2. Routing Table Selection
      2. 31.2. Concepts Behind Multipath Routing
        1. 31.2.1. Next Hop Selection
        2. 31.2.2. Cache Support for Multipath
          1. 31.2.2.1. Weighted random algorithm
          2. 31.2.2.2. Device round-robin algorithm
        3. 31.2.3. Per-Flow, Per-Connection, and Per-Packet Distribution
          1. 31.2.3.1. Equalizer algorithm
      3. 31.3. Interactions with Other Kernel Subsystems
        1. 31.3.1. Routing Table Based Classifier
          1. 31.3.1.1. Configuring policy realms
          2. 31.3.1.2. Configuring route realms
          3. 31.3.1.3. Computing the routing tag
        2. 31.3.2. Policy Routing and Firewall-Based Classifier
      4. 31.4. Routing Protocol Daemons
      5. 31.5. Verbose Monitoring
      6. 31.6. ICMP_REDIRECT Messages
        1. 31.6.1. Shared Media
        2. 31.6.2. Transmitting ICMP_REDIRECT Messages
        3. 31.6.3. Processing Ingress ICMP_REDIRECT Messages
      7. 31.7. Reverse Path Filtering
    3. 32. Routing: Linux Implementation
      1. 32.1. Kernel Options
        1. 32.1.1. Basic Options
        2. 32.1.2. Advanced Options
        3. 32.1.3. Recently Dropped Options
      2. 32.2. Main Data Structures
        1. 32.2.1. Lists and Hash Tables
      3. 32.3. Route and Address Scopes
        1. 32.3.1. Route Scopes
        2. 32.3.2. Address Scopes
        3. 32.3.3. Relationship Between Route and Next-Hop Scopes
      4. 32.4. Primary and Secondary IP Addresses
      5. 32.5. Generic Helper Routines and Macros
      6. 32.6. Global Locks
      7. 32.7. Routing Subsystem Initialization
      8. 32.8. External Events
        1. 32.8.1. Helper Routines
        2. 32.8.2. Changes in IP Configuration
          1. 32.8.2.1. Adding an IP address
          2. 32.8.2.2. Removing an IP address
        3. 32.8.3. Changes in Device Status
          1. 32.8.3.1. Impacts on the routing tables
          2. 32.8.3.2. Impacts on the policy database
          3. 32.8.3.3. Impacts on the IP configuration
      9. 32.9. Interactions with Other Subsystems
        1. 32.9.1. Netlink Notifications
        2. 32.9.2. Policy Routing and Firewall-Based Classifier
        3. 32.9.3. Routing Protocol Daemons
    4. 33. Routing: The Routing Cache
      1. 33.1. Routing Cache Initialization
      2. 33.2. Hash Table Organization
      3. 33.3. Major Cache Operations
        1. 33.3.1. Cache Locking
        2. 33.3.2. Cache Entry Allocation and Reference Counts
        3. 33.3.3. Adding Elements to the Cache
        4. 33.3.4. Binding the Route Cache to the ARP Cache
        5. 33.3.5. Cache Lookup
          1. 33.3.5.1. Ingress lookup
          2. 33.3.5.2. Egress lookup
      4. 33.4. Multipath Caching
        1. 33.4.1. Registering a Caching Algorithm
        2. 33.4.2. Interface Between the Routing Cache and Multipath
        3. 33.4.3. Helper Routines
        4. 33.4.4. Common Elements Between Algorithms
        5. 33.4.5. Random Algorithm
        6. 33.4.6. Weighted Random Algorithm
        7. 33.4.7. Round-Robin Algorithm
        8. 33.4.8. Device Round-Robin Algorithm
      5. 33.5. Interface Between the DST and Calling Protocols
        1. 33.5.1. IPsec Transformations and the Use of dst_entry
        2. 33.5.2. External Events
      6. 33.6. Flushing the Routing Cache
      7. 33.7. Garbage Collection
        1. 33.7.1. Synchronous Cleanup
        2. 33.7.2. rt_garbage_collect Function
        3. 33.7.3. Asynchronous Cleanup
        4. 33.7.4. Expiration Criteria
        5. 33.7.5. Deleting DST Entries
        6. 33.7.6. Variables That Tune and Control Garbage Collection
      8. 33.8. Egress ICMP REDIRECT Rate Limiting
    5. 34. Routing: Routing Tables
      1. 34.1. Organization of Routing Hash Tables
        1. 34.1.1. Organization of Per-Netmask Tables
          1. 34.1.1.1. Basic structures for hash table organization
          2. 34.1.1.2. Dynamic resizing of per-netmask hash tables
        2. 34.1.2. Organization of fib_info Structures
          1. 34.1.2.1. Dynamic resizing of global hash tables
        3. 34.1.3. Organization of Next-Hop Router Structures
        4. 34.1.4. The Two Default Routing Tables: ip_fib_main_table and ip_fib_local_table
      2. 34.2. Routing Table Initialization
      3. 34.3. Adding and Removing Routes
        1. 34.3.1. Adding a Route
        2. 34.3.2. Deleting a Route
        3. 34.3.3. Garbage Collection
      4. 34.4. Policy Routing and Its Effects on Routing Table Definitions
        1. 34.4.1. Variable and Structure Definitions
        2. 34.4.2. Double Definitions for Functions
    6. 35. Routing: Lookups
      1. 35.1. High-Level View of Lookup Functions
      2. 35.2. Helper Routines
      3. 35.3. The Table Lookup: fn_hash_lookup
        1. 35.3.1. Semantic Matching on Subsidiary Criteria
          1. 35.3.1.1. Criteria for rejecting routes
          2. 35.3.1.2. Return value from fib_semantic_match
      4. 35.4. fib_lookup Function
      5. 35.5. Setting Functions for Reception and Transmission
        1. 35.5.1. Initialization of Function Pointers for Ingress Traffic
        2. 35.5.2. Initialization of Function Pointers for Egress Traffic
        3. 35.5.3. Special Cases
      6. 35.6. General Structure of the Input and Output Routing Routines
      7. 35.7. Input Routing
        1. 35.7.1. Creation of a Cache Entry
        2. 35.7.2. Preferred Source Address Selection
        3. 35.7.3. Local Delivery
        4. 35.7.4. Forwarding
        5. 35.7.5. Routing Failure
      8. 35.8. Output Routing
        1. 35.8.1. Search Key Initialization
        2. 35.8.2. Selecting the Source IP Address
        3. 35.8.3. Local Delivery
        4. 35.8.4. Transmission to Other Hosts
        5. 35.8.5. Interaction Between Multipath and Default Gateway Selection
        6. 35.8.6. Default Gateway Selection
        7. 35.8.7. fn_hash_select_default Function
      9. 35.9. Effects of Multipath on Next Hop Selection
        1. 35.9.1. Multipath Caching
      10. 35.10. Policy Routing
        1. 35.10.1. fib_lookup with Policy Routing
        2. 35.10.2. Default Gateway Selection with Policy Routing
      11. 35.11. Source Routing
      12. 35.12. Policy Routing and Routing Table Based Classifier
        1. 35.12.1. Storing the Realms
        2. 35.12.2. Helper Routines
        3. 35.12.3. Computing the Routing Tag
    7. 36. Routing: Miscellaneous Topics
      1. 36.1. User-Space Configuration Tools
        1. 36.1.1. Configuring Routing with IPROUTE2
          1. 36.1.1.1. Correspondence between IPROUTE2 user commands and kernel functions
          2. 36.1.1.2. inet_rtm_newroute and inet_rtm_delroute functions
        2. 36.1.2. Configuring Routing with net-tools
        3. 36.1.3. Change Notifications
        4. 36.1.4. Routes Inserted by the Kernel: The fib_magic Function
      2. 36.2. Statistics
      3. 36.3. Tuning via /proc Filesystem
        1. 36.3.1. The /proc/sys/net/ipv4 Directory
        2. 36.3.2. The /proc/sys/net/ipv4/route Directory
        3. 36.3.3. The /proc/sys/net/ipv4/conf Directory
          1. 36.3.3.1. Special subdirectories
          2. 36.3.3.2. Use of the special subdirectories
          3. 36.3.3.3. File descriptions
        4. 36.3.4. The /proc/net and /proc/net/stat Directories
      4. 36.4. Enabling and Disabling Forwarding
      5. 36.5. Data Structures Featured in This Part of the Book
        1. 36.5.1. fib_table Structure
        2. 36.5.2. fn_zone Structure
        3. 36.5.3. fib_node Structure
        4. 36.5.4. fib_alias Structure
        5. 36.5.5. fib_info Structure
        6. 36.5.6. fib_nh Structure
        7. 36.5.7. fib_rule Structure
        8. 36.5.8. fib_result Structure
        9. 36.5.9. rtable Structure
        10. 36.5.10. dst_entry Structure
        11. 36.5.11. dst_ops Structure
        12. 36.5.12. flowi Structure
        13. 36.5.13. rt_cache_stat Structure
        14. 36.5.14. ip_mp_alg_ops Structure
      6. 36.6. Functions and Variables Featured in This Part of the Book
      7. 36.7. Files and Directories Featured in This Part of the Book
  10. About the Author
  11. Colophon
  12. Copyright