Chapter 1. Introduction

To do research in the source code of a large project is to enter a strange, new land with its own customs and unspoken expectations. It is useful to learn some of the major conventions up front, and to try interacting with the inhabitants instead of merely standing back and observing.

The bulk of this chapter is devoted to introducing you to a few of the common programming patterns and tricks that you’ll often meet in the networking code.

I encourage you, when possible, to try interacting with a given part of the kernel networking code by means of user-space tools. So in this chapter, I’ll give you a few pointers as to where you can download those tools if they’re not already installed on your preferred Linux distribution, or if you simply want to upgrade them to the latest versions.

I’ll also describe some tools that let you find your way gracefully through the enormous kernel code. Finally, I’ll explain briefly why a kernel feature may not be integrated into the official kernel releases, even if it is widely used in the Linux community.

Basic Terminology

In this section, I’ll introduce terms and abbreviations that are going to be used extensively in this book.

Eight-bit quantities are normally called octets in the networking literature. In this book, however, I use the more familiar term byte. After all, the book describes the behavior of the kernel rather than some network abstraction, and kernel developers are used to thinking in terms of bytes.

The terms vector and array will be used interchangeably.

When referring to the layers of the TCP/IP network stack, I will use the abbreviations L2, L3, and L4 to refer to the link, network, and transport layers, respectively. The numbers are based on the famous (if not exactly current) seven-layer OSI model. In most cases, L2 will be a synonym for Ethernet, L3 for IP Version 4 or 6, and L4 for UDP, TCP, or ICMP. When I need to refer to a specific protocol, I’ll use its name (i.e., TCP) rather than the generic Ln protocol term.

In different chapters, we will see how data units are received and transmitted by the protocols that sit at a given layer in the network stack. In those contexts, the terms ingress and input will be used interchangeably. The same applies to egress and output. The action of receiving or transmitting a data unit may be referred to with the abbreviations RX and TX, respectively.

A data unit is given different names, such as frame, packet, segment, and message, depending on the layer where it is used (see Chapter 13 for more details). Table 1-1 summarizes the major abbreviations you’ll see in the book.

Table 1-1. Abbreviations used frequently in this book

    Abbreviation    Meaning
    ------------    -------
    L2              Link layer (e.g., Ethernet)
    L3              Network layer (e.g., IP)
    L4              Transport layer (e.g., UDP/TCP/ICMP)
    BH              Bottom half
    IRQ             Interrupt
    RX              Reception
    TX              Transmission

Common Coding Patterns

Each networking feature, like any other kernel feature, is just one of the citizens inside the kernel. As such, it must make proper and fair use of memory, CPU, and all other shared resources. Most features are not written as standalone pieces of kernel code, but interact with other kernel components more or less heavily depending on the feature. They therefore try, as much as possible, to follow similar mechanisms to implement similar functionalities (there is no need to reinvent the wheel every time).

Some requirements are common to several kernel components, such as the need to allocate several instances of the same data structure type, or the need to keep track of references to an instance of a data structure to avoid unsafe memory deallocations. In the following subsections, we will look at the common ways Linux handles such requirements. I will also talk about common coding tricks that you may come across while browsing the kernel's code.

This book uses subsystem as a loose term to describe a collection of files that implement a major set of features—such as IP or routing—and that tend to be maintained by the same people and to change in lockstep. In the rest of the chapter, I’ll also use the term kernel component to refer to these subsystems, because the conventions discussed here apply to most parts of the kernel, not just those involved in networking.

Memory Caches

The kernel uses the kmalloc and kfree functions to allocate and free a memory block, respectively. The syntax of those two functions is similar to that of the two sister calls, malloc and free, from the libc user-space library. For more details on kmalloc and kfree, please refer to Linux Device Drivers (O’Reilly).
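As a minimal sketch of the pattern (struct foo and the error handling are hypothetical), an allocation in kernel code typically looks like this:

struct foo *f;

f = kmalloc(sizeof(*f), GFP_KERNEL);    /* GFP_KERNEL may sleep; use GFP_ATOMIC in interrupt context */
if (!f)
    return -ENOMEM;                     /* allocations can fail: always check */
...
kfree(f);                               /* give the block back when done */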

It is common for a kernel component to allocate several instances of the same data structure type. When allocation and deallocation are expected to happen often, the associated kernel component initialization routine (for example, fib_hash_init for the routing table) usually allocates a special memory cache that will be used for the allocations. When a memory block is freed, it is actually returned to the same cache from which it was allocated.

Some examples of network data structures for which the kernel maintains dedicated memory caches include:

Socket buffer descriptors

This cache, allocated by skb_init in net/core/skbuff.c, is used for the allocation of sk_buff buffer descriptors. The sk_buff structure is probably the one that registers the highest number of allocations and deallocations in the networking subsystem.

Neighboring protocol mappings

Each neighboring protocol uses a memory cache to allocate the data structures that store L3-to-L2 address mappings. See Chapter 27.

Routing tables

The routing code uses two memory caches for two of the data structures that define routes. See Chapter 32.

Here are the key kernel functions used to deal with memory caches:

kmem_cache_create

kmem_cache_destroy

Create and destroy a cache.

kmem_cache_alloc

kmem_cache_free

Allocate a buffer from, and return a buffer to, the cache. They are usually called via wrappers, which manage the requests for allocation and deallocation at a higher level. For example, the request to free an instance of an sk_buff buffer with kfree_skb ends up calling kmem_cache_free only when all the references to the buffer have been released and all the necessary cleanup has been done by the interested subsystems (for instance, the firewall).

The limit on the number of instances that can be allocated from a given cache (when present) is usually enforced by the wrappers around kmem_cache_alloc, and is sometimes configurable with a parameter in /proc.
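To give a feel for the pattern, here is a minimal sketch of a dedicated cache; the names foo_cachep and struct foo are hypothetical, and the kmem_cache_create signature shown is the one used by the 2.6 kernels covered in this book:

static kmem_cache_t *foo_cachep;

static int __init foo_init(void)
{
    /* one cache dedicated to struct foo instances */
    foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo),
                                   0, SLAB_HWCACHE_ALIGN, NULL, NULL);
    return foo_cachep ? 0 : -ENOMEM;
}

struct foo *foo_alloc(void)
{
    return kmem_cache_alloc(foo_cachep, GFP_KERNEL);
}

void foo_free(struct foo *f)
{
    kmem_cache_free(foo_cachep, f);     /* returned to the cache, not to the general allocator */
}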

For more details on how memory caches are implemented and how they interface to the slab allocator, please refer to Understanding the Linux Kernel (O’Reilly).

Caching and Hash Tables

It is pretty common to use a cache to increase performance. In the networking code, there are caches for L3-to-L2 mappings (such as the ARP cache used by IPv4), for the routing table cache, etc.

Cache lookup routines often take an input parameter that says whether a cache miss should or should not create a new element and add it to the cache. Other lookup routines simply add missing elements all the time.

Caches are often implemented with hash tables. The kernel provides a set of data types, such as one-way and bidirectional lists, that can be used as building blocks for simple hash tables.

The standard way to handle inputs that hash to the same value is to put them in a list. Traversing this list takes substantially longer than using the hash key to do a lookup. Therefore, it is always important to minimize the number of inputs that hash to the same value.
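As an illustration, here is a minimal sketch of such a table built on the kernel's hlist types from include/linux/list.h (all foo_* names are hypothetical):

#define FOO_HASH_SIZE 16                          /* number of buckets */
static struct hlist_head foo_hash[FOO_HASH_SIZE];

struct foo {
    struct hlist_node hlist;                      /* links items within one bucket */
    unsigned int key;
};

static inline unsigned int foo_hashfn(unsigned int key)
{
    return key % FOO_HASH_SIZE;                   /* trivial hash, for illustration only */
}

static void foo_insert(struct foo *f)
{
    hlist_add_head(&f->hlist, &foo_hash[foo_hashfn(f->key)]);
}

static struct foo *foo_lookup(unsigned int key)
{
    struct hlist_node *n;

    /* traverse the collision list of the bucket the key hashes to */
    hlist_for_each(n, &foo_hash[foo_hashfn(key)]) {
        struct foo *f = hlist_entry(n, struct foo, hlist);
        if (f->key == key)
            return f;
    }
    return NULL;
}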

When the lookup time on a hash table (whether it uses a cache or not) is a critical parameter for the owner subsystem, it may implement a mechanism to increase the size of the hash table so that the average length of the collision lists goes down and the average lookup time improves. See the section "Dynamic resizing of per-netmask hash tables" in Chapter 34 for an example.

You may also find subsystems, such as the neighboring layer, that add a random component (regularly changed) to the key used to distribute elements in the cache’s buckets. This is used to reduce the damage of Denial of Service (DoS) attacks aimed at concentrating the elements of a hash table into a single bucket. See the section "Caching" in Chapter 27 for an example.

Reference Counts

When a piece of code tries to access a data structure that has already been freed, the kernel is not very happy, and the user is rarely happy with the kernel’s reaction. To avoid those nasty problems, and to make garbage collection mechanisms easier and more effective (see the section "Garbage Collection" later in this chapter), most data structures keep a reference count. Good kernel citizens increment and decrement the reference count of every data structure every time they save and release a reference, respectively, to the structure. For any data structure type that requires a reference count, the kernel component that owns the structure usually exports two functions that can be used to increment and decrement the reference count. Such functions are usually called xxx _hold and xxx _release, respectively. Sometimes the release function is called xxx _put instead (e.g., dev_put for net_device structures).
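A minimal sketch of the convention, for a hypothetical reference-counted struct foo, could look like this:

struct foo {
    atomic_t refcnt;
    ...
};

static inline void foo_hold(struct foo *f)
{
    atomic_inc(&f->refcnt);
}

static inline void foo_release(struct foo *f)
{
    if (atomic_dec_and_test(&f->refcnt))    /* true when the count drops to zero */
        kfree(f);                           /* last reference gone: safe to free */
}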

While we like to assume there are no bad citizens in the kernel, developers are human, and as such they do not always write bug-free code. The use of the reference count is a simple but effective mechanism to avoid freeing still-referenced data structures. However, it does not always solve the problem completely. Here is what happens when increments and decrements are not balanced:

  • If you release a reference to a data structure but forget to call the xxx _release function, the kernel will never allow the data structure to be freed (unless another buggy piece of code happens to call the release function an extra time by mistake!). This leads to gradual memory exhaustion.

  • If you take a reference to a data structure but forget to call xxx _hold, and at some later point you happen to be the only reference holder, the structure will be prematurely freed because you are not accounted for. This case can be far more catastrophic than the previous one: your next attempt to access the structure can corrupt other data or cause a kernel panic that brings down the whole system instantly.

When a data structure is to be removed for some reason, the reference holders can be explicitly notified about its going away so that they can politely release their references. This is done through notification chains. See the section "Reference Counts" in Chapter 8 for an interesting example.

The reference count on a data structure typically can be incremented when:

  • There is a close relationship between two data structure types. In this case, one of the two often maintains a pointer initialized to the address of the second one.

  • A timer is started whose handler is going to access the data structure. When the timer is started, the reference count on the structure is incremented, because the last thing you want is for the data structure to be freed before the timer expires.

  • A successful lookup on a list or a hash table returns a pointer to the matching element. In most cases, the returned result is used by the caller to carry out some task. Because of that, it is common for a lookup routine to increase the reference count on the matching element, and let the caller release it when necessary.

When the last reference to a data structure is released, it may be freed because it is not needed anymore, but not necessarily.

The introduction of the new sysfs filesystem has helped to make a good portion of the kernel code more aware of reference counts and consistent in its use of them.

Garbage Collection

Memory is a shared and limited resource and should not be wasted, particularly in the kernel because it does not use virtual memory. Most kernel subsystems implement some sort of garbage collection to reclaim the memory held by unused or stale data structure instances. Depending on the needs of any given feature, you will find two main kinds of garbage collection:

Asynchronous

This type of garbage collection is unrelated to particular events. A timer that expires regularly invokes a routine that scans a set of data structures and frees the ones considered eligible for deletion. The conditions that make a data structure eligible for deletion depend on the features and logic of the subsystem, but a common criterion is the presence of a null reference count.

Synchronous

There are cases where a shortage of memory, which cannot wait for the asynchronous garbage collection timer to kick in, triggers immediate garbage collection. The criteria used to select the data structures eligible for deletion are not necessarily the same ones used by asynchronous cleanup (for instance, they could be more aggressive). See Chapter 33 for an example.

In Chapter 7, you will see how the kernel manages to reclaim the memory used by initialization routines and that is no longer needed after they have been executed.

Function Pointers and Virtual Function Tables (VFTs)

Function pointers are a convenient way to write clean C code while getting some of the benefits of object-oriented languages. In the definition of a data structure type (the object), you include a set of function pointers (the methods). Some or all manipulations of the structure are then done through the embedded functions. C-language function pointers in data structures look like this:

struct sock {
    ...
    void    (*sk_state_change)(struct sock *sk);
    void    (*sk_data_ready)(struct sock *sk, int bytes);
    ...
};

A key advantage to using function pointers is that they can be initialized differently depending on various criteria and the role played by the object. Thus, invoking sk_state_change may actually invoke different functions for different sock sockets.

Function pointers are used extensively in the networking code. The following are only a few examples:

  • When an ingress or egress packet is processed by the routing subsystem, it initializes two routines in the buffer data structure. You will see this in Chapter 35. Refer to Chapter 2 for a complete list of function pointers included in the sk_buff data structure.

  • When a packet is ready for transmission on the networking hardware, it is handed to the hard_start_xmit function pointer of the net_device data structure. That routine is initialized by the device driver associated with the device.

  • When an L3 protocol wants to transmit a packet, it invokes one of a set of function pointers. These have been initialized to a set of routines by the address resolution protocol associated with the L3 protocol. Depending on the actual routine to which the function pointer is initialized, a transparent L3-to-L2 address resolution may take place (for example, IPv4 packets go through ARP). When the address resolution is unnecessary, a different routine is used. See Part VI for a detailed discussion on this interface.

We see in the preceding examples how function pointers can be employed as interfaces between kernel components or as generic mechanisms to invoke the right function handler at the right time based on the result of something done by a different subsystem. There are cases where function pointers are also used as a simple way to allow protocols, device drivers, or any other feature to personalize an action.

Let’s look at an example. When a device driver registers a network device with the kernel, it goes through a series of steps that are needed regardless of the device type. At some point, it invokes a function pointer on the net_device data structure to let the device driver do something extra if needed. The device driver could either initialize that function pointer to a function of its own, or leave the pointer NULL because the default steps performed by the kernel are sufficient.

A check on the value of a function pointer is always necessary before executing it to avoid NULL pointer dereferences, as shown in this snapshot from register_netdevice:

    if (dev->init && dev->init(dev) != 0) {
        ...
    }

Function pointers have one main drawback: they make browsing the source code a little harder. While going through a given code path, you may end up focusing on a function pointer call. In such cases, before proceeding down the code path, you need to find out how the function pointer has been initialized. It could depend on different factors:

  • When the selection of the routine to assign to a function pointer is based on a particular piece of data, such as the protocol handling the data or the device driver a given packet is received from, it is easier to derive the routine. For example, if a given device is managed by the drivers/net/3c59x.c device driver, you can derive the routine to which a given function pointer of the net_device data structure is initialized by reading the device initialization routine provided by the device driver.

  • When the selection of the routine is based instead on more complex logic, such as the state of the resolution of an L3-to-L2 address mapping, the routine used at any time depends on external events that cannot be predicted.

A set of function pointers grouped into a data structure are often referred to as a virtual function table (VFT). When a VFT is used as the interface between two major subsystems, such as the L3 and L4 protocol layers, or when the VFT is simply exported as an interface to a generic kernel component (set of objects), the number of function pointers in it may swell to include many different pointers that accommodate a wide range of protocols or other features. Each feature may end up using only a few of the many functions provided. You will see an example in Part VI. Of course, if this use of a VFT is taken too far, it becomes cumbersome and a major redesign is needed.
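In other words, a VFT is simply a structure made up mostly of function pointers. The following hypothetical sketch gives the flavor: each protocol or driver fills in the fields with its own routines, and generic code invokes them without knowing which implementation lies behind the pointers.

struct foo_ops {                                  /* hypothetical VFT */
    int  (*open)(struct foo *f);
    int  (*xmit)(struct foo *f, struct sk_buff *skb);
    void (*close)(struct foo *f);
};

static struct foo_ops foo_default_ops = {
    .open  = foo_generic_open,                    /* hypothetical routines */
    .xmit  = foo_generic_xmit,
    .close = foo_generic_close,
};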

goto Statements

Few C programmers like the goto statement. Without getting into the history of the goto (one of the longest and most famous controversies in computer programming), I’ll summarize some of the reasons the goto is usually deprecated, but why the Linux kernel uses it anyway.

Any piece of code that uses goto can be rewritten without it. The use of goto statements can reduce the readability of the code, and make debugging harder, because at any position following a goto you can no longer derive unequivocally the conditions that led the execution to that point.

Let me make this analogy: given any node in a tree, you know what the path from the root to the node is. But if you add vines that entwine around branches randomly, you do not always have a unique path between the root and the other nodes anymore.

However, because the C language does not provide explicit exceptions (and they are often avoided in other languages as well because of the performance hit and coding complexity), carefully placed goto statements can make it easier to jump to code that handles undesired or peculiar events. In kernel programming, and particularly in networking, such events are very common, so goto becomes a convenient tool.

I must defend the kernel’s use of goto by pointing out that developers have by no means gone wild with it. Even though there are more than 30,000 instances, they are mainly used to handle different return codes within a function, or to jump out of more than one level of nesting.
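The dominant idiom is the centralized error path: each failure jumps to a label that undoes only what has been set up so far. Here is a minimal sketch (all names are hypothetical):

int foo_setup(void)
{
    struct foo *a, *b;
    int err = -ENOMEM;

    a = kmalloc(sizeof(*a), GFP_KERNEL);
    if (!a)
        goto out;
    b = kmalloc(sizeof(*b), GFP_KERNEL);
    if (!b)
        goto out_free_a;    /* undo only the first allocation */

    return 0;

out_free_a:
    kfree(a);
out:
    return err;
}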

Vector Definitions

In some cases, the definition of a data structure includes an optional block at the end. This is an example:

struct abc {
    int age;
    char *name[20];
    ...
    char placeholder[0];
};

The optional block starts at placeholder. Note that placeholder is defined as a vector of size 0, which means it takes up no space in the structure itself. When abc is allocated with the optional block, placeholder marks the beginning of that block; when no optional block is required, placeholder simply marks the end of the structure.

Thus, if abc is used by several pieces of code, each one can use the same basic definition (avoiding the confusion of doing the same thing in slightly different ways) while extending abc differently to personalize its definition according to its needs.
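For instance, a caller that needs opt_len extra bytes would allocate the structure and its optional block in one shot, along these lines (a sketch; opt_len and opt_data are hypothetical):

struct abc *p = kmalloc(sizeof(*p) + opt_len, GFP_KERNEL);
if (p)
    memcpy(p->placeholder, opt_data, opt_len);    /* placeholder marks the start of the extra bytes */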

We will see this kind of data structure definition a few times in the book. One example is in Chapter 19.

Conditional Directives (#ifdef and family)

Conditional directives to the compiler are sometimes necessary. An excessive use of them can reduce the readability of the code, but I can state that Linux does not abuse them. They appear for different reasons, but the ones we are interested in are those used to check whether a given feature is supported by the kernel. Configuration tools such as make xconfig determine whether the feature is compiled in, not supported at all, or loadable as a module.

Examples of feature checks by #ifdef or #if defined C preprocessor directives are:

To include or exclude fields from a data structure definition
struct sk_buff {
    ...
#ifdef CONFIG_NETFILTER_DEBUG
    unsigned int nf_debug;
#endif
    ...
}

In this example, the Netfilter debugging feature requires an nf_debug field in the sk_buff structure. When the kernel does not have support for Netfilter debugging (a feature needed by only a handful of developers), there is no need to include the field, which would just take up more memory for every network packet.

To include or exclude pieces of code from a function
int ip_route_input(...)
{
    ...
        if (rth->fl.fl4_dst == daddr &&
            rth->fl.fl4_src == saddr &&
            rth->fl.iif == iif &&
            rth->fl.oif == 0 &&
#ifdef CONFIG_IP_ROUTE_FWMARK
            rth->fl.fl4_fwmark == skb->nfmark &&
#endif
            rth->fl.fl4_tos == tos) {
            ...
        }
}

The routing cache lookup routine ip_route_input, described in Chapter 33, checks the value of the tag set by the firewall only when the kernel has been compiled with support for the “IP: use netfilter MARK value as routing key” feature.

To select the right prototype for a function
#ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_table * fib_hash_init(int id)
#else
struct fib_table * __init fib_hash_init(int id)
#endif
{
    ...
}

In this example, the directives are used to add the __init tag[*] to the prototype when the kernel does not have support for Policy Routing.

To select the right definition for a function
#ifndef CONFIG_IP_MULTIPLE_TABLES
...
static inline struct fib_table *fib_get_table(int id)
{
    if (id != RT_TABLE_LOCAL)
        return ip_fib_main_table;
    return ip_fib_local_table;
}
...
#else
...
static inline struct fib_table *fib_get_table(int id)
{
    if (id == 0)
        id = RT_TABLE_MAIN;
    return fib_tables[id];
}
...
#endif

Note that this case differs from the previous one. In the previous case, the function body lies outside the #ifdef/#endif blocks, whereas in this case, each block contains a complete definition of the function.

The definition or initialization of variables and macros can also use conditional compilation.

It is important to know about the existence of multiple definitions of certain functions or macros, whose selection at compile time is based on a preprocessor macro as in the preceding examples. Otherwise, when you look for a function, variable, or macro definition, you may be looking at the wrong one.

See Chapter 7 for a discussion of how the introduction of special macros has reduced, in some cases, the use of conditional compiler directives.

Compile-Time Optimization for Condition Checks

Most of the time, when the kernel compares a variable against some external value to see whether a given condition is met, the result is extremely likely to be predictable. This is pretty common, for example, with code that enforces sanity checks. The kernel uses the likely and unlikely macros, respectively, to wrap comparisons that are likely to return a true (1) or false (0) result. Those macros take advantage of a feature of the gcc compiler that can optimize the compilation of the code based on that information.

Here is an example. Let’s suppose you need to call the do_something function, and that in case of failure, you must handle it with the handle_error function:

err = do_something(x,y,z);
if (err)
    handle_error(err);

Under the assumption that do_something rarely fails, you can rewrite the code as follows:

err = do_something(x,y,z);
if (unlikely(err))
    handle_error(err);

An example of the optimization made possible by the likely and unlikely macros is in handling options in the IP header. The use of IP options is limited to very specific cases, and the kernel can safely assume that most IP packets do not carry IP options. When the kernel forwards an IP packet, it needs to take care of options according to the rules described in Chapter 18. The last stage of forwarding an IP packet is taken care of by ip_forward_finish. This function uses the unlikely macro to wrap the condition that checks whether there is any IP option to take care of. See the section "ip_forward_finish Function" in Chapter 20.

Mutual Exclusion

Locking is used extensively in the networking code, and you are likely to see it come up as an issue under every topic in this book. Mutual exclusion, locking mechanisms, and synchronization are a general topic—and a highly interesting and complex one—for many types of programming, especially kernel programming. Linux has seen the introduction and optimization of several approaches to mutual exclusion over the years. Thus, this section merely summarizes the locking mechanisms seen in networking code; I refer you to the high-quality, detailed discussions available in O’Reilly’s Understanding the Linux Kernel and Linux Device Drivers.

Each mutual exclusion mechanism is the best choice for particular circumstances. Here is a brief summary of the alternative mutual exclusion approaches you will see often in the networking code:

Spin locks

This is a lock that can be held by only one thread of execution at a time. An attempt to acquire the lock by another thread of execution makes the latter loop until the lock is released. Because of the waste caused by looping, spin locks are used only on multiprocessor systems, and generally are used only when the developer expects the lock to be held for short intervals. Also because of the waste caused to other threads, a thread of execution must not sleep while holding a spin lock.

Read-write spin locks

When the uses of a given lock can be clearly classified as read-only and read-write, the use of read-write spin locks is preferred. The difference between spin locks and read-write spin locks is that in the latter, multiple readers can hold the lock at the same time. However, only one writer at a time can hold the lock, and no reader can acquire it when it is already held by a writer. Because readers are given priority over writers, this type of lock performs well when the number of readers (or the number of read-only lock acquisitions) is a good deal bigger than the number of writers (or the number of read-write lock acquisitions).

When the lock is acquired in read-only mode, it cannot be promoted to read-write mode directly: the lock must be released and reacquired in read-write mode.

Read-Copy-Update (RCU)

RCU is one of the latest mechanisms made available in Linux to provide mutual exclusion. It performs quite well under the following specific conditions:

  • Read-write lock requests are rare compared to read-only lock requests.

  • The code that holds the lock is executed atomically and does not sleep.

  • The data structures protected by the lock are accessed via pointers.

The first condition concerns performance, and the other two are at the base of the RCU working principle.

Note that the first condition would suggest the use of read-write spin locks as an alternative to RCU. To understand why RCU, when its use is appropriate, performs better than read-write spin locks, you need to consider other aspects, such as the effect of the processor caches on SMP systems.

The working principle behind the design of RCU is simple yet powerful. For a clear description of the advantages of RCU and a brief description of its implementation, refer to an article published by its author, Paul McKenney, in the Linux Journal (http://linuxjournal.com/article/6993).[*] You can also refer to Understanding the Linux Kernel and Linux Device Drivers.

An example where RCU is used in the networking code is the routing subsystem. Lookups are more frequent than updates on the cache, and the routine that implements the routing cache lookup does not block in the middle of the search. See Chapter 33.
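To make the three mechanisms concrete, here is a minimal sketch of their basic use; all foo_* names are hypothetical, and the static initializers are the ones used by the 2.6 kernels covered in this book:

static spinlock_t foo_lock = SPIN_LOCK_UNLOCKED;
static rwlock_t foo_rwlock = RW_LOCK_UNLOCKED;
static struct foo *foo_ptr;                 /* an RCU-protected pointer */

static void foo_update(void)
{
    spin_lock(&foo_lock);                   /* exclusive access; keep it short, never sleep */
    ...
    spin_unlock(&foo_lock);
}

static void foo_read(void)
{
    read_lock(&foo_rwlock);                 /* many readers may hold this concurrently */
    ...
    read_unlock(&foo_rwlock);
}

static void foo_rcu_read(void)
{
    struct foo *f;

    rcu_read_lock();                        /* read-side critical section: atomic, no sleeping */
    f = rcu_dereference(foo_ptr);
    if (f)
        use_foo(f);                         /* hypothetical helper */
    rcu_read_unlock();
}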

Semaphores are offered by the kernel but are rarely used in the networking code covered in this book. One example, however, is the code used to serialize configuration changes, which we will see in action in Chapter 8.

Conversions Between Host and Network Order

Data structures spanning more than one byte can be stored in memory with two different formats: Little Endian and Big Endian. The first format stores the least significant byte at the lowest memory address, and the second does the opposite. The format used by an operating system such as Linux depends on the processor in use. For example, Intel processors follow the Little Endian model, and Motorola processors use the Big Endian model.

Suppose our Linux box receives an IP packet from a remote host. Because the box does not know which format, Little Endian or Big Endian, the remote host used to initialize the protocol headers, how will it read the header? For this reason, each protocol family must define what "endianness" it uses. The TCP/IP stack, for example, follows the Big Endian model.

But this still leaves the kernel developer with a problem: she must write code that can run on many different processors with different endianness. A processor whose endianness matches that of the protocol can read the header fields directly, but the others must convert between the protocol's byte order and their own.

Therefore, every time the kernel needs to read, save, or compare a field of the IP header that spans more than one byte, it must first convert it from network byte order to host byte order or vice versa. The same applies to the other protocols of the TCP/IP stack. When both the protocol and the local host are Big Endian, the conversion routines are simply no-ops because there is no need for any conversion. They always appear in the code to make the code portable; only the conversion routines themselves are platform dependent. Table 1-2 lists the main macros used for the conversion of two-byte and four-byte fields.

Table 1-2. Byte-ordering conversion routines

    Macro    Meaning (short is 2 bytes, long is 4 bytes)
    -----    -------------------------------------------
    htons    Host-to-network byte order (short)
    htonl    Host-to-network byte order (long)
    ntohs    Network-to-host byte order (short)
    ntohl    Network-to-host byte order (long)

The macros are defined in the generic header file include/linux/byteorder/generic.h. This is how each architecture tailors the definition of those macros based on their endianness:

  • For each architecture there is a byteorder.h file in the per-architecture directory include/asm-XXX/.

  • That file includes either include/linux/byteorder/big_endian.h or include/linux/byteorder/little_endian.h, depending on the processor’s endianness.

  • Both little_endian.h and big_endian.h include the generic file include/linux/byteorder/generic.h. The definitions of the macros in Table 1-2 are based on other macros that are defined differently by little_endian.h and big_endian.h, and this is how the endianness of the architecture influences the definition of the macros of Table 1-2.

For each macro xxx in Table 1-2 there is a sister macro, __constant_xxx, that is used when the input field is a constant value, such as an element of an enumeration list (see the section "ARP Protocol Initialization" in Chapter 28 for an example). Note that the macros in Table 1-2 are commonly used in the kernel code even when their input is a constant value (see the section "Setting the Ethernet Protocol and Length" in Chapter 13 for an example).
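As a quick illustration of their use (a sketch; iph is assumed to point to an iphdr structure):

unsigned short sport = htons(80);           /* constant stored in network byte order (2 bytes) */
unsigned int daddr = ntohl(iph->daddr);     /* header field converted to host byte order (4 bytes) */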

We said earlier in the section that endianness is important when a data field spans more than one byte. Endianness is actually important also when a field of one or more bytes is defined as a collection of bitfields. See, for example, what the IPv4 header looks like in Figure 18-2 in Chapter 18, and how the kernel defines the iphdr structure in include/linux/ip.h. The kernel defines __LITTLE_ENDIAN_BITFIELD and __BIG_ENDIAN_BITFIELD, respectively, in the little_endian.h and big_endian.h files mentioned earlier.
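Here, abridged, is how include/linux/ip.h lays out the first two fields of the IPv4 header based on those symbols, so that the version and ihl bitfields end up in the right bit positions on both classes of architectures:

struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
    __u8    ihl:4,
            version:4;
#elif defined(__BIG_ENDIAN_BITFIELD)
    __u8    version:4,
            ihl:4;
#endif
    ...
};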

Catching Bugs

A few functions are supposed to be called under specific conditions, or are not supposed to be called under certain conditions. The kernel uses the BUG_ON and BUG_TRAP macros to catch cases where such conditions are not met. When the input condition to BUG_TRAP is false, the kernel prints a warning message. BUG_ON instead prints an error message and panics.
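For example (a sketch, with f being a hypothetical reference-counted structure):

BUG_ON(in_interrupt());                     /* panic if called from interrupt context */
BUG_TRAP(atomic_read(&f->refcnt) >= 0);     /* only warn if the count went negative */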

Statistics

It is a good habit for a feature to collect statistics about the occurrence of specific conditions, such as cache lookup successes and failures, memory allocation successes and failures, etc. For each networking feature that collects statistics, this book lists and describes each counter.

Measuring Time

The kernel often needs to measure how much time has passed since a given moment. For example, a routine that carries out a CPU-intensive task often releases the CPU after a given amount of time and continues its job when it is next scheduled for execution. This is especially important in kernel code, even though the kernel supports kernel preemption. A common example in the networking code is given by the routines that implement garbage collection. We will see plenty in this book.

The passing of time in kernel space is measured in ticks . A tick is the time between two consecutive expirations of the timer interrupt. The timer takes care of different tasks (we are not interested in them here) and regularly expires HZ times per second. HZ is a variable initialized by architecture-dependent code. For example, it is initialized to 1,000 on i386 machines. This means that the timer interrupt expires 1,000 times per second when Linux runs on an i386 system, and that there is one millisecond between two consecutive expirations.

Every time the timer expires it increments the global variable called jiffies. This means that at any time, jiffies represents the number of ticks since the system booted, and the generic value n*HZ represents n seconds of time.

If all a function needs is to measure the passing of time, it can save the value of jiffies into a local variable and later compare the difference between jiffies and that timestamp against a time interval (expressed in number of ticks) to see how much time has passed since measurement started.

The following example shows a function that needs to do some kind of work but does not want to hold the CPU for more than one tick. When do_something says the work is completed by setting job_done to a nonzero value, the function can return:

unsigned long start_time = jiffies;
int job_done = 0;
do {
    do_something(&job_done);
    if (job_done)
        return;
} while (jiffies - start_time < 1);

For a couple of examples involving real kernel code using jiffies, see the section "Backlog Processing: The process_backlog Poll Virtual Function" in Chapter 10, or the section "Asynchronous cleanup: the neigh_periodic_timer function" in Chapter 27.

User-Space Tools

Different tools can be used to configure the many networking features available on Linux. As mentioned at the beginning of the chapter, you can make thoughtful use of these tools to manipulate the kernel for learning purposes and to discover the effects of these changes.

The following tools are the ones I will refer to most often in this book:

iputils

Besides the perennial command ping, iputils includes arping (used to generate ARP requests), the Network Router Discovery daemon rdisc, and others.

net-tools

This is a suite of networking tools, where you can find the well-known ifconfig, route, netstat, and arp, but also ipmaddr, iptunnel, ether-wake, netplugd, etc.

IPROUTE2

This is the new-generation networking configuration suite (although it has been around for a few years already). Through an omnibus command named ip, the suite can be used to configure IP addresses, routing (along with all of its advanced features), neighboring protocols, etc.

IPROUTE2’s source code can be downloaded from http://linux-net.osdl.org/index.php/Iproute2, and the other packages can be downloaded from the download server of most Linux distributions.

These packages are included by default on most (if not all) Linux distributions. Whenever you do not understand how the kernel code processes a command from user space, I encourage you to look at the user-space tool source code and see how the command from the user is packaged and sent to the kernel.

At the following URLs, you can find good documentation on how to use the aforementioned tools, including active mailing lists:[*]

If you want to follow the latest changes in the networking code, keep an eye on the following mailing list:

Other, more specific URLs will be given in the associated chapters.

Browsing the Source Code

The Linux kernel has gotten pretty big, and browsing the code with our old friend grep is definitely not a good idea anymore. Nowadays you can count on different pieces of software to make your journey into the kernel code a better experience.

One tool that I would like to suggest to those who do not know it already is cscope, which you can download from http://cscope.sourceforge.net. It is a simple yet powerful tool for searching, for example, where a function or variable is defined, where it is called, etc. Installing the tool is straightforward, and you can find all the necessary instructions on the web site.

Each of us has his preferred editor, and probably the majority of us are fans of some form of either Emacs or vi. Both editors can use a special file called a “tags” file to let the user move through source code. (cscope also uses a similar database file.) You can easily create such files with the corresponding targets in the makefile in the kernel root tree: the three databases TAGS, tags, and cscope.out are created with make TAGS, make tags, and make cscope, respectively.[*]

Be aware that those files are pretty big, especially the one used by cscope. Therefore, make sure before building the file that you have a lot of free disk space.

If you are already using other source navigation tools, fine. But if you are not using any and have been lazy so far, it is time to say goodbye to grep and invest 15 minutes in learning how to use the aforementioned tools—they are well worth it.

Dead Code

The kernel, like any other large and dynamic piece of software, includes pieces of code that are no longer invoked. Unfortunately, you rarely see comments in the code that tell you this. You may sometimes find yourself having trouble trying to understand how a given function is used or a given variable is initialized simply because you are looking at dead code. If you are lucky, that code does not compile and you can guess its out-of-date status. Other times you may not be that lucky.

Each kernel subsystem is supposed to be assigned one or more maintainers. However, some maintainers simply have too much code to look at, and insufficient free time to do it. Other times they may have lost interest in maintaining their subsystems but could not find any substitutes for their role. It is therefore good to keep this in mind when looking at code that seems to do something strange or that simply does not adhere to general, common-sense programming rules.

In this book, I tried, whenever meaningful, to alert you about functions, variables, and data structure fields that are not used, perhaps because they were left behind when removing a feature or because they were introduced for a new feature whose coding was never completed.

When a Feature Is Offered as a Patch

The kernel networking code is continuously evolving. Not only does it integrate new features, but existing components sometimes undergo design changes to achieve more modularity and higher performance. This obviously makes Linux very attractive as an embedded operating system for network appliance products (routers, switches, firewalls, load balancers, etc.).

Because anyone can develop a new feature for the Linux kernel, or extend or reimplement an existing one, the greatest thrill for any “open” developer is to see her work make it to the official kernel release. Sometimes, however, that is not possible or it may take a long time, even when a project has valuable features and is well implemented. Common reasons include:

  • The code may not have been written following the guidelines in Documentation/CodingStyle.

  • Another major project that provides the same functionality has been around for some time and has already received the green light from the Linux community and from the key kernel developers that maintain the associated kernel area.

  • There is too much overlap with another kernel component. In a case like this, the best approach is to remove the redundant functionality and use existing functionality where possible, or to extend the latter so that it can be used in new contexts. This situation underlines the importance of modularity.

  • The size of the project and the amount of work required to maintain it in a quick-changing kernel may lead the new project’s developers to keep it as a separate patch and release a new version only once in a while.

  • The feature would be used only in very specific scenarios, considered not necessary in a general-purpose operating system. In this case, a separate patch is often the best solution.

  • The overall design may not satisfy some key kernel developers. These experts usually have the big picture in mind, concerning both where the kernel is and where it is going. Often, they request design changes to make a feature fit into the kernel the right way.

Sometimes, overlap between features is hard to remove completely, perhaps, for example, because a feature is so flexible that its different uses become apparent only after some time. For example, the firewall has hooks in several places in the network stack. This makes it unnecessary for other features to implement any filtering or marking of data packets going in any direction: they can simply rely on the firewall. Of course, this creates dependencies (for example, if the routing subsystem wants to mark traffic matching specific criteria, the kernel must include support for the firewall). Also, the firewall maintainers must be ready to accept reasonable enhancement requests when they are deemed to be required by other kernel features. However, the compromise is often worth the gain: less redundant code means fewer bugs, easier code maintenance, simplified code paths, and other benefits.

An example of a recent cleanup of feature overlap is the removal of stateless Network Address Translation (NAT) support by the routing code in version 2.6 of the kernel. The developers realized that the stateful NAT support in the firewall is more flexible, and therefore that it was no longer worthwhile maintaining stateless NAT code (although it is faster and consumes less memory). Note that a new module could be written for Netfilter at any time to provide stateless NAT support if necessary.



[*] See Chapter 7 for a description of this macro.

[*] For more documentation, you can refer to the following URL maintained by the author: http://www.rdrop.com/users/paulmck/rclock.

[*] I do not cover the firewall infrastructure design in this book, but I often show where the firewall hooks are located when analyzing various network protocols and layers.

[*] The tags and TAGS files are created with the help of the ctags utility.
