poll and select

Applications that use nonblocking I/O often use the poll and select system calls as well. poll and select have essentially the same functionality: both allow a process to determine whether it can read from or write to one or more open files without blocking. They are thus often used in applications that must use multiple input or output streams without blocking on any one of them. The same functionality is offered by two separate functions because they were implemented in Unix almost at the same time by two different groups: select was introduced in BSD Unix, whereas poll was the System V solution.

Support for either system call requires support from the device driver to function. In version 2.0 of the kernel the device method was modeled on select (and no poll was available to user programs); from version 2.1.23 onward both were offered, and the device method was based on the newly introduced poll system call because poll offered more detailed control than select.

Implementations of the poll method, implementing both the poll and select system calls, have the following prototype:

 unsigned int (*poll) (struct file *, poll_table *);

The driver’s method will be called whenever the user-space program performs a poll or select system call involving a file descriptor associated with the driver. The device method is in charge of these two steps:

  1. Call poll_wait on one or more wait queues that could indicate a change in the poll status.

  2. Return a bit mask describing operations that could be immediately performed without blocking.

Both of these operations are usually straightforward, and tend to look very similar from one driver to the next. They rely, however, on information that only the driver can provide, and thus must be implemented individually by each driver.

The poll_table structure, the second argument to the poll method, is used within the kernel to implement the poll and select calls; it is declared in <linux/poll.h>, which must be included by the driver source. Driver writers need know nothing about its internals and must use it as an opaque object; it is passed to the driver method so that every event queue that could wake up the process and change the status of the poll operation can be added to the poll_table structure by calling the function poll_wait:

 void poll_wait (struct file *, wait_queue_head_t *, poll_table *);

The second task performed by the poll method is returning the bit mask describing which operations could be completed immediately; this is also straightforward. For example, if the device has data available, a read would complete without sleeping; the poll method should indicate this state of affairs. Several flags (defined in <linux/poll.h>) are used to indicate the possible operations:

POLLIN

This bit must be set if the device can be read without blocking.

POLLRDNORM

This bit must be set if “normal” data is available for reading. A readable device returns (POLLIN | POLLRDNORM).

POLLRDBAND

This bit indicates that out-of-band data is available for reading from the device. It is currently used only in one place in the Linux kernel (the DECnet code) and is not generally applicable to device drivers.

POLLPRI

High-priority data (out-of-band) can be read without blocking. This bit causes select to report that an exception condition occurred on the file, because select reports out-of-band data as an exception condition.

POLLHUP

When a process reading this device sees end-of-file, the driver must set POLLHUP (hang-up). A process calling select will be told that the device is readable, as dictated by the select functionality.

POLLERR

An error condition has occurred on the device. When poll is invoked, the device is reported as both readable and writable, since both read and write will return an error code without blocking.

POLLOUT

This bit is set in the return value if the device can be written to without blocking.

POLLWRNORM

This bit has the same meaning as POLLOUT, and sometimes it actually is the same number. A writable device returns (POLLOUT | POLLWRNORM).

POLLWRBAND

Like POLLRDBAND, this bit means that data with nonzero priority can be written to the device. Only the datagram implementation of poll uses this bit, since a datagram can transmit out of band data.

It’s worth noting that POLLRDBAND and POLLWRBAND are meaningful only with file descriptors associated with sockets: device drivers won’t normally use these flags.

The description of poll takes up a lot of space for something that is relatively simple to use in practice. Consider the scullpipe implementation of the poll method:

unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
  Scull_Pipe *dev = filp->private_data;
  unsigned int mask = 0;

  /*
   * The buffer is circular; it is considered full
   * if "wp" is right behind "rp". "left" is 0 if the
   * buffer is empty, and it is "1" if it is completely full.
   */
  int left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;

  poll_wait(filp, &dev->inq, wait);
  poll_wait(filp, &dev->outq, wait);
  if (dev->rp != dev->wp) mask |= POLLIN | POLLRDNORM; /* readable */
  if (left != 1)     mask |= POLLOUT | POLLWRNORM; /* writable */

  return mask;
}

This code simply adds the two scullpipe wait queues to the poll_table, then sets the appropriate mask bits depending on whether data can be read or written.

The poll code as shown is missing end-of-file support. The poll method should return POLLHUP when the device is at the end of the file. If the caller used the select system call, the file will be reported as readable; in both cases the application will know that it can actually issue the read without waiting forever, and the read method will return 0 to signal end-of-file.

With real FIFOs, for example, the reader sees an end-of-file when all the writers close the file, whereas in scullpipe the reader never sees end-of-file. The behavior is different because a FIFO is intended to be a communication channel between two processes, while scullpipe is a trashcan where everyone can put data as long as there’s at least one reader. Moreover, it makes no sense to reimplement what is already available in the kernel.

Implementing end-of-file in the same way as FIFOs do would mean checking dev->nwriters, both in read and in poll, and reporting end-of-file (as just described) if no process has the device opened for writing. Unfortunately, though, if a reader opened the scullpipe device before the writer, it would see end-of-file without having a chance to wait for data. The best way to fix this problem would be to implement blocking within open; this task is left as an exercise for the reader.

Interaction with read and write

The purpose of the poll and select calls is to determine in advance if an I/O operation will block. In that respect, they complement read and write. More important, poll and select are useful because they let the application wait simultaneously for several data streams, although we are not exploiting this feature in the scull examples.

A correct implementation of the three calls is essential to make applications work correctly. Though the following rules have more or less already been stated, we’ll summarize them here.

Reading data from the device

  • If there is data in the input buffer, the read call should return immediately, with no noticeable delay, even if less data is available than the application requested and the driver is sure the remaining data will arrive soon. You can always return less data than you’re asked for if this is convenient for any reason (we did it in scull), provided you return at least one byte.

  • If there is no data in the input buffer, by default read must block until at least one byte is there. If O_NONBLOCK is set, on the other hand, read returns immediately with a return value of -EAGAIN (although some old versions of System V return 0 in this case). In these cases poll must report that the device is unreadable until at least one byte arrives. As soon as there is some data in the buffer, we fall back to the previous case.

  • If we are at end-of-file, read should return immediately with a return value of 0, independent of O_NONBLOCK. poll should report POLLHUP in this case.

Writing to the device

  • If there is space in the output buffer, write should return without delay. It can accept less data than the call requested, but it must accept at least one byte. In this case, poll reports that the device is writable.

  • If the output buffer is full, by default write blocks until some space is freed. If O_NONBLOCK is set, write returns immediately with a return value of -EAGAIN (older System V Unices returned 0). In these cases poll should report that the file is not writable. If, on the other hand, the device is not able to accept any more data, write returns -ENOSPC (“No space left on device”), independently of the setting of O_NONBLOCK.

  • Never make a write call wait for data transmission before returning, even if O_NONBLOCK is clear. This is because many applications use select to find out whether a write will block. If the device is reported as writable, the call must consistently not block. If the program using the device wants to ensure that the data it enqueues in the output buffer is actually transmitted, the driver must provide an fsync method. For instance, a removable device should have an fsync entry point.

Although these are a good set of general rules, one should also recognize that each device is unique and that sometimes the rules must be bent slightly. For example, record-oriented devices (such as tape drives) cannot execute partial writes.

Flushing pending output

We’ve seen how the write method by itself doesn’t account for all data output needs. The fsync function, invoked by the system call of the same name, fills the gap. This method’s prototype is

 int (*fsync) (struct file *file, struct dentry *dentry, int datasync);

If some application will ever need to be assured that data has been sent to the device, the fsync method must be implemented. A call to fsync should return only when the device has been completely flushed (i.e., the output buffer is empty), even if that takes some time, regardless of whether O_NONBLOCK is set. The datasync argument, present only in the 2.4 kernel, is used to distinguish between the fsync and fdatasync system calls; as such, it is only of interest to filesystem code and can be ignored by drivers.

The fsync method has no unusual features. The call isn’t time critical, so every device driver can implement it to the author’s taste. Most of the time, char drivers just have a NULL pointer in their fops. Block devices, on the other hand, always implement the method with the general-purpose block_fsync, which in turn flushes all the blocks of the device, waiting for I/O to complete.

The Underlying Data Structure

The actual implementation of the poll and select system calls is reasonably simple, for those who are interested in how it works. Whenever a user application calls either function, the kernel invokes the poll method of all files referenced by the system call, passing the same poll_table to each of them. The structure is, for all practical purposes, an array of poll_table_entry structures allocated for a specific poll or select call. Each poll_table_entry contains the struct file pointer for the open device, a wait_queue_head_t pointer, and a wait_queue_t entry. When a driver calls poll_wait, one of these entries gets filled in with the information provided by the driver, and the wait queue entry gets put onto the driver’s queue. The pointer to wait_queue_head_t is used to track the wait queue where the current poll table entry is registered, in order for free_wait to be able to dequeue the entry before the wait queue is awakened.

If none of the drivers being polled indicates that I/O can occur without blocking, the poll call simply sleeps until one of the (perhaps many) wait queues it is on wakes it up.

What’s interesting in the implementation of poll is that the file operation may be called with a NULL pointer as poll_table argument. This situation can come about for a couple of reasons. If the application calling poll has provided a timeout value of 0 (indicating that no wait should be done), there is no reason to accumulate wait queues, and the system simply does not do it. The poll_table pointer is also set to NULL immediately after any driver being polled indicates that I/O is possible. Since the kernel knows at that point that no wait will occur, it does not build up a list of wait queues.

When the poll call completes, the poll_table structure is deallocated, and all wait queue entries previously added to the poll table (if any) are removed from the table and their wait queues.

Actually, things are somewhat more complex than depicted here, because the poll table is not a simple array but rather a set of one or more pages, each hosting an array. This complication is meant to avoid putting too low a limit (dictated by the page size) on the maximum number of file descriptors involved in a poll or select system call.

We tried to show the data structures involved in polling in Figure 5-2; the figure is a simplified representation of the real data structures because it ignores the multipage nature of a poll table and disregards the file pointer that is part of each poll_table_entry. The reader interested in the actual implementation is urged to look in <linux/poll.h> and fs/select.c.

The data structures of poll

Figure 5-2. The data structures of poll

Get Linux Device Drivers, Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.