The mmap Device Operation

Memory mapping is one of the most interesting features of modern Unix systems. As far as drivers are concerned, memory mapping can be used to provide user programs with direct access to device memory.

A definitive example of mmap usage can be seen by looking at a subset of the virtual memory areas for the X Window System server:

cat /proc/731/maps
08048000-08327000 r-xp 00000000 08:01 55505    /usr/X11R6/bin/XF86_SVGA
08327000-08369000 rw-p 002de000 08:01 55505    /usr/X11R6/bin/XF86_SVGA
40015000-40019000 rw-s fe2fc000 08:01 10778    /dev/mem
40131000-40141000 rw-s 000a0000 08:01 10778    /dev/mem
40141000-40941000 rw-s f4000000 08:01 10778    /dev/mem
     ...

The full list of the X server’s VMAs is lengthy, but most of the entries are not of interest here. We do see, however, three separate mappings of /dev/mem, which give some insight into how the X server works with the video card. The first mapping shows a 16 KB region mapped at fe2fc000. This address is far above the highest RAM address on the system; it is, instead, a region of memory on a PCI peripheral (the video card). It will be a control region for that card. The middle mapping is at a0000, which is the standard location for video RAM in the 640 KB ISA hole. The last /dev/mem mapping is a rather larger one at f4000000 and is the video memory itself. These regions can also be seen in /proc/iomem:

000a0000-000bffff : Video RAM area
f4000000-f4ffffff : Matrox Graphics, Inc. MGA G200 AGP
fe2fc000-fe2fffff : Matrox Graphics, Inc. MGA G200 AGP

Mapping a device means associating a range of user-space addresses to device memory. Whenever the program reads or writes in the assigned address range, it is actually accessing the device. In the X server example, using mmap allows quick and easy access to the video card’s memory. For a performance-critical application like this, direct access makes a large difference.

As you might suspect, not every device lends itself to the mmap abstraction; it makes no sense, for instance, for serial ports and other stream-oriented devices. Another limitation of mmap is that mapping is PAGE_SIZE grained. The kernel can manage virtual addresses only at the level of page tables; therefore, the mapped area must be a multiple of PAGE_SIZE and must live in physical memory starting at an address that is a multiple of PAGE_SIZE. The kernel accommodates size granularity by making a region slightly bigger if its size isn’t a multiple of the page size.

These limits are not a big constraint for drivers, because the program accessing the device is device dependent anyway. It needs to know how to make sense of the memory region being mapped, so the PAGE_SIZE alignment is not a problem. A bigger constraint exists when ISA devices are used on some non-x86 platforms, because their hardware view of ISA may not be contiguous. For example, some Alpha computers see ISA memory as a scattered set of 8-bit, 16-bit, or 32-bit items, with no direct mapping. In such cases, you can’t use mmap at all. The inability to perform direct mapping of ISA addresses to Alpha addresses is due to the incompatible data transfer specifications of the two systems. Whereas early Alpha processors could issue only 32-bit and 64-bit memory accesses, ISA can do only 8-bit and 16-bit transfers, and there’s no way to transparently map one protocol onto the other.

There are sound advantages to using mmap when it’s feasible to do so. For instance, we have already looked at the X server, which transfers a lot of data to and from video memory; mapping the graphic display to user space dramatically improves the throughput, as opposed to an lseek/write implementation. Another typical example is a program controlling a PCI device. Most PCI peripherals map their control registers to a memory address, and a demanding application might prefer to have direct access to the registers instead of repeatedly having to call ioctl to get its work done.
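As a rough illustration of the user-space side (not taken from the sample code), such an application might map its device's register region like this; the device name, region size, and offset are hypothetical:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    volatile unsigned int *regs;
    int fd = open("/dev/mydevice", O_RDWR | O_SYNC);

    if (fd < 0)
        return 1;
    regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED)
        return 1;
    printf("status register: %08x\n", regs[0]);  /* direct access, no ioctl */
    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}

Once the mapping is established, every load and store on regs goes straight to the device; no further system calls are involved.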

The mmap method is part of the file_operations structure and is invoked when the mmap system call is issued. With mmap, the kernel performs a good deal of work before the actual method is invoked, and therefore the prototype of the method is quite different from that of the system call. This is unlike calls such as ioctl and poll, where the kernel does not do much before calling the method.

The system call is declared as follows (as described in the mmap(2) manual page):

mmap (caddr_t addr, size_t len, int prot, int flags, int fd, 
off_t offset)

On the other hand, the file operation is declared as

int (*mmap) (struct file *filp, struct vm_area_struct *vma);

The filp argument in the method is the same as that introduced in Chapter 3, while vma contains the information about the virtual address range that is used to access the device. Much of the work has thus been done by the kernel; to implement mmap, the driver only has to build suitable page tables for the address range and, if necessary, replace vma->vm_ops with a new set of operations.

There are two ways of building the page tables: doing it all at once with a function called remap_page_range, or doing it a page at a time via the nopage VMA method. Both methods have their advantages. We’ll start with the “all at once” approach, which is simpler. From there we will start adding the complications needed for a real-world implementation.

Using remap_page_range

The job of building new page tables to map a range of physical addresses is handled by remap_page_range, which has the following prototype:

int remap_page_range(unsigned long virt_add, unsigned long phys_add,
                     unsigned long size, pgprot_t prot);

The value returned by the function is the usual 0 or a negative error code. Let’s look at the exact meaning of the function’s arguments:

virt_add

The user virtual address where remapping should begin. The function builds page tables for the virtual address range between virt_add and virt_add+size.

phys_add

The physical address to which the virtual address should be mapped. The function affects physical addresses from phys_add to phys_add+size.

size

The dimension, in bytes, of the area being remapped.

prot

The “protection” requested for the new VMA. The driver can (and should) use the value found in vma->vm_page_prot.

The arguments to remap_page_range are fairly straightforward, and most of them are already provided to you in the VMA when your mmap method is called. The one complication has to do with caching: usually, references to device memory should not be cached by the processor. Often the system BIOS will set things up properly, but it is also possible to disable caching of specific VMAs via the protection field. Unfortunately, disabling caching at this level is highly processor dependent. The curious reader may wish to look at the function pgprot_noncached from drivers/char/mem.c to see what’s involved. We won’t discuss the topic further here.
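As a sketch only, a driver that wants an uncached mapping could adjust the protection before building the page tables, assuming its architecture provides a pgprot_noncached helper like the one in mem.c (phys_addr is a hypothetical placeholder for the physical address being mapped):

/* request an uncached mapping; not portable to every architecture */
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
if (remap_page_range(vma->vm_start, phys_addr,
                     vma->vm_end - vma->vm_start, vma->vm_page_prot))
    return -EAGAIN;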

A Simple Implementation

If your driver needs to do a simple, linear mapping of device memory into a user address space, remap_page_range is almost all you really need to do the job. The following code is derived from drivers/char/mem.c and shows how this task is performed in a typical module called simple (Simple Implementation Mapping Pages with Little Enthusiasm):

#include <linux/mm.h>

int simple_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;

    if (offset >= __pa(high_memory) || (filp->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;

    if (remap_page_range(vma->vm_start, offset,
                         vma->vm_end - vma->vm_start, vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}

The /dev/mem code checks to see if the requested offset (stored in vma->vm_pgoff) is beyond physical memory; if so, the VM_IO VMA flag is set to mark the area as being I/O memory. The VM_RESERVED flag is always set to keep the system from trying to swap this area out. Then it is just a matter of calling remap_page_range to create the necessary page tables.

Adding VMA Operations

As we have seen, the vm_area_struct structure contains a set of operations that may be applied to the VMA. Now we’ll look at providing those operations in a simple way; a more detailed example will follow later on.

Here, we will provide open and close operations for our VMA. These operations will be called anytime a process opens or closes the VMA; in particular, the open method will be invoked anytime a process forks and creates a new reference to the VMA. The open and close VMA methods are called in addition to the processing performed by the kernel, so they need not reimplement any of the work done there. They exist as a way for drivers to do any additional processing that they may require.

We’ll use these methods to increment the module usage count whenever the VMA is opened, and to decrement it when it’s closed. In modern kernels, this work is not strictly necessary; the kernel will not call the driver’s release method as long as a VMA remains open, so the usage count will not drop to zero until all references to the VMA are closed. The 2.0 kernel, however, did not perform this tracking, so portable code will still want to be able to maintain the usage count.

So, we will override the default vma->vm_ops with operations that keep track of the usage count. The code is quite simple—a complete mmap implementation for a modularized /dev/mem looks like the following:

void simple_vma_open(struct vm_area_struct *vma)
{ MOD_INC_USE_COUNT; }

void simple_vma_close(struct vm_area_struct *vma)
{ MOD_DEC_USE_COUNT; }

static struct vm_operations_struct simple_remap_vm_ops = {
    open:  simple_vma_open,
    close: simple_vma_close,
};

int simple_remap_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long offset = VMA_OFFSET(vma);

    if (offset >= __pa(high_memory) || (filp->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;

    if (remap_page_range(vma->vm_start, offset, vma->vm_end-vma->vm_start,
                vma->vm_page_prot))
        return -EAGAIN;

    vma->vm_ops = &simple_remap_vm_ops;
    simple_vma_open(vma);
    return 0;
}

This code relies on the fact that the kernel initializes the vm_ops field in the newly created area to NULL before calling f_op->mmap, so the method can simply install its own operations. A more defensive implementation could check that the pointer is still NULL before overwriting it, should something change in future kernels.

The strange VMA_OFFSET macro that appears in this code is used to hide a difference in the vma structure across kernel versions. Since the offset is a number of pages in 2.4 and a number of bytes in 2.2 and earlier kernels, <sysdep.h> declares the macro to make the difference transparent (and the result is expressed in bytes).
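A sketch of how such a compatibility macro can be written follows; the exact version test used by <sysdep.h> may differ, but the idea is that the result is always a byte offset:

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,3,0)
#  define VMA_OFFSET(vma)  ((vma)->vm_offset)               /* 2.2 and earlier: bytes */
#else
#  define VMA_OFFSET(vma)  ((vma)->vm_pgoff << PAGE_SHIFT)  /* 2.4: pages, converted to bytes */
#endif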

Mapping Memory with nopage

Although remap_page_range works well for many, if not most, driver mmap implementations, sometimes it is necessary to be a little more flexible. In such situations, an implementation using the nopage VMA method may be called for.

The nopage method, remember, has the following prototype:

struct page *(*nopage)(struct vm_area_struct *vma,
                       unsigned long address, int write_access);

When a user process attempts to access a page in a VMA that is not present in memory, the associated nopage function is called. The address parameter will contain the virtual address that caused the fault, rounded down to the beginning of the page. The nopage function must locate and return the struct page pointer that refers to the page the user wanted. This function must also take care to increment the usage count for the page it returns by calling the get_page macro:

 get_page(struct page *pageptr);

This step is necessary to keep the reference counts correct on the mapped pages. The kernel maintains this count for every page; when the count goes to zero, the kernel knows that the page may be placed on the free list. When a VMA is unmapped, the kernel will decrement the usage count for every page in the area. If your driver does not increment the count when adding a page to the area, the usage count will become zero prematurely and the integrity of the system will be compromised.

One situation in which the nopage approach is useful can be brought about by the mremap system call, which is used by applications to change the bounding addresses of a mapped region. If the driver wants to be able to deal with mremap, the previous implementation won’t work correctly, because there’s no way for the driver to know that the mapped region has changed.

The Linux implementation of mremap doesn’t notify the driver of changes in the mapped area. Actually, it does notify the driver if the size of the area is reduced via the unmap method, but no callback is issued if the area increases in size.

The basic idea behind notifying the driver of a reduction is that the driver (or the filesystem mapping a regular file to memory) needs to know when a region is unmapped in order to take the proper action, such as flushing pages to disk. Growth of the mapped region, on the other hand, isn’t really meaningful for the driver until the program invoking mremap accesses the new virtual addresses. In real life, it’s quite common to map regions that are never used (unused sections of program code, for example). The Linux kernel, therefore, doesn’t notify the driver if the mapped region grows, because the nopage method will take care of pages one at a time as they are actually accessed.

In other words, the driver isn’t notified when a mapping grows because nopage will do it later, without having to use memory before it is actually needed. This optimization is mostly aimed at regular files, whose mapping uses real RAM.

The nopage method, therefore, must be implemented if you want to support the mremap system call. But once you have nopage, you can choose to use it extensively, with some limitations (described later). This method is shown in the next code fragment. In this implementation of mmap, the device method only replaces vma->vm_ops. The nopage method takes care of “remapping” one page at a time and returning the address of its struct page structure. Because we are just implementing a window onto physical memory here, the remapping step is simple—we need only locate and return a pointer to the struct page for the desired address.

An implementation of /dev/mem using nopage looks like the following:

struct page *simple_vma_nopage(struct vm_area_struct *vma,
                unsigned long address, int write_access)
{
    struct page *pageptr;
    unsigned long physaddr = address - vma->vm_start + VMA_OFFSET(vma);
    pageptr = virt_to_page(__va(physaddr));
    get_page(pageptr);
    return pageptr;
}

int simple_nopage_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long offset = VMA_OFFSET(vma);

    if (offset >= __pa(high_memory) || (filp->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;

    vma->vm_ops = &simple_nopage_vm_ops;
    simple_vma_open(vma);
    return 0;
}
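The simple_nopage_vm_ops structure is not shown in this excerpt; by analogy with simple_remap_vm_ops, it would wire in the nopage method alongside open and close, roughly as follows:

static struct vm_operations_struct simple_nopage_vm_ops = {
    open:   simple_vma_open,
    close:  simple_vma_close,
    nopage: simple_vma_nopage,
};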

Since, once again, we are simply mapping main memory here, the nopage function need only find the correct struct page for the faulting address and increment its reference count. The required sequence of events is thus to calculate the desired physical address, turn it into a logical address with __va, and then finally to turn it into a struct page with virt_to_page. It would be possible, in general, to go directly from the physical address to the struct page, but such code would be difficult to make portable across architectures. Such code might be necessary, however, if one were trying to map high memory, which, remember, has no logical addresses. simple, being simple, does not worry about that (rare) case.

If the nopage method is left NULL, kernel code that handles page faults maps the zero page to the faulting virtual address. The zero page is a copy-on-write page that reads as zero and that is used, for example, to map the BSS segment. Therefore, if a process extends a mapped region by calling mremap, and the driver hasn’t implemented nopage, it will end up with zero pages instead of a segmentation fault.

The nopage method normally returns a pointer to a struct page. If, for some reason, a normal page cannot be returned (e.g., the requested address is beyond the device’s memory region), NOPAGE_SIGBUS can be returned to signal the error. nopage can also return NOPAGE_OOM to indicate failures caused by resource limitations.

Note that this implementation will work for ISA memory regions but not for those on the PCI bus. PCI memory is mapped above the highest system memory, and there are no entries in the system memory map for those addresses. Because there is thus no struct page to return a pointer to, nopage cannot be used in these situations; you must, instead, use remap_page_range.

Remapping Specific I/O Regions

All the examples we’ve seen so far are reimplementations of /dev/mem; they remap physical addresses into user space. The typical driver, however, wants to map only the small address range that applies to its peripheral device, not all of memory. In order to map to user space only a subset of the whole memory range, the driver needs only to play with the offsets. The following lines will do the trick for a driver mapping a region of simple_region_size bytes, beginning at physical address simple_region_start (which should be page aligned).

unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
unsigned long physical = simple_region_start + off;
unsigned long vsize = vma->vm_end - vma->vm_start;
unsigned long psize = simple_region_size - off;

if (vsize > psize)
    return -EINVAL; /*  spans too high */
remap_page_range(vma->vm_start, physical, vsize, vma->vm_page_prot);

In addition to calculating the offsets, this code introduces a check that reports an error when the program tries to map more memory than is available in the I/O region of the target device. In this code, psize is the physical I/O size that is left after the offset has been specified, and vsize is the requested size of virtual memory; the function refuses to map addresses that extend beyond the allowed memory range.
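For reference, these lines would normally live inside the driver’s mmap method; a sketch (the function name and the simple_region_* variables are the hypothetical names used above) might look like this:

static int simple_region_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
    unsigned long physical = simple_region_start + off;
    unsigned long vsize = vma->vm_end - vma->vm_start;
    unsigned long psize;

    if (off >= simple_region_size)
        return -EINVAL;   /* extra sanity check: offset past the region */
    psize = simple_region_size - off;
    if (vsize > psize)
        return -EINVAL;   /* spans too high */
    if (remap_page_range(vma->vm_start, physical, vsize, vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}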

Note that the user process can always use mremap to extend its mapping, possibly past the end of the physical device area. If your driver has no nopage method, it will never be notified of this extension, and the additional area will map to the zero page. As a driver writer, you may well want to prevent this sort of behavior; mapping the zero page onto the end of your region is not an explicitly bad thing to do, but it is highly unlikely that the programmer wanted that to happen.

The simplest way to prevent extension of the mapping is to implement a simple nopage method that always causes a bus signal to be sent to the faulting process. Such a method would look like this:

struct page *simple_nopage(struct vm_area_struct *vma,
                           unsigned long address, int write_access)
{ return NOPAGE_SIGBUS; /* send a SIGBUS */ }

Of course, a more thorough implementation could check to see if the faulting address is within the device area, and perform the remapping if that is the case. Once again, however, nopage will not work with PCI memory areas, so extension of PCI mappings is not possible.

Remapping RAM

An interesting limitation of remap_page_range is that it gives access only to reserved pages and physical addresses above the top of physical memory. In Linux, a page of physical addresses is marked as "reserved" in the memory map to indicate that it is not available for memory management. On the PC, for example, the range between 640 KB and 1 MB is marked as reserved, as are the pages that host the kernel code itself. Reserved pages are locked in memory and are the only ones that can be safely mapped to user space; this limitation is a basic requirement for system stability.

Therefore, remap_page_range won’t allow you to remap conventional addresses—which include the ones you obtain by calling get_free_page. Instead, it will map in the zero page. Nonetheless, the function does everything that most hardware drivers need it to, because it can remap high PCI buffers and ISA memory.

The limitations of remap_page_range can be seen by running mapper, one of the sample programs in misc-progs in the files provided on the O’Reilly FTP site. mapper is a simple tool that can be used to quickly test the mmap system call; it maps read-only parts of a file based on the command-line options and dumps the mapped region to standard output. The following session, for instance, shows that /dev/mem doesn’t map the physical page located at address 64 KB—instead we see a page full of zeros (the host computer in this example is a PC, but the result would be the same on other platforms):

morgana.root# ./mapper /dev/mem 0x10000 0x1000 | od -Ax -t x1
mapped "/dev/mem" from 65536 to 69632
000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
001000

The inability of remap_page_range to deal with RAM suggests that a device like scullp can’t easily implement mmap, because its device memory is conventional RAM, not I/O memory. Fortunately, a relatively easy workaround is available to any driver that needs to map RAM into user space; it uses the nopage method that we have seen earlier.

Remapping RAM with the nopage method

The way to map real RAM to user space is to use vm_ops->nopage to deal with page faults one at a time. A sample implementation is part of the scullp module, introduced in Chapter 7.

scullp is the page-oriented char device. Because it is page oriented, it can implement mmap on its memory. The code implementing memory mapping uses some of the concepts introduced earlier in Section 13.1.

Before examining the code, let’s look at the design choices that affect the mmap implementation in scullp.

  • scullp doesn’t release device memory as long as the device is mapped. This is a matter of policy rather than a requirement, and it is different from the behavior of scull and similar devices, which are truncated to a length of zero when opened for writing. Refusing to free a mapped scullp device allows a process to overwrite regions actively mapped by another process, so you can test and see how processes and device memory interact. To avoid releasing a mapped device, the driver must keep a count of active mappings; the vmas field in the device structure is used for this purpose.

  • Memory mapping is performed only when the scullp order parameter is 0. The parameter controls how get_free_pages is invoked (see Chapter 7, Section 7.3). This choice is dictated by the internals of get_free_pages, the allocation engine exploited by scullp. To maximize allocation performance, the Linux kernel maintains a list of free pages for each allocation order, and only the page count of the first page in a cluster is incremented by get_free_pages and decremented by free_pages. The mmap method is disabled for a scullp device if the allocation order is greater than zero, because nopage deals with single pages rather than clusters of pages. (Return to Section 7.3.1 in Chapter 7 if you need a refresher on scullp and the memory allocation order value.)

The last choice is mostly intended to keep the code simple. It is possible to correctly implement mmap for multipage allocations by playing with the usage count of the pages, but it would only add to the complexity of the example without introducing any interesting information.

Code that is intended to map RAM according to the rules just outlined needs to implement open, close, and nopage; it also needs to access the memory map to adjust the page usage counts.

This implementation of scullp_mmap is very short, because it relies on the nopage function to do all the interesting work:

int scullp_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct inode *inode = INODE_FROM_F(filp);

    /* refuse to map if order is not 0 */
    if (scullp_devices[MINOR(inode->i_rdev)].order)
        return -ENODEV;

    /* don't do anything here: "nopage" will fill the holes */
    vma->vm_ops = &scullp_vm_ops;
    vma->vm_flags |= VM_RESERVED;
    vma->vm_private_data = scullp_devices + MINOR(inode->i_rdev);
    scullp_vma_open(vma);
    return 0;
}

The purpose of the leading conditional is to avoid mapping devices whose allocation order is not 0. scullp’s operations are stored in the vm_ops field, and a pointer to the device structure is stashed in the vm_private_data field. At the end, vm_ops->open is called to update the usage count for the module and the count of active mappings for the device.
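The scullp_vm_ops structure installed here is not reproduced in the listing; consistent with the earlier examples, it would tie the three VMA methods together roughly as follows:

struct vm_operations_struct scullp_vm_ops = {
    open:   scullp_vma_open,
    close:  scullp_vma_close,
    nopage: scullp_vma_nopage,
};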

open and close simply keep track of these counts and are defined as follows:

void scullp_vma_open(struct vm_area_struct *vma)
{
    ScullP_Dev *dev = scullp_vma_to_dev(vma);

    dev->vmas++;
    MOD_INC_USE_COUNT;
}

void scullp_vma_close(struct vm_area_struct *vma)
{
    ScullP_Dev *dev = scullp_vma_to_dev(vma);

    dev->vmas--;
    MOD_DEC_USE_COUNT;
}

The function scullp_vma_to_dev simply returns the contents of the vm_private_data field. It exists as a separate function because kernel versions prior to 2.4 lacked that field, requiring that other means be used to get that pointer. See Section 13.5 at the end of this chapter for details.
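On 2.4, where vm_private_data is available, the helper reduces to a trivial accessor; a sketch:

ScullP_Dev *scullp_vma_to_dev(struct vm_area_struct *vma)
{
    return (ScullP_Dev *) vma->vm_private_data;
}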

Most of the work is then performed by nopage. In the scullp implementation, the address parameter to nopage is used to calculate an offset into the device; the offset is then used to look up the correct page in the scullp memory tree.

struct page *scullp_vma_nopage(struct vm_area_struct *vma,
                                unsigned long address, int write)
{
    unsigned long offset;
    ScullP_Dev *ptr, *dev = scullp_vma_to_dev(vma);
    struct page *page = NOPAGE_SIGBUS;
    void *pageptr = NULL; /* default to "missing" */

    down(&dev->sem);
    offset = (address - vma->vm_start) + VMA_OFFSET(vma);
    if (offset >= dev->size) goto out; /* out of range */

    /*
     * Now retrieve the scullp device from the list, then the page.
     * If the device has holes, the process receives a SIGBUS when
     * accessing the hole.
     */
    offset >>= PAGE_SHIFT; /* offset is a number of pages */
    for (ptr = dev; ptr && offset >= dev->qset;) {
        ptr = ptr->next;
        offset -= dev->qset;
    }
    if (ptr && ptr->data) pageptr = ptr->data[offset];
    if (!pageptr) goto out; /* hole or end-of-file */
    page = virt_to_page(pageptr);
    
    /* got it, now increment the count */
    get_page(page);
out:
    up(&dev->sem);
    return page;
}

scullp uses memory obtained with get_free_pages. That memory is addressed using logical addresses, so all scullp_nopage has to do to get a struct page pointer is to call virt_to_page.

The scullp device now works as expected, as you can see in this sample output from the mapper utility. Here we send a directory listing of /dev (which is long) to the scullp device, and then use the mapper utility to look at pieces of that listing with mmap.

morgana% ls -l /dev > /dev/scullp
morgana% ./mapper /dev/scullp 0 140
mapped "/dev/scullp" from 0 to 140
total 77
-rwxr-xr-x    1 root     root        26689 Mar  2  2000 MAKEDEV
crw-rw-rw-    1 root     root      14,  14 Aug 10 20:55 admmidi0
morgana% ./mapper /dev/scullp 8192 200
mapped "/dev/scullp" from 8192 to 8392
0
crw-------    1 root     root     113,   1 Mar 26  1999 cum1
crw-------    1 root     root     113,   2 Mar 26  1999 cum2
crw-------    1 root     root     113,   3 Mar 26  1999 cum3

Remapping Virtual Addresses

Although it’s rarely necessary, it’s interesting to see how a driver can map a virtual address to user space using mmap. A true virtual address, remember, is an address returned by a function like vmalloc or kmap—that is, a virtual address mapped in the kernel page tables. The code in this section is taken from scullv, which is the module that works like scullp but allocates its storage through vmalloc.

Most of the scullv implementation is like the one we’ve just seen for scullp, except that there is no need to check the order parameter that controls memory allocation. The reason for this is that vmalloc allocates its pages one at a time, because single-page allocations are far more likely to succeed than multipage allocations. Therefore, the allocation order problem doesn’t apply to vmalloced space.

Most of the work of vmalloc is building page tables to access allocated pages as a continuous address range. The nopage method, instead, must pull the page tables back apart in order to return a struct page pointer to the caller. Therefore, the nopage implementation for scullv must scan the page tables to retrieve the page map entry associated with the page.

The function is similar to the one we saw for scullp, except at the end. This code excerpt only includes the part of nopage that differs from scullp:

pgd_t *pgd; pmd_t *pmd; pte_t *pte;
unsigned long lpage;

/*
 * After scullv lookup, "page" is now the address of the page
 * needed by the current process. Since it's a vmalloc address,
 * first retrieve the unsigned long value to be looked up
 * in page tables.
 */
lpage = VMALLOC_VMADDR(pageptr);
spin_lock(&init_mm.page_table_lock);
pgd = pgd_offset(&init_mm, lpage);
pmd = pmd_offset(pgd, lpage);
pte = pte_offset(pmd, lpage);
page = pte_page(*pte);
spin_unlock(&init_mm.page_table_lock);
    
/* got it, now increment the count */
get_page(page);
out:
up(&dev->sem);
return page;

The page tables are looked up using the functions introduced at the beginning of this chapter. The page directory used for this purpose is stored in the memory structure for kernel space, init_mm. Note that scullv obtains the page_table_lock prior to traversing the page tables. If that lock were not held, another processor could make a change to the page table while scullv was halfway through the lookup process, leading to erroneous results.

The macro VMALLOC_VMADDR(pageptr) returns the correct unsigned long value to be used in a page-table lookup from a vmalloc address. A simple cast of the value wouldn’t work on the x86 with kernels older than 2.1, because of a glitch in memory management. Memory management for the x86 changed in version 2.1.1, and VMALLOC_VMADDR is now defined as the identity function, as it has always been for the other platforms. Its use is still suggested, however, as a way of writing portable code.
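On those platforms the definition amounts to a simple cast, something like:

#define VMALLOC_VMADDR(x) ((unsigned long)(x))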

Based on this discussion, you might also want to map addresses returned by ioremap to user space. This mapping is easily accomplished because you can use remap_page_range directly, without implementing methods for virtual memory areas. In other words, remap_page_range is already usable for building new page tables that map I/O memory to user space; there’s no need to look in the kernel page tables built by vremap as we did in scullv.
