-
One of the major advantages of virtual memory is that each process has its own virtual address space, mapped to physical memory by the operating system.
-
This chapter explores the address space as seen by a process, and how it is managed by linux.
-
The kernel treats the userspace portion of the address space very differently from the kernel portion. For example, allocations for the kernel are satisfied immediately and are visible globally, no matter which process is current.
-
An exception to this, however, is vmalloc() (and consequently __vmalloc()) - it causes a minor page fault to occur to synchronise the process page tables with the reference page tables, though the page will still be allocated immediately upon request.
-
For a process, space is simply reserved in the linear address space by pointing a page table entry to a read-only globally visible page filled with zeros.
-
When the process tries to write to this page, a page fault is triggered, causing the kernel to allocate a new zeroed page, assign it to the PTE and mark it writeable. It's zeroed so it appears precisely the same as the global zero-filled page.
-
The userspace portion of virtual memory is not trusted nor presumed constant. After each context switch, the userspace portion of the linear address space can change except when a 'Lazy TLB' switch is used (discussed in 4.3.)
-
As a result, the kernel has to be configured to catch all exceptions and address errors raised from userspace (discussed in 4.5.)
-
From a user perspective, the address space is a flat, linear address space. The kernel's view is rather different - the address space is split between userspace which potentially changes on context switch and the kernel address space which remains constant.
-
The split is determined by the value of PAGE_OFFSET (== __PAGE_OFFSET) - 0xc0000000 on i386 - meaning that 3GiB is available for the process to use while the remaining 1GiB is always mapped by the kernel.
-
Diagrammatically, the linear address space looks as follows:
0 -> |-----------------| ^ ^
| Process | | |
| Address | | |
| Space | | |
/ . / | TASK_SIZE |
\ . \ | |
/ . / | |
| | | |
PAGE_OFFSET -> |-----------------| X |
| Kernel | | Physical |
| Image | | Memory Map |
|-----------------| | |
| struct page | | (Depends on # | Linear Address
| Map (mem_map) | | physical RAM) | Space
|-----------------| ^ v | (2^BITS_PER_LONG bytes)
| Pages Gap | | VMALLOC_OFFSET |
VMALLOC_START -> |-----------------| v ^ |
| vmalloc | | |
| Address Space | | |
VMALLOC_END -> |-----------------| ^ | |
| Pages Gap | | 2 * PAGE_SIZE | |
PKMAP_BASE -> |-----------------| X | |
| kmap | | LAST_PKMAP * | VMALLOC_RESERVE |
| Address Space | | PAGE_SIZE | at minimum |
FIXADDR_START -> |-----------------| X | |
| Fixed Virtual | | __FIXADDR_SIZE | |
| Address Mapping | | | |
FIXADDR_TOP -> |-----------------| v | |
| Page Gap | | |
|-----------------| v v
-
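To make the split concrete, here is a sketch of the relevant i386 definitions, modelled on the 2.4 headers (quoted from memory, so treat the exact spellings as approximate):

```c
/* The kernel/userspace split on i386 (2.4-era definitions.) */
#define __PAGE_OFFSET           (0xC0000000)
#define PAGE_OFFSET             ((unsigned long)__PAGE_OFFSET)

/* Userspace may use addresses in [0, TASK_SIZE) - i.e. the lower 3GiB. */
#define TASK_SIZE               (PAGE_OFFSET)
```
-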
To load the kernel image, 8MiB (the amount of memory addressed by two PGDs) is reserved at `PAGE_OFFSET`. The kernel image is placed in this reserved space during kernel page table initialisation as discussed in 3.6.1.
-
Somewhere shortly after the image, the `mem_map` for UMA architectures (as discussed in chapter 2) is stored. The location is usually at the 16MiB mark to avoid using `ZONE_DMA`, but not always.
-
For NUMA architectures, portions of the virtual `mem_map` will be scattered throughout this region and where they are actually located is architecture dependent.
-
The region between `PAGE_OFFSET` and `VMALLOC_START - VMALLOC_OFFSET` is the 'physical memory map' and the size of the region depends on the amount of available physical RAM.
-
Between the physical memory map and the vmalloc address space there is a gap `VMALLOC_OFFSET` in size (8MiB on i386) used to guard against out-of-bound errors.
-
As an example, an i386 system with 32MiB of RAM will have `VMALLOC_START` located at `PAGE_OFFSET + 0x02000000 + 0x00800000` (i.e. `PAGE_OFFSET` + 32MiB + 8MiB.)
-
In low-memory systems, the remaining amount of the virtual address space, minus a 2 page gap, is used by vmalloc() for representing non-contiguous memory allocations in a contiguous virtual address space.
-
In high-memory systems, the `vmalloc` area extends as far as `PKMAP_BASE` minus the two-page gap, and two extra regions are introduced - kmap and fixed virtual address mappings.
-
The `kmap` region, which begins at `PKMAP_BASE`, is reserved for the mapping of high memory pages into low memory via kmap() (and subsequently __kmap().) We'll go into this in more detail in chapter 9.
-
The fixed virtual address mapping region, which begins at FIXADDR_START and ends at FIXADDR_TOP, is used by subsystems that need to know the virtual address of a mapping at compile time, e.g. APIC mappings.
-
On i386, `FIXADDR_TOP` is statically defined to be `0xffffe000`, which is one page prior to the end of the virtual address space. The size of this region is calculated at compile time via `__FIXADDR_SIZE` and used to index back from `FIXADDR_TOP` to give the start of the region, `FIXADDR_START`.
-
The region required for vmalloc(), kmap() and the fixed virtual address mappings is what limits the size of `ZONE_NORMAL`.
-
As the running kernel requires these functions, a region of at least VMALLOC_RESERVE (which is aliased to __VMALLOC_RESERVE) is reserved at the top of the address space.
-
`VMALLOC_RESERVE` is architecture-dependent, but on i386 it's defined as 128MiB. This explains why `ZONE_NORMAL` is generally considered to be only 896MiB in size - it's the 1GiB of the upper portion of the linear address space minus the minimum 128MiB that is reserved for the vmalloc region.
-
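The 896MiB figure falls out of simple arithmetic. A sketch based on the 2.4 i386 headers (the exact MAXMEM definition is quoted from memory, so treat it as approximate):

```c
#define __VMALLOC_RESERVE       (128 << 20)     /* 128MiB */

/*
 * The maximum directly-mapped physical memory. With 32-bit unsigned
 * arithmetic, -__PAGE_OFFSET == 4GiB - 3GiB == 1GiB, so:
 *
 *   MAXMEM = 1GiB - 128MiB = 896MiB = 0x38000000
 */
#define MAXMEM                  (-__PAGE_OFFSET - __VMALLOC_RESERVE)
```
-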
The address space that is usable by a process is managed by a high-level structure - the mm_struct.
-
Each address space consists of a number of page-aligned regions of memory that are in use.
-
They never overlap, and represent a set of addresses which contain pages that are related to each other in protection and purpose.
-
The regions are represented by a struct vm_area_struct:
```c
/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
        struct mm_struct * vm_mm;       /* The address space we belong to. */
        unsigned long vm_start;         /* Our start address within vm_mm. */
        unsigned long vm_end;           /* The first byte after our end address
                                           within vm_mm. */

        /* linked list of VM areas per task, sorted by address */
        struct vm_area_struct *vm_next;

        pgprot_t vm_page_prot;          /* Access permissions of this VMA. */
        unsigned long vm_flags;         /* Flags, listed below. */

        rb_node_t vm_rb;

        /*
         * For areas with an address space and backing store,
         * one of the address_space->i_mmap{,shared} lists,
         * for shm areas, the list of attaches, otherwise unused.
         */
        struct vm_area_struct *vm_next_share;
        struct vm_area_struct **vm_pprev_share;

        /* Function pointers to deal with this struct. */
        struct vm_operations_struct * vm_ops;

        /* Information about our backing store: */
        unsigned long vm_pgoff;         /* Offset (within vm_file) in PAGE_SIZE
                                           units, *not* PAGE_CACHE_SIZE */
        struct file * vm_file;          /* File we map to (can be NULL). */
        unsigned long vm_raend;         /* XXX: put full readahead info here. */
        void * vm_private_data;         /* was vm_pte (shared mem) */
};
```
-
A region might represent the process heap for use with `malloc()`, a memory mapped file such as a shared library or some `mmap()`-ed memory.
-
The pages for the region might be active and resident, paged out or even yet to be allocated.
-
If a region is backed by a file, its `vm_file` field will be set. By traversing `vm_file->f_dentry->d_inode->i_mapping`, the associated `address_space` for the region may be obtained. The `address_space` has all the filesystem-specific information needed to perform page-based operations on disk.
-
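For example, a minimal sketch of that traversal (assuming `vma` is a file-backed `struct vm_area_struct *`, using the 2.4-era field names):

```c
struct address_space *mapping = NULL;

if (vma->vm_file)
        /* 2.4: file -> dentry -> inode -> address_space */
        mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
```
-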
The relationship between different address space-related structures represented diagrammatically:
|------------------| |------------------| |------------------|
- -> | struct mm_struct | ---> | struct mm_struct | ---> | struct mm_struct | - ->
|------------------| |------------------| |------------------|
mmap | ^
| |
/---------------------/ \---------------------\
v | vm_mm
|-----------------------| vm_next |-----------------------| vm_next
| struct vm_area_struct | -------------------> | struct vm_area_struct | - ->
|-----------------------| |-----------------------|
^ | vm_file | vm_ops
| | |
| \-----------------\ |
| | |
| v |
| |-------------| |
| | struct file | |
| |-------------| |
| f_dentry | v
| | |-----------------------------|
| | | struct vm_operations_struct |
| | |-----------------------------|
| v
| |---------------|
| | struct dentry |
| |---------------|
| ^ | d_inode
| | |
| i_dentry | v
| |--------------|
| | struct inode |
| |--------------|
| | i_mapping
| |
| v
| |----------------------|
| | struct address_space |
| |----------------------|
| i_mmap | ^ | a_ops
\------------/ | \-----------\
| v
| |----------------------------------|
| | struct address_space_operations |
| |----------------------------------|
|
/-------X--------\- - -
| |
mapping | | mapping
|-------------| |-------------|
| struct page | | struct page |
|-------------| |-------------|
- There are a number of system calls that affect address space and regions:
-
fork() - Creates a new process with a new address space. All the pages are marked Copy-On-Write (COW) and are shared between the two processes until a page fault occurs. Once a write-fault occurs a copy is made of the COW page for the faulting process. This is sometimes referred to as 'breaking a COW page' or a 'COW break'.
-
clone() - Similar to `fork()`, however allows context to be shared with its parent if the `CLONE_VM` flag is set - this is how linux implements threading.
-
mmap() - Creates a new region within the process linear address space.
-
mremap() - Remaps or resizes a region of memory. If the virtual address space is not available for mapping, the region may be moved, unless forbidden by the caller.
-
munmap() - Destroys part or all of a region. If the region being unmapped is in the middle of an existing region, the existing region is split into two separate regions.
-
shmat() - Attaches a shared memory segment to a process address space.
-
shmdt() - Removes a shared memory segment from a process address space.
-
execve() - Loads a new executable file and replaces the existing address space.
-
exit() - Destroys an address space and all its regions.
-
The process address space is described by struct mm_struct, meaning that there is only one for each process and it is shared between userspace threads.
-
Threads are in fact identified in the task list by finding all struct task_structs that have pointers to the same `mm_struct`.
-
Each `task_struct` contains all the information the kernel needs about a process.
-
A unique `mm_struct` is not needed for kernel threads because they will never page fault or access the userspace portion (LS - really? Surely `copy_to/from_user()`?)
-
An exception to this is page faulting within the `vmalloc` space - the page fault handling code treats this as a special case and updates the current page table with information in the master page table from init_mm.
-
Since an `mm_struct` is therefore not needed for kernel threads, the `task_struct->mm` field for kernel threads is always `NULL`.
-
For some tasks, such as the boot idle task, the `mm_struct` is never set up, but, for kernel threads, a call to daemonize() (LS - no longer a function in recent kernels) will call exit_mm() (an alias to __exit_mm()) to decrement the usage counter.
-
Because TLB flushes are extremely expensive (though read a message from Linus on TLB fill performance for some interesting information on Intel's optimisations in this area), esp. for architectures such as PPC, a technique called 'lazy TLB' is employed.
-
Lazy TLB avoids unnecessary TLB flushes for processes that do not access the userspace page tables, because the kernel portion of the address space is always visible. The call to switch_mm(), which ordinarily results in a TLB flush, is avoided by borrowing the `mm_struct` used by the previous task and placing it in `task_struct->active_mm`. This technique has made large improvements to context switch times.
-
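A condensed sketch of the borrowing logic, modelled on 2.4's schedule() (simplified, not the literal code):

```c
if (!next->mm) {
        /* Kernel thread - borrow the previous task's active_mm. */
        next->active_mm = prev->active_mm;
        atomic_inc(&prev->active_mm->mm_count);
        enter_lazy_tlb(prev->active_mm, next, cpu);
} else {
        /* Ordinary process - a real address space switch (may flush the TLB.) */
        switch_mm(prev->active_mm, next->mm, next, cpu);
}
```
-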
When entering a lazy TLB, the function enter_lazy_tlb() is called to ensure that an `mm_struct` is not shared between processors in SMP machines - it's a null operation on UP machines.
-
During process exit start_lazy_tlb() is used briefly while the process is waiting to be reaped by the parent.
-
Let's take a look at struct mm_struct again:
```c
struct mm_struct {
        struct vm_area_struct * mmap;           /* list of VMAs */
        rb_root_t mm_rb;
        struct vm_area_struct * mmap_cache;     /* last find_vma result */
        pgd_t * pgd;
        atomic_t mm_users;                      /* How many users with user space? */
        atomic_t mm_count;                      /* How many references to "struct mm_struct" (users count as 1) */
        int map_count;                          /* number of VMAs */
        struct rw_semaphore mmap_sem;
        spinlock_t page_table_lock;             /* Protects task page tables and mm->rss */

        struct list_head mmlist;                /* List of all active mm's. These are globally strung
                                                 * together off init_mm.mmlist, and are protected
                                                 * by mmlist_lock
                                                 */

        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack;
        unsigned long arg_start, arg_end, env_start, env_end;
        unsigned long rss, total_vm, locked_vm;
        unsigned long def_flags;
        unsigned long cpu_vm_mask;
        unsigned long swap_address;

        unsigned dumpable:1;

        /* Architecture-specific MM context */
        mm_context_t context;
};
```
- Looking at each field:
- `mmap` - The head of a linked list of all VMA regions in the address space.
- `mm_rb` - The VMAs are arranged in a linked list and in a red-black tree for fast lookups - this is the head of the tree.
- `mmap_cache` - The VMA found during the last call to find_vma() is stored in this field, on the (usually fairly safe) assumption that the area will be used again soon.
- `pgd` - The PGD for this process.
- `mm_users` - Reference count of processes accessing the userspace portion of this `mm_struct` such as page tables and file mappings. Threads and the swap_out() code, for example, will increment this count and make sure an `mm_struct` is not destroyed early. When it drops to 0, exit_mmap() will delete all (userspace) mappings and tear down the page tables before decrementing `mm_count`.
- `mm_count` - Reference count of the 'anonymous users' for the `mm_struct`, initialised at 1 for the real user. An anonymous user is one that does not necessarily care about the userspace portion and is just 'borrowing' the `mm_struct`, for example kernel threads that use lazy TLB switching. When this count drops to 0, the `mm_struct` can be safely destroyed. We need this reference count as well as `mm_users` because anonymous users need the `mm_struct` to exist even if the userspace mappings get destroyed, and there is no point in delaying the teardown of the page tables.
- `map_count` - Number of VMAs in use.
- `mmap_sem` - A long-lived lock that protects the VMA list for readers and writers. Since users of this lock need it for a long time and may sleep, a spinlock is inappropriate. A reader of the list takes this semaphore with down_read(), and a writer with down_write() before taking the `page_table_lock` spinlock when the VMA linked lists are being updated.
- `page_table_lock` - Protects most fields on the `mm_struct`. As well as the page tables, it protects the Resident Set Size (RSS) count, `rss`, and the VMA from modification.
- `mmlist` - A struct list_head entry in the list of all `mm_struct`s in the system, which are strung together off `init_mm.mmlist` and protected by mmlist_lock.
- `start_code`, `end_code` - [Start, end) address for code section.
- `start_data`, `end_data` - [Start, end) address for data section.
- `start_brk`, `brk` - [Start, end) address of the heap.
- `start_stack` - Surprisingly, the address at the start of the stack region.
- `arg_start`, `arg_end` - [Start, end) address for command-line arguments.
- `env_start`, `env_end` - [Start, end) address for environment variables.
- `rss` - Resident Set Size (RSS) is the number of resident pages for this process. The global zero page is not accounted for by RSS.
- `total_vm` - The total memory space occupied by all VMA regions in this process.
- `locked_vm` - The number of resident pages locked in memory.
- `def_flags` - This has only one possible value if set - `VM_LOCKED` - and is used to determine whether all future mappings are locked by default.
- `cpu_vm_mask` - A bitmask representing all possible CPUs in an SMP system. It's used by an Inter-Processor Interrupt (IPI) to determine if a processor should execute a particular function or not. This matters during TLB flush for each CPU.
- `swap_address` - Used by the pageout daemon to record the last address that was swapped from when swapping out entire processes.
- `dumpable` - Set via the prctl() userland function/system call - it's only meaningful when a process is being traced.
- `context` - Architecture-specific MMU context.
- There are a number of functions which interact with `mm_struct`s:
- mm_init() - Initialises an `mm_struct` by setting starting values for each field, allocating a PGD, initialising spinlocks, etc.
- allocate_mm() - Allocates an `mm_struct` from the slab allocator (see chapter 8 for more on this.)
- mm_alloc() - Allocates an `mm_struct` using `allocate_mm()` and calls `mm_init()` to initialise it.
- exit_mmap() - Walks through an `mm_struct` and unmaps all VMAs associated with it.
- copy_mm() - Makes an exact copy of the current task's `mm_struct` for a new task. This is only used during `fork`.
- free_mm() - Returns the `mm_struct` to the slab allocator.
- Two functions are available to allocate an `mm_struct` - allocate_mm() is simply a pre-processor macro which allocates an `mm_struct` from the slab allocator, whereas mm_alloc() calls `allocate_mm()` and then initialises the result via mm_init().
-
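A sketch of the pair, modelled on 2.4's kernel/fork.c (quoted from memory, so treat details as approximate):

```c
#define allocate_mm()   (kmem_cache_alloc(mm_cachep, SLAB_KERNEL))

struct mm_struct *mm_alloc(void)
{
        struct mm_struct *mm = allocate_mm();

        if (mm) {
                memset(mm, 0, sizeof(*mm));
                /* mm_init() returns mm on success, NULL on failure. */
                return mm_init(mm);
        }
        return NULL;
}
```
-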
The first `mm_struct` in the system that is initialised is called init_mm. All subsequent `mm_struct`s are copies of a parent `mm_struct`, meaning that `init_mm` has to be statically initialised at compile time.
-
The static initialisation is performed by the macro INIT_MM():
```c
#define INIT_MM(name)                                           \
{                                                               \
        mm_rb:           RB_ROOT,                               \
        pgd:             swapper_pg_dir,                        \
        mm_users:        ATOMIC_INIT(2),                        \
        mm_count:        ATOMIC_INIT(1),                        \
        mmap_sem:        __RWSEM_INITIALIZER(name.mmap_sem),    \
        page_table_lock: SPIN_LOCK_UNLOCKED,                    \
        mmlist:          LIST_HEAD_INIT(name.mmlist),           \
}
```
- After it is established, new `mm_struct`s are created using their parent `mm_struct` as a template via copy_mm(), which uses init_mm() to initialise process-specific fields.
-
When a new user increments the `mm_struct`'s usage count (i.e. `mm_users`), this is done with a simple atomic increment, e.g. `atomic_inc(&mm->mm_users)`.
-
However, when this field is decremented, this is done via mmput() so that, if the count reaches zero, the appropriate data can be properly torn down.
-
When `mm_users` reaches 0, the mapped regions are destroyed via exit_mmap(), then the page tables are destroyed because there are no longer any users of the userspace portions. The `mm_count` is decremented via mmdrop(), because all users of the page tables and VMAs are counted as a single `mm_struct` user.
-
As discussed above, when `mm_count` reaches 0, the `mm_struct` will be destroyed.
-
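A condensed sketch of that teardown ordering, modelled on 2.4's mmput() (simplified - the real function also maintains swap-related state):

```c
void mmput(struct mm_struct *mm)
{
        /* Last user of the userspace portion? */
        if (atomic_dec_and_lock(&mm->mm_users, &mmlist_lock)) {
                list_del(&mm->mmlist);  /* unlink from init_mm.mmlist */
                spin_unlock(&mmlist_lock);
                exit_mmap(mm);          /* unmap VMAs, free page tables */
                mmdrop(mm);             /* drop the mm_count reference */
        }
}
```
-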
The full address space of a process is rarely used, only sparse regions are.
-
As discussed previously, each region is represented by a struct vm_area_struct which never overlaps and represents a set of addresses with the same protection and purpose, for example a read-only shared library loaded into the address space, or the process heap.
-
A full list of mapped regions that a process has may be viewed via `/proc/<PID>/maps`.
-
A region might have a number of different structures associated with it as shown in the diagram from earlier in the chapter, with the `vm_area_struct` containing all the information associated with a region.
-
If a region is backed by a file, the struct file is available via `vm_file`, which subsequently has a pointer to the file's struct inode, which can be used to access struct address_space:
```c
struct address_space {
        struct list_head        clean_pages;    /* list of clean pages */
        struct list_head        dirty_pages;    /* list of dirty pages */
        struct list_head        locked_pages;   /* list of locked pages */
        unsigned long           nrpages;        /* number of total pages */
        struct address_space_operations *a_ops; /* methods */
        struct inode            *host;          /* owner: inode, block_device */
        struct vm_area_struct   *i_mmap;        /* list of private mappings */
        struct vm_area_struct   *i_mmap_shared; /* list of shared mappings */
        spinlock_t              i_shared_lock;  /* and spinlock protecting it */
        int                     gfp_mask;       /* how to allocate the pages */
};
```
-
This structure contains all the private information about the file, including filesystem functions that perform filesystem-specific operations such as reading and writing pages to disk (i.e. `a_ops`.)
-
Taking another look at struct vm_area_struct:
```c
/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
        struct mm_struct * vm_mm;       /* The address space we belong to. */
        unsigned long vm_start;         /* Our start address within vm_mm. */
        unsigned long vm_end;           /* The first byte after our end address
                                           within vm_mm. */

        /* linked list of VM areas per task, sorted by address */
        struct vm_area_struct *vm_next;

        pgprot_t vm_page_prot;          /* Access permissions of this VMA. */
        unsigned long vm_flags;         /* Flags, listed below. */

        rb_node_t vm_rb;

        /*
         * For areas with an address space and backing store,
         * one of the address_space->i_mmap{,shared} lists,
         * for shm areas, the list of attaches, otherwise unused.
         */
        struct vm_area_struct *vm_next_share;
        struct vm_area_struct **vm_pprev_share;

        /* Function pointers to deal with this struct. */
        struct vm_operations_struct * vm_ops;

        /* Information about our backing store: */
        unsigned long vm_pgoff;         /* Offset (within vm_file) in PAGE_SIZE
                                           units, *not* PAGE_CACHE_SIZE */
        struct file * vm_file;          /* File we map to (can be NULL). */
        unsigned long vm_raend;         /* XXX: put full readahead info here. */
        void * vm_private_data;         /* was vm_pte (shared mem) */
};
```
- Looking at each field:
- `vm_mm` - The struct mm_struct this VMA belongs to.
- `vm_start`, `vm_end` - The [start, end) addresses for this region :].
- `vm_next` - The next `vm_area_struct` in the region - all VMAs in an address space are linked together in an address-ordered singly linked list. Interestingly, this is one of a very few cases where a singly-linked list is used in the kernel.
- `vm_page_prot` - Protection flags that are set for each PTE in this VMA, described below.
- `vm_flags` - Protections/properties of the VMA itself, described below.
- `vm_rb` - As well as being stored in a linked list, all VMAs are stored on a red-black tree for fast lookups. This is important for page fault handling where finding the correct region quickly is important, especially for a large number of mapped regions.
- `vm_next_share`, `vm_pprev_share` - Links together shared VMA regions based on file mappings (e.g. shared libraries.)
- `vm_ops` - Of type struct vm_operations_struct containing function pointers for `open()`, `close()` and `nopage()` - used for syncing information with the disk.
- `vm_pgoff` - The page-aligned offset within the file that is memory-mapped.
- `vm_file` - The struct file pointer to the file being mapped.
- `vm_raend` - The end address of a 'read-ahead window'. When a fault occurs, a number of additional pages after the desired page will be paged in. This field determines how many additional pages are faulted in.
- `vm_private_data` - Used by some drivers to store private information - unused by the memory manager.
- The available PTE protection flags (as used in `vm_page_prot`) are as follows (as discussed in 3.2):
- _PAGE_PRESENT - The page is resident in memory and not swapped out.
- _PAGE_RW - Page can be written to.
- _PAGE_USER - Page is accessible from user space.
- _PAGE_ACCESSED - Page has been accessed.
- _PAGE_DIRTY - Page has been written to.
- _PAGE_PROTNONE - The page is resident, but not accessible.
-
The available VMA protection/property flags as set in `vm_flags` are as follows:
-
Protection flags:
- `VM_READ` - Pages may be read.
- `VM_WRITE` - Pages may be written.
- `VM_EXEC` - Pages may be executed.
- `VM_SHARED` - Pages may be shared.
- `VM_DONTCOPY` - VMA will not be copied on fork.
- `VM_DONTEXPAND` - Prevents a region from being resized. Unused.
-
`mmap`-related flags:
- `VM_MAYREAD` - Allows the `VM_READ` flag to be set.
- `VM_MAYWRITE` - Allows the `VM_WRITE` flag to be set.
- `VM_MAYEXEC` - Allows the `VM_EXEC` flag to be set.
- `VM_MAYSHARE` - Allows the `VM_SHARED` flag to be set.
- `VM_GROWSDOWN` - Shared segment (probably stack) may grow down.
- `VM_GROWSUP` - Shared segment (probably heap) may grow up.
- `VM_SHM` - Pages are used by shared SHM memory segment.
- `VM_DENYWRITE` - Unused. What `MAP_DENYWRITE` for mmap() translates to (linux ignores it.)
- `VM_EXECUTABLE` - Unused. What `MAP_EXECUTABLE` for mmap() translates to (linux ignores it.)
- `VM_STACK_FLAGS` - Flags used by setup_arg_pages() to set up the stack.
-
Locking flags:
- `VM_LOCKED` - If set, the pages will not be swapped out. Set by the userland function (hence system call) mlock().
- `VM_IO` - Signals that the area is an `mmap`-ed region for I/O to a device - also prevents the region from being core dumped.
- `VM_RESERVED` - Do not swap out this region, it is being used by device drivers.
-
`madvise()` flags, set via the userland function (hence system call) madvise():
- `VM_SEQ_READ` - A hint that pages will be accessed sequentially.
- `VM_RAND_READ` - A hint that read-ahead in the region is not useful.
-
All the VMA regions are linked together in a linked list via `vm_next`. When searching for a free area, it's simple to traverse the list.
However, when searching for a page in particular (for example during page faulting), using the red-black tree is more efficient, as it provides average O(lg N) search time vs. the O(N) of a linear search. The tree is ordered such that lower addresses than the current node are on the left leaf, and higher addresses are on the right.
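A sketch of the lookup, modelled on 2.4's find_vma() tree walk (simplified - the real function checks mmap_cache first):

```c
struct vm_area_struct *vma = NULL;
rb_node_t *rb_node = mm->mm_rb.rb_node;

while (rb_node) {
        struct vm_area_struct *tmp =
                rb_entry(rb_node, struct vm_area_struct, vm_rb);

        if (tmp->vm_end > addr) {
                vma = tmp;              /* lowest VMA ending above addr so far */
                if (tmp->vm_start <= addr)
                        break;          /* addr falls inside this VMA */
                rb_node = rb_node->rb_left;
        } else
                rb_node = rb_node->rb_right;
}
/* vma is now the lowest VMA with vm_end > addr, or NULL if none exists. */
```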
- There are three operations which a VMA may support - `open()`, `close()` and `nopage()`, provided via a struct vm_operations_struct at `vma->vm_ops`:
```c
/*
 * These are the virtual MM functions - opening of an area, closing and
 * unmapping it (needed to keep files on disk up-to-date etc), pointer
 * to the functions called when a no-page or a wp-page exception occurs.
 */
struct vm_operations_struct {
        void (*open)(struct vm_area_struct * area);
        void (*close)(struct vm_area_struct * area);
        struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused);
};
```
-
The `open()` and `close()` operations are called every time a region is created or deleted. Only a small number of devices care about these - in 2.4.22 only one filesystem and the System V shared region functionality use them.
-
The main callback of interest is the `nopage()` function. It is used during a page-fault by do_no_page(). It is responsible for locating the page in the page cache, or allocating a page and populating it with the required data before returning it.
-
Most files that are mapped will use a generic `vm_operations_struct` called generic_file_vm_ops:
```c
static struct vm_operations_struct generic_file_vm_ops = {
        nopage:         filemap_nopage,
};
```
- This provides a `nopage` entry of filemap_nopage(), which either locates the page in the page cache, or reads the information from disk.
-
In the event that a region is backed by a file, the `vm_file` field leads (via the file's inode) to a struct address_space containing information of relevance to the filesystem, such as the number of dirty pages that need to be flushed to disk.
-
Let's review the `address_space` struct once again:
```c
struct address_space {
        struct list_head        clean_pages;    /* list of clean pages */
        struct list_head        dirty_pages;    /* list of dirty pages */
        struct list_head        locked_pages;   /* list of locked pages */
        unsigned long           nrpages;        /* number of total pages */
        struct address_space_operations *a_ops; /* methods */
        struct inode            *host;          /* owner: inode, block_device */
        struct vm_area_struct   *i_mmap;        /* list of private mappings */
        struct vm_area_struct   *i_mmap_shared; /* list of shared mappings */
        spinlock_t              i_shared_lock;  /* and spinlock protecting it */
        int                     gfp_mask;       /* how to allocate the pages */
};
```
- Looking at each field:
- `clean_pages` - A list of clean pages that require no synchronisation with backing storage.
- `dirty_pages` - A list of dirty pages that do require synchronisation with backing storage.
- `locked_pages` - A list of pages that are locked in memory.
- `nrpages` - Number of resident pages in use by the address space.
- `a_ops` - A struct address_space_operations for manipulating the filesystem. Each filesystem uses its own, though some may use generic functions.
- `host` - The host inode the file belongs to.
- `i_mmap` - A list of private mappings using this `address_space`.
- `i_mmap_shared` - A list of VMAs that share mappings in this `address_space`.
- `i_shared_lock` - A spinlock used to protect this structure.
- `gfp_mask` - The mask to use when calling __alloc_pages() for new pages.
- Periodically, the memory manager will need to flush information to disk which is performed via struct address_space_operations. Let's look at this data structure:
```c
struct address_space_operations {
        int (*writepage)(struct page *);
        int (*readpage)(struct file *, struct page *);
        int (*sync_page)(struct page *);
        /*
         * ext3 requires that a successful prepare_write() call be followed
         * by a commit_write() call - they must be balanced
         */
        int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
        int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
        /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
        int (*bmap)(struct address_space *, long);
        int (*flushpage) (struct page *, unsigned long);
        int (*releasepage) (struct page *, int);
#define KERNEL_HAS_O_DIRECT /* this is for modules out of the kernel */
        int (*direct_IO)(int, struct inode *, struct kiobuf *, unsigned long, int);
#define KERNEL_HAS_DIRECT_FILEIO /* Unfortunate kludge due to lack of foresight */
        int (*direct_fileIO)(int, struct file *, struct kiobuf *, unsigned long, int);
        void (*removepage)(struct page *);      /* called when page gets removed from the inode */
};
```
- Looking at each field:
- `writepage` - Writes a page to disk. The offset within the file to write is stored within struct page. It is up to the filesystem-specific code to find the block. See block_write_full_page() for more details.
- `readpage` - Reads a page from disk. See block_read_full_page() for more details.
- `sync_page` - Synchronises a dirty page with disk. See block_sync_page() for more details.
- `prepare_write` - This is called before data is copied from userspace into a page that will be written to disk. With a journaled filesystem, this ensures the filesystem log is up to date. With an unjournaled filesystem, it makes sure the pages are allocated. See block_prepare_write() (and subsequently __block_prepare_write()) for more details.
- `commit_write` - After the data has been copied from userspace, this function is called to commit the information to disk. See block_commit_write() (and subsequently __block_commit_write()) for more details.
- `bmap` - Maps a block so that raw I/O can be performed. It's mostly useful to filesystem-specific code, although it is also used when swapping out pages that are backed by a swap file instead of a swap partition.
- `flushpage` - Makes sure there is no I/O pending on a page before releasing it. See discard_bh_page() for more details.
- `releasepage` - Tries to flush all the buffers associated with a page before freeing the page itself. See try_to_free_buffers() for more details.
- `direct_IO` - This function is used when performing direct I/O to an inode. The `#define` here is used so external modules can determine whether the function is available at compile time, as it was only introduced in 2.4.21.
- `direct_fileIO` - Used to perform direct I/O with a struct file. Again, the `#define` is present for the same reason as above.
- `removepage` - An optional callback that is used when a page is removed from the page cache in remove_page_from_inode_queue().
-
The system call mmap() is provided for creating new memory regions within a process. For i386, the function calls sys_mmap2() which calls do_mmap2() directly, with the same parameters.
-
`do_mmap2()` is responsible for determining the parameters needed by do_mmap_pgoff(), which is the principal function for creating new areas for all architectures.
-
`do_mmap2()` first clears the `MAP_DENYWRITE` and `MAP_EXECUTABLE` bits from the `flags` parameter because they are ignored by linux.
-
If a file is being mapped, `do_mmap2()` will look up its struct file based on the file descriptor passed as a parameter, then acquire the `mm_struct->mmap_sem` semaphore before calling do_mmap_pgoff().
-
`do_mmap_pgoff()` starts by doing some sanity checks:
-
Ensuring that the appropriate filesystem or device functions are available if a file or device is being mapped.
-
Ensuring that the size of the mapping is page-aligned and doesn't attempt to create a mapping in the kernel portion of the address space.
-
Ensuring that the size of the mapping does not overflow the range of `pgoff`.
-
Ensuring that the process does not have too many mapped regions already.
- The rest of the function is rather large and involved, but can broadly-speaking be described by the following steps:
-
Sanity check the parameters.
-
Find a free linear address space large enough for the memory mapping. If a filesystem or device-specific `get_unmapped_area()` function is provided in the file's `f_op` field (of type struct file_operations) it is used, otherwise arch_get_unmapped_area() is called.
-
Calculate the VM flags and check them against the file access permissions.
-
If an old area exists where the mapping is to take place, fix it up so it is suitable for the new mapping (i.e. the old mapping is deleted.)
-
Allocate a struct vm_area_struct from the slab allocator and fill in its entries.
-
Link in the new VMA.
-
Call the filesystem or device-specific `mmap` function.
-
Update statistics and exit.
NOTE: This is present in the book but not relevant to any of the sections below which reference these functions, so for clarity I'm putting this into a separate section.
-
find_vma() - Finds the VMA that covers a given address. Importantly, if the region does not exist, it returns the VMA closest to the requested address.
-
find_vma_prev() - The same as `find_vma()` except it also provides the VMA pointing to the returned VMA. It is rarely used, as typically red-black tree nodes will be required as well, so `find_vma_prepare()` is used instead (sys_mprotect() is a notable exception.)
-
find_vma_prepare() - The same as `find_vma_prev()` except it will also provide the red-black tree nodes needed to perform an insertion into the tree.
-
find_vma_intersection() - Returns the VMA that intersects the specified address range. This is useful for checking whether a linear address region is in use by any VMA.
-
vma_merge() - Attempts to expand the specified VMA to cover a new address range. If the VMA can't be expanded forwards, then the next VMA is checked to see if it might be expanded backwards to cover the address range instead. Regions may be merged only if there is no file/device mapping, and permissions match.
-
get_unmapped_area() - Returns the address of a free region of memory large enough to cover the requested size of memory. Used principally when a new VMA is to be created.
-
insert_vm_struct() - Inserts a new VMA into a linear address space.
-
A common operation is to determine the VMA that a particular address belongs to, such as during operations like page faulting. This is handled via find_vma().
-
`find_vma()` first checks the `mmap_cache` field, which caches the result of the last call to `find_vma()`, as it is likely the same region will be needed a few times in succession.
-
If `mmap_cache` doesn't match, the red-black tree stored in the `mm_rb` field is traversed.
-
If the desired address is not contained within any VMA, the function will return the VMA closest to the requested address. This is somewhat unexpected and confusing, so it's important callers double-check to ensure the returned VMA contains the desired address (or take the appropriate action if not.)
-
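Because of this, callers typically follow the pattern in this sketch (the error value is assumed, for illustration):

```c
struct vm_area_struct *vma = find_vma(mm, addr);

/* find_vma() may return a VMA lying *above* addr, so the start address
 * must be checked as well. */
if (!vma || addr < vma->vm_start)
        return -EFAULT;         /* no region covers addr */
```
-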
find_vma_prev() is the same as `find_vma()` but also returns a pointer to the VMA prior to the located one (they are linked together using a singly-linked list so this makes sense.) It is very similar to find_vma_prepare(), which additionally provides the red-black tree nodes needed for inserting into the tree.
-
These previous-VMA functions are rarely used but a notable use case is where two VMAs are being compared to determine if they may be merged. An additional use case is where a memory region is being removed and the singly-linked-list needs updating.
-
The last function of note for searching VMAs is find_vma_intersection() which is used to find a VMA that overlaps a given address range. The most notable use case for this is during a call to sys_brk() when a region is growing upwards - it's important to ensure the growing region will not overlap an old region.
-
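The 2.4 implementation of find_vma_intersection() is a one-line check on top of find_vma() - roughly:

```c
static inline struct vm_area_struct *
find_vma_intersection(struct mm_struct *mm,
                      unsigned long start_addr, unsigned long end_addr)
{
        struct vm_area_struct *vma = find_vma(mm, start_addr);

        /* find_vma() returns the closest VMA ending above start_addr - it
         * only intersects [start_addr, end_addr) if it starts below end_addr. */
        if (vma && end_addr <= vma->vm_start)
                vma = NULL;
        return vma;
}
```
-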
When a new area is to be memory mapped, a free region has to be found that is large enough to contain the new mapping. The function responsible for finding a free area is get_unmapped_area().
-
If a device is being mapped, such as a video card, the associated `get_unmapped_area()` function is called (via the struct file_operations `f_op` field.) This is because devices or files may have additional requirements for mapping above and beyond the generic code, e.g. alignment requirements.
-
If there are no special requirements, the architecture-specific function arch_get_unmapped_area() is called. There is a generic version of this available for architectures which don't need to do anything unusual here (i386 included.)
-
(In theory) the principal function for inserting a new memory region is insert_vm_struct(), which subsequently calls __vma_link() to link in the new VMA.
-
However, this function is rarely called directly because it does not increase the `map_count` field - rather, __insert_vm_struct() is used, which performs the same tasks but also increments this field. LS: the code is literally duplicated apart from `mm->map_count++` :)
-
Two varieties of linking functions are provided - vma_link() and __vma_link().
-
`vma_link()` is intended for use when no locks are held. It'll acquire all the necessary locks, including locking the file if the VMA is a file mapping, before calling `__vma_link()`, which assumes the appropriate locks are held.
-
Many functions do not use the insert_vm_struct() functions and instead prefer to call find_vma_prepare() themselves followed by a later vma_link() to avoid having to traverse the tree multiple times.
-
The linking in __vma_link() consists of three stages that are contained in 3 separate functions:
-
__vma_link_list() inserts the VMA into the linear, singly-linked list and, if it is the first mapping in the address space, it becomes the head of the list.
-
The red-black node is then linked into the tree via __vma_link_rb().
-
Finally, the file share mappings are fixed up via __vma_link_file(), which inserts the VMA into the linked list of VMAs using the `vm_pprev_share` and `vm_next_share` fields.
-
Prior to 2.4.22 linux used to have a `merge_segments()` function that merged adjacent regions of memory if the file and permissions matched. However, this ended up being overly expensive (esp. in sys_mprotect()) so it was removed.
-
The equivalent function in 2.4.22 is vma_merge() and is only used in two places.
-
The first is do_mmap_pgoff() (via sys_mmap2()), which calls it if an anonymous region is being mapped, as these are frequently mergeable.
-
The second is during do_brk(), which is expanding one region into a newly allocated one where the two regions should be merged. Rather than merging two regions, `vma_merge()` checks if an existing region can be expanded to satisfy the new allocation, negating the need to create a new region (as discussed previously, a region can be expanded if there are no file or device mappings and the permissions of the two areas are the same.)
-
Regions are merged elsewhere, although no function is explicitly called to perform the merging, for example in sys_mprotect() during the fixup of areas where two regions will be merged if permissions match after a permissions change, or during a call to move_vma() when it is likely that similar regions will be located beside each other.
-
mremap() is a userland function (hence system call) that allows an existing memory mapping to be grown or shrunk. It is implemented by sys_mremap(), which may move a memory region if it is growing or would overlap another region, and `MREMAP_FIXED` is not specified in the flags.
-
If a region is to be moved, do_mremap() first calls get_unmapped_area() to find a region large enough to contain the new resized mapping and then calls move_vma() to move the old VMA to the new location.
-
`move_vma()` does the following:
-
It first checks if the new location can be merged with the VMAs adjacent to the new location. If they cannot be merged, a new VMA is allocated.
-
move_page_tables() is called, which copies all the page table entries from the old mapping to the new one, literally one PTE at a time. Though there might be better ways of moving the page tables, doing it this way makes error recovery trivial, as backtracking is relatively straightforward.
-
The contents of the pages are not copied - rather, zap_page_range() is called to swap out or remove all the pages from the old mapping, and the normal page fault handling code will swap the pages back in from backing storage/files or will call the device-specific `do_nopage()` function.
-
Linux can lock pages from an address range into memory using the userland function (hence system call) mlock(), which is implemented by sys_mlock(). Them being 'locked' means that they will not be swapped out to disk.
-
At a high level `mlock` is simple - it creates a VMA for the address range to be locked, sets the `VM_LOCKED` flag on it and forces all the pages to be present via make_pages_present().
-
mlockall() is another userland function/system call that does the same thing as `mlock()`, only for every VMA in the calling process. It is implemented via sys_mlockall().
-
Both `mlock` and `mlockall` use do_mlock() to do the heavy lifting - finding the affected VMAs and deciding which function is needed to fix up the regions.
There are some limitations as to what memory can be locked:
-
The address range must be page-aligned, because VMAs are page-aligned. This can be addressed simply by rounding the range up to the nearest page-aligned range (see the sketch after this list.)
-
The process limit `RLIMIT_MLOCK` may not be exceeded. This means that each process may only lock half of physical memory at a time. This seems a bit silly since processes can fork and continue to lock further pages, however you need root permissions to do it so if a system admin is being this stupid it's their funeral :)
-
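The rounding mentioned above looks roughly like the following, modelled on 2.4's sys_mlock() (simplified):

```c
/* Round the range out to whole pages: the length up, the start down. */
len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
start &= PAGE_MASK;
```
-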
The userland functions (hence system calls) munlock() and munlockall() provide the corollary for the locking functions and are implemented by sys_munlock() and sys_munlockall() respectively, which both rely on do_mlock() to fix up memory regions.
-
These functions are a lot simpler than the locking ones, as they do not have to make as many checks.
- When locking or unlocking, VMAs will be affected in one of four ways, each of which must be fixed up by mlock_fixup():
-
The locking affects the whole of the VMA - mlock_fixup_all() is called to fix it up.
-
The locking affects the start of the VMA - this means that a new VMA will have to be allocated to map the new area. This is handled via mlock_fixup_start().
-
The locking affects the end of the VMA - this is handled via the aptly named mlock_fixup_end().
-
The locking affects the middle of the VMA - unsurprisingly, this is handled via mlock_fixup_middle(). This case requires two new VMAs to be allocated.
- Interestingly, VMAs created as a result of locking are never merged, even when unlocked. This is because it is presumed that processes that lock regions will need to lock the same regions over and over again, so it's not worth constantly merging and splitting them.
-
The function responsible for deleting memory regions or parts of memory regions is do_munmap().
-
The process for this is relatively simple compared to the other memory region-related operations and is divided into 3 parts:
-
Fix up the red-black tree for the region that is about to be unmapped.
-
Release pages and PTEs related to the region to be unmapped.
-
Fix up the regions if a hole has been created.
-
To ensure that the red-black tree is ordered correctly, all VMAs that are affected by the unmap are placed on a linked list in a local variable called `free`, then deleted from the red-black tree via rb_erase().
-
The regions, if they still exist, will be added with their new addresses later during the fixup.
-
Next, the linked-list VMAs on `free` are walked to ensure it isn't a partial unmapping - if it is, then this needs to be handled carefully.
-
Regardless of whether the region is being partially or fully unmapped, remove_shared_vm_struct() is called to remove shared file mappings. If it is partial, this will be recreated during fixup.
-
Next, zap_page_range() is called to remove all pages associated with the region about to be unmapped.
-
unmap_fixup() is then called to handle partial unmappings.
-
Finally, free_pgtables() is called to try to free up all page table entries associated with the unmapped region. This isn't an exhaustive process - only full PGD directories and their entries are unmapped. This is because a finer-grained freeing of page table entries would be too expensive for data structures that are both small and likely to be used again.
-
When a process exits it's necessary to unmap all VMAs associated with a struct mm_struct. This is achieved via exit_mmap().
-
`exit_mmap()` simply flushes the CPU cache before walking through the linked list of VMAs, unmapping each of them in turn and freeing up the associated pages, before flushing the TLB and deleting the page table entries.
-
A very important part of a VM system is how kernel address space exceptions (those that aren't bugs) are caught.
-
There are two situations where a bad reference may occur:
-
A process sends an invalid pointer to the kernel via a system call, which the kernel must be able to safely trap, because the only check made initially is that the address is below `PAGE_OFFSET`.
-
The kernel uses copy_from_user() or copy_to_user() to read/write data from userspace.
-
At compile time the linker creates an exception table in the __ex_table section of the kernel code segment, which starts at __start___ex_table and ends (exclusive bound) at __stop___ex_table.
-
Each entry in the exception table is of type struct exception_table_entry, which contains a pair of addresses - `insn` and `fixup` - the execution point and fixup routine respectively.
-
When an exception occurs that the page fault handler can't manage, it calls search_exception_table() to see if a fixup routine has been provided for an error at the faulting instruction. If module support is enabled, each module's exception table will also be searched.
-
If the address of the current exception is found in the table, the corresponding location of the fixup code is returned and executed. Section 4.7 will go into more detail as to how this is used to trap bad reads/writes to userspace.
-
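A sketch of the structure and lookup, modelled on the 2.4 i386 implementation (simplified - the real search is a binary search, repeated over each module's table when module support is enabled):

```c
struct exception_table_entry {
        unsigned long insn;     /* address of the faulting instruction */
        unsigned long fixup;    /* address of the fixup routine */
};

extern const struct exception_table_entry __start___ex_table[];
extern const struct exception_table_entry __stop___ex_table[];

/* Linear variant of the lookup, for clarity. */
static unsigned long search_one_table(unsigned long addr)
{
        const struct exception_table_entry *e;

        for (e = __start___ex_table; e < __stop___ex_table; e++)
                if (e->insn == addr)
                        return e->fixup;
        return 0;       /* no fixup exists - a genuine kernel bug (oops) */
}
```
-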
Pages in the process linear address space are not necessarily resident in memory.
-
For one, allocations made on behalf of a process are not satisfied immediately, because the space is simply reserved within the struct vm_area_struct. Alternatively, a page may have been swapped out to backing storage, or a user may be attempting to write to a read-only page.
-
Linux, like many operating systems, has a Demand Fetch policy for dealing with pages that are not resident - a page is only fetched from backing storage when the hardware raises a page fault exception, which the operating system traps, allocating a page in response.
-
(At the time of 2.4.22) the characteristics of backing storage suggest prefetching would help, but linux is pretty primitive in this regard.
-
When a page is paged in from swap space, a number of pages up to `2^page_cluster` (page_cluster is a global var set in swap_setup()) are read in by swapin_readahead() and placed in the swap cache.
-
Unfortunately, there is only a chance that pages likely to be used soon will be adjacent in the swap area, which makes it a poor pre-paging policy - one that adapts to program behaviour would work better (LS - I am sure this is done a lot better in the modern kernel!)
-
There are two types of page fault - major and minor. Major page faults occur when data has to be read from disk, which is an expensive operation. Those that do not require this are considered minor.
-
Major and minor page faults are tracked in the `maj_flt` and `min_flt` fields of struct task_struct.
-
The page fault handler needs to be able to deal with the following types of page fault (written in the form severity - description - resolution):
-
Minor - Region valid but page not allocated - Allocate a page frame from the physical page allocator.
-
Minor - Region not valid but is beside an expandable region like the stack - Expand the region and allocate a page.
-
Minor - Page swapped out, but present in swap cache - Re-establish the page in the process page tables and drop a reference to the swap cache.
-
Major - Page swapped out to backing storage - Find where the page is stored from the information held in the PTE, and read it in from disk.
-
Minor - Page write when marked read-only - If the page is a COW page make a copy of it, mark it writable and assign it to the process, otherwise send a `SIGSEGV` signal.
-
Error - Region is invalid or process has no permissions to access - Send a `SIGSEGV` signal.
-
Minor - Fault occurred in the kernel portion of address space - If the fault occurred in the `vmalloc` area of the address space, the current process page tables are updated against the master page table held by init_mm. This is the only valid kernel page fault that may occur.
-
Error - Fault occurred in the userspace region while in kernel mode - this means kernel code did not copy from userspace properly and caused a page fault. This is a kernel bug that is taken quite seriously.
-
Each architecture registers an architecture-specific function for the handling of page faults. Though this function can be named arbitrarily, it's normally (and certainly on i386) called do_page_fault().
-
This function is provided with a lot of information - the address of the fault, whether the page was not found or there was a protection error, whether it was a read/write fault, whether the fault occurred in user/kernel space and more.
-
`do_page_fault()` is responsible for determining which type of fault has occurred and how it should be handled by the architecture-independent code:
----------------
| Read address |
| of fault |
----------------
|
v
/------------------------\
/ address > TASK_SIZE && \---\
\ In Kernel Mode / | Yes
\------------------------/ |
| No v
v ------------------
------------------ No /-------------------\ Yes | vmalloc_fault: |
| Find |<---/ in_interrupt() || \----\ ------------------
| vm_area_struct | \ no mm context / | |
------------------ \-------------------/ | |
| | v
v | ---------------
/-------\ No /--------------\ | | Fix up |
/ Valid \------->/ Can grow \ | | page tables |
\ region? / \ nearby region? / | ---------------
\-------/ \--------------/ | |
| Yes Yes | | No | |
| /---------------/ | | v
| | | | /---------\ Yes
| v | | / pte \------\
| /---------\ v | \ _present? / |
| / expand \ No ------------- | \---------/ |
| \ _stack()? /----->| bad_area: | | | No |
| \---------/ ------------- | | |
| | Yes ^ | | | |
v v | | | | |
-------------- /------/ v v | |
| good_area: | | /---------\ Yes --------------- | |
-------------- | / In kernel \--->| no_context: |<-------/ |
| | \ space? / --------------- |
v | \---------/ | |
/-----------\ No | | No | |
/ Permissions \---/ | v |
\ OK? / | /-----------------\ No -------------- |
\-----------/ | / Exception handler \--->| Kernel Bug | |
| Yes | \ exists? / | oops() | |
V | \-----------------/ -------------- |
------------------- | | Yes |
| handle_mm_fault | v v |
------------------- ----------- --------------------- |
| | Send | | Call | |
| | SIGSEGV | | Exception Handler | |
| ----------- --------------------- |
v |
------------------- |
| Fault completed |<----------------------------------------------------------/
-------------------
- handle_mm_fault() is an architecture-independent top-level function for faulting in a page from backing storage, performing Copy-On-Write (COW) amongst other tasks. If it returns 1, the fault was minor; if it returns 2, the fault was major; 0 sends a `SIGBUS` error and any other value invokes the out-of-memory handler.
-
After the exception handler has decided the fault is a valid page fault in a valid memory region, the architecture-independent handle_mm_fault() function takes over.
-
`handle_mm_fault()` allocates the required page table entries if they don't already exist and calls handle_pte_fault().
-
`handle_pte_fault()` checks whether the PTE is present or not via pte_present() and pte_none(). If no PTE has been allocated - `pte_none()` returns true - then do_no_page() is called, which handles Demand Allocation.
-
Otherwise, if the PTE exists but the page is not present in memory, the page has been swapped out to disk and do_swap_page() handles Demand Paging.
-
There is a rare exception where swapped out pages belonging to a virtual file are handled by do_no_page() - page faulting within a virtual file. This case is discussed in section 12.4.
-
If the PTE is write-protected, do_wp_page() is called, as the page is a COW page - as discussed previously, this is a page that is shared between multiple processes (usually parent and child) until a write occurs, after which a private copy is made for the process performing the write.
-
The kernel is able to recognise a COW page because the VMA for the region is marked writable even though the individual PTE is not.
-
If the page is not COW, the page is simply marked dirty because it has been written to.
-
Finally, a page can be read and be marked present and still encounter a fault - this happens for some architectures which do not have a 3-level page table. In this case, the PTE is simply established and marked young.
-
When a process accesses a page for the very first time, the page has to be allocated and (possibly) filled with data via do_no_page().
-
Additionally, if the struct vm_operations_struct associated with the parent VMA (`vma->vm_ops`) provides a `nopage()` function, it is called - this matters for example for memory-mapped devices such as video cards, which need to allocate a page and supply data on access, or perhaps for a mapped file which must retrieve its data from backing storage.
-
Let's consider a couple of cases:
-
If the `vma->vm_ops` field is not filled in or a `nopage()` function is not supplied, the function do_anonymous_page() is called to handle an anonymous access.
-
There are two cases to consider:
-
First time read - this is an easy case to deal with, because no data exists. In this case the system-wide empty_zero_page, which is just a page of zeros, is mapped for the PTE and it is marked write-protected. On i386 this page is zeroed out in mem_init.
-
First time write - alloc_page() is called to allocate a free page (discussed further in chapter 6), which is zero-filled by clear_user_highpage(). Assuming the page was successfully allocated, the `rss` field in the struct mm_struct will be incremented and `flush_page_to_ram()` is called on some brain-dead architectures that aren't i386 :) The page is then inserted on the LRU lists so it may be reclaimed later by the page reclaiming code. Finally, the page table entries for the process are updated for the new mapping.
-
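A condensed sketch of both cases, modelled on 2.4's do_anonymous_page() (simplified - locking and error paths are mostly omitted):

```c
pte_t entry;

if (!write_access) {
        /* First read: map the global zero-filled page, write-protected. */
        entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
} else {
        /* First write: allocate and zero a fresh page. */
        struct page *page = alloc_page(GFP_HIGHUSER);

        if (!page)
                return -1;      /* out of memory */
        clear_user_highpage(page, addr);
        mm->rss++;
        flush_page_to_ram(page);        /* no-op on i386 */
        entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
        lru_cache_add(page);            /* make the page reclaimable */
}
set_pte(page_table, entry);
```
-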
If a VMA is backed by a file or device, a `nopage()` function will be provided within the VMA's `vm_ops` struct vm_operations_struct.
-
In the case of a file-backed VMA, the function filemap_nopage() is often used for this - providing functionality to allocate a page and read a page-sized amount of data from disk.
-
Pages backed by a virtual file, such as those provided by shmfs for example (later renamed to tmpfs), will use the function shmem_nopage() (see chapter 12 for more details on this.)
-
Each device driver provides a different `nopage()` - we don't care about the internals of each, so long as it returns a `struct page` to use.
-
Once the page is returned, checks are made to ensure a page was successfully allocated and appropriate errors were returned if not.
-
A check is then performed to see if an early COW break should take place - if the fault is caused by a write to the page and the `VM_SHARED` flag is not set in the managing VMA, then a break will occur.
-
An 'early COW break' is a case of allocating a new page and copying the data across before reducing the reference count to the page returned by the `nopage()` function.
-
In either case, a check is then made using pte_none() to ensure that a PTE is not already in the page table that is about to be used. This is because, in SMP, two faults could occur for the same page at close to the same time and, because the spinlocks are not held for the full duration of the fault, the check has to be made at the last instant. If there has been no race, the PTE is assigned, stats are updated and the brain-dead architecture hooks for cache coherency are called.
-
When a page is swapped out to backing storage, the function do_swap_page() is responsible for reading the page back in (with the exception of virtual files, which are covered in chapter 12.)
-
The information needed to find the page in the swap is stored within the PTE itself (LS - but surely it's swapped out?! Maybe there's a shared page it references which it indexes in to? Confused). Because pages may be shared between multiple processes, they cannot always be swapped out immediately - instead they are placed within the 'swap cache' in this situation.
-
A shared page cannot be swapped out immediately because (in 2.4.22) there is no way of mapping a struct page to all the PTEs of each process it is shared between - searching the page tables of all processes is simply too expensive. In more recent kernels (well at least in early 2.6, the implementation may have changed since) there is 'Reverse Mapping (RMAP)' which provides means of determining this information.
-
Given there is a swap cache, it's possible that when a fault occurs the page still exists there. In this case, the reference count to the page is simply increased and it is placed within the process page tables again, and registers as a minor page fault.
-
If the page exists only on disk, swapin_readahead() is called which reads the requested page and a number of pages after it. The number of pages read is determined by page_cluster (on i386 this is set to 3, or 2 if there is less than 16MiB of RAM on the system.)
-
The number of pages read is `2^page_cluster`, unless a bad or empty swap entry is encountered. Read-ahead is performed on the assumption that seeking is the most expensive operation time-wise, so after it is complete succeeding pages should also be read in.
-
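A sketch of the read-ahead loop, close to the 2.4 implementation of swapin_readahead():

```c
void swapin_readahead(swp_entry_t entry)
{
        unsigned long offset;
        /* valid_swaphandles() caps the count at 1 << page_cluster and
         * accounts for bad/empty slots near the target entry. */
        int i, num = valid_swaphandles(entry, &offset);

        for (i = 0; i < num; offset++, i++) {
                struct page *page =
                        read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), offset));
                if (!page)
                        break;
                /* Drop our reference - the page remains in the swap cache. */
                page_cache_release(page);
        }
}
```
-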
Once upon a time a fork would mean the entire parent address space would have to be duplicated - an extremely expensive operation, because it could mean a significant percentage of the process would have to be swapped in from backing storage.
-
Copy-on-Write (COW) is a solution to this issue - during a fork the PTEs of both the parent and the child process are made read-only so that when a write occurs, there will be a page fault.
-
As discussed previously, linux recognises a COW page because even though the PTE is write-protected, the controlling VMA shows the region is writable.
-
The function do_wp_page() handles this case by making a copy of the page and assigning it to the writing process. If necessary, a new swap slot will be reserved for the page.
-
Using the COW method, only the page table entries have to be copied during a fork - rather more efficient overall!
-
It is not safe to access memory in the process address space directly, because there is no way to quickly check if the page addressed is resident or not.
-
Linux relies on the MMU to raise exceptions when the address is invalid, and uses the Page Fault Exception handler to catch the exception and clean up.
-
When we are in kernel mode we can't do this in the usual way, as we don't have anybody to clean up our mess for us - as previously discussed, kernel page faults only make sense when interacting with the `vmalloc` region.
-
In the i386 case, checks are performed by access_ok() before __copy_user() or __copy_user_zeroing() which use assembly to provide fixup code for cases where the address really is fucked - this will be found by search_exception_table() on page fault and trapped, preventing the general kernel page fault code from running and a potential oops, and instead running fixup code which zeros the remaining buffer space.
-
Let's take a look at the functions for safely copying data to/from userland:
- copy_from_user() - Copies bytes from the user address space to the kernel address space.
- copy_to_user() - Copies bytes from the kernel address space to the user address space.
- copy_user_page() - Copies data to an anonymous or COW page in userspace. Brain-dead architectures need to be careful about cache aliasing. LS - Doesn't seem to do any checks?!
- clear_user_page() - Similar to `copy_user_page()`, only it zeros a page instead. LS - doesn't seem to do any checks?!
- get_user() - Copies an integer value from the user address space to the kernel address space.
- put_user() - Copies an integer value from the kernel address space to the user address space.
- strncpy_from_user() - Copies a null-terminated string of maximum size as specified from the user address space to the kernel address space.
- strlen_user() - Determines the length (with specified upper bound) of the userspace string including the trailing NULL byte.
- access_ok() - Returns non-zero if the specified userspace block of memory is valid, or 0 if not.
-
Generally these functions are implemented as macros and all behave relatively similarly.
-
Considering copy_from_user() on i386 - if the size of the copy is known at compile time, it uses __constant_copy_from_user(), otherwise it uses __generic_copy_from_user().
-
In the constant copy case, the code eventually calls __constant_copy_user() which optimises based on the bytes of the 'stride' of the data being read.
-
In the generic copy case, it eventually calls __copy_user_zeroing(). As discussed above, this sets an exception table for search_exception_table() to find which allows fixup code to zero the remains of the buffer.
-
Doing things this way allows the kernel to safely access userspace without performing overly expensive checks first.
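As a usage sketch, a typical consumer looks like the following (the struct and function names here are hypothetical, for illustration only):

```c
struct my_args {                        /* hypothetical argument block */
        unsigned long addr;
        unsigned long len;
};

long my_syscall(struct my_args *uargs)  /* hypothetical syscall body */
{
        struct my_args kargs;

        /* copy_from_user() returns the number of bytes that could NOT be
         * copied, so any non-zero result means the user buffer was partly
         * or wholly invalid. */
        if (copy_from_user(&kargs, uargs, sizeof(kargs)))
                return -EFAULT;

        /* kargs is now a safe kernel-space copy of the user's data. */
        return 0;
}
```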