This repository has been archived by the owner on Dec 3, 2022. It is now read-only.

process.md


Process Address Space

64-bit Address Space

Userland (128TiB)

                        0000000000000000 -> |---------------| ^
                                            |    Process    | |
                                            |    address    | | 128 TiB
                                            |     space     | |
                        0000800000000000 -> |---------------| v

             .        ` .     -                 `-       ./   _
                      _    .`   -   The netherworld of  `/   `
            -     `  _        |  /      unavailable sign-extended -/ .
             ` -        .   `  48-bit address space  -     \  /    -
           \-                - . . . .             \      /       -

Kernel (128TiB)

                        ffff800000000000 -> |----------------| ^
                                            |   Hypervisor   | |
                                            |    reserved    | | 8 TiB
                                            |      space     | |
        __PAGE_OFFSET = ffff880000000000 -> |----------------| x
                                            | Direct mapping | |
                                            |  of all phys.  | | 64 TiB
                                            |     memory     | |
                        ffffc80000000000 -> |----------------| v
                                            /                /
                                            \      hole      \
                                            /                /
        VMALLOC_START = ffffc90000000000 -> |----------------| ^
                                            |    vmalloc/    | |
                                            |    ioremap     | | 32 TiB
                                            |     space      | |
      VMALLOC_END + 1 = ffffe90000000000 -> |----------------| v
                                            /                /
                                            \      hole      \
                                            /                /
        VMEMMAP_START = ffffea0000000000 -> |----------------| ^
                                            |     Virtual    | |
                                            |   memory map   | | 1 TiB
                                            |  (struct page  | |
                                            |     array)     | |
                        ffffeb0000000000 -> |----------------| v
                                            /                /
                                            \    'unused'    \
                                            /      hole      /
                                            \                \
                        ffffec0000000000 -> |----------------| ^
                                            |  Kasan shadow  | | 16 TiB
                                            |     memory     | |
                        fffffc0000000000 -> |----------------| v
                                            /                /
                                            \    'unused'    \
                                            /      hole      /
                                            \                \
     ESPFIX_BASE_ADDR = ffffff0000000000 -> |----------------| ^
                                            |   %esp fixup   | | 512 GiB
                                            |     stacks     | |
                        ffffff8000000000 -> |----------------| v
                                            /                /
                                            \    'unused'    \
                                            /      hole      /
                                            \                \
           EFI_VA_END = ffffffef00000000 -> |----------------| ^
                                            |   EFI region   | | 64 GiB
                                            | mapping space  | |
         EFI_VA_START = ffffffff00000000 -> |----------------| v
                                            /                /
                                            \    'unused'    \
                                            /      hole      /
                                            \                \
   __START_KERNEL_map = ffffffff80000000 -> |----------------| ^
                                            |  Kernel text   | | 512 MiB
                                            |    mapping     | |
        MODULES_VADDR = ffffffffa0000000 -> |----------------| x
                                            |     Module     | |
                                            |    mapping     | | 1.5 GiB
                                            |     space      | |
                        ffffffffff600000 -> |----------------| x
                                            |   vsyscalls    | | 8 MiB
                        ffffffffffe00000 -> |----------------| v
                                            /                /
                                            \    'unused'    \
                                            /      hole      /
                                            \                \
                                            ------------------
  • Current x86-64 implementations use only the lower 48 bits of an address; the remaining higher order bits must all be equal to the 48th bit (bit 47), i.e. the address must be sign-extended. The allowable addresses are therefore 128TiB (47 bits) in the 0000000000000000 - 00007fffffffffff range and 128TiB (47 bits) in the ffff800000000000 - ffffffffffffffff range, which divides the address space into lower and upper portions.

  • In fact, according to the x86-64 memory map doc, only 46 bits (64TiB) of RAM are supported. This makes sense, as it allows the entire physical address space to be mapped into the kernel while leaving another 64TiB free for other kernel data.

  • In Linux, like most (if not all?) other operating systems, this provides a nice separation between kernel and user address space for free: kernel addresses are kept in the upper portion and user addresses in the lower portion.
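
The sign-extension rule can be checked mechanically. The following is a minimal userspace sketch, not kernel code: the names VA_BITS, is_canonical() and is_kernel_half() are made up for illustration, and 48 implemented address bits are assumed as described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48 implemented virtual address bits, as described above. */
#define VA_BITS 48

/*
 * An address is canonical if bits 63..47 are all zero or all one,
 * i.e. the upper bits are copies of bit 47.
 */
static bool is_canonical(uint64_t addr)
{
        uint64_t top = addr >> (VA_BITS - 1);   /* bits 63..47 (17 bits) */

        return top == 0 || top == (UINT64_MAX >> (VA_BITS - 1));
}

/* Canonical addresses with the top bit set fall in the kernel half. */
static bool is_kernel_half(uint64_t addr)
{
        return is_canonical(addr) && (addr >> 63) != 0;
}
```

With these definitions 00007fffffffffff and ffff800000000000 are both canonical, while 0000800000000000 falls into the unavailable region between the two halves.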

Kernel Address Translation

  • We can see there's a gap reserved for the hypervisor between ffff800000000000 and ffff87ffffffffff, immediately after which the entire 64TiB physical address space is mapped.

  • This allows us to have a simple means of translating from a physical address to a virtual one within the kernel - simply offset the physical one by ffff880000000000. This constant is defined as PAGE_OFFSET (an alias for __PAGE_OFFSET.)

  • More generally to translate from virtual to physical addresses and vice-versa in the kernel, two functions are available - phys_to_virt() (a wrapper around __va()) and virt_to_phys() (a wrapper around __pa().)

  • Given that we have PAGE_OFFSET, __va() is simple:

#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
#define __pa(x) __phys_addr((unsigned long)(x))

...

#define __phys_addr(x) __phys_addr_nodebug(x)

...

static inline unsigned long __phys_addr_nodebug(unsigned long x)
{
        unsigned long y = x - __START_KERNEL_map;

        /* use the carry flag to determine if x was < __START_KERNEL_map */
        x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));

        return x;
}
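  • What is __START_KERNEL_map (which is defined as ffffffff80000000)? It is the virtual address at which the kernel text mapping begins, which is why __pa() is more involved than __va(): addresses at or above it are translated using phys_base, while direct mapping addresses below it are translated by subtracting PAGE_OFFSET. We can see the kernel text mapping in the memory map: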
...
ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
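
To see how the two cases in __phys_addr_nodebug() work out, here is a userspace model of the translation using plain 64-bit arithmetic. It is only a sketch: the model_* names are invented for illustration, the constants are copied from the memory map above, and phys_base is assumed to be 0 (the kernel loaded at physical address 0).

```c
#include <stdint.h>

/* Constants copied from the memory map above. */
#define MODEL_PAGE_OFFSET      0xffff880000000000ULL
#define MODEL_START_KERNEL_MAP 0xffffffff80000000ULL

/* Assumption: the kernel is loaded at physical address 0. */
static uint64_t model_phys_base = 0;

/* Model of __va(): the direct mapping address of a physical address. */
static uint64_t model_va(uint64_t phys)
{
        return phys + MODEL_PAGE_OFFSET;
}

/* Model of __phys_addr_nodebug(): handles both kinds of kernel address. */
static uint64_t model_pa(uint64_t virt)
{
        /* Wraps around to a huge value if virt < MODEL_START_KERNEL_MAP. */
        uint64_t y = virt - MODEL_START_KERNEL_MAP;

        if (virt > y)
                /* No wrap: virt lies in the kernel text mapping. */
                return y + model_phys_base;

        /* Wrapped: direct mapping, the net effect is virt - PAGE_OFFSET. */
        return y + (MODEL_START_KERNEL_MAP - MODEL_PAGE_OFFSET);
}
```

In this model model_pa(model_va(p)) returns p for any physical address p below 64TiB, and a kernel text address such as ffffffff80100000 maps to physical 100000.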

Memory Descriptors

  • Each process's memory state is represented by a 'memory descriptor', struct mm_struct, which specifies the process's PGD (see page tables), amongst a number of other fields.

  • Looking at the struct mm_struct, assuming x86-64 with a pretty standard configuration in order to strip irrelevant CONFIG_xxx fields:

struct mm_struct {
        struct vm_area_struct *mmap;            /* list of VMAs */
        struct rb_root mm_rb;
        u32 vmacache_seqnum;                   /* per-thread vmacache */

        unsigned long (*get_unmapped_area) (struct file *filp,
                                unsigned long addr, unsigned long len,
                                unsigned long pgoff, unsigned long flags);

        unsigned long mmap_base;                /* base of mmap area */
        unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
        unsigned long task_size;                /* size of task vm space */
        unsigned long highest_vm_end;           /* highest vma end address */
        pgd_t * pgd;
        atomic_t mm_users;                      /* How many users with user space? */
        atomic_t mm_count;                      /* How many references to "struct mm_struct" (users count as 1) */
        atomic_long_t nr_ptes;                  /* PTE page table pages */
        atomic_long_t nr_pmds;                  /* PMD page table pages */

        int map_count;                          /* number of VMAs */

        spinlock_t page_table_lock;             /* Protects page tables and some counters */
        struct rw_semaphore mmap_sem;

        struct list_head mmlist;                /* List of maybe swapped mm's.  These are globally strung
                                                 * together off init_mm.mmlist, and are protected
                                                 * by mmlist_lock
                                                 */

        unsigned long hiwater_rss;      /* High-watermark of RSS usage */
        unsigned long hiwater_vm;       /* High-water virtual memory usage */

        unsigned long total_vm;         /* Total pages mapped */
        unsigned long locked_vm;        /* Pages that have PG_mlocked set */
        unsigned long pinned_vm;        /* Refcount permanently increased */
        unsigned long data_vm;          /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
        unsigned long exec_vm;          /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
        unsigned long stack_vm;         /* VM_STACK */
        unsigned long def_flags;
        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack;
        unsigned long arg_start, arg_end, env_start, env_end;

        unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

        /*
         * Special counters, in some configurations protected by the
         * page_table_lock, in other configurations by being atomic.
         */
        struct mm_rss_stat rss_stat;

        struct linux_binfmt *binfmt;

        cpumask_var_t cpu_vm_mask_var;

        /* Architecture-specific MM context */
        mm_context_t context;

        unsigned long flags; /* Must use atomic bitops to access the bits */

        struct core_state *core_state; /* coredumping support */

        spinlock_t                      ioctx_lock;
        struct kioctx_table __rcu       *ioctx_table;

        /*
         * "owner" points to a task that is regarded as the canonical
         * user/owner of this mm. All of the following must be true in
         * order for it to be changed:
         *
         * current == mm->owner
         * current->mm != mm
         * new_owner->mm == mm
         * new_owner->alloc_lock is held
         */
        struct task_struct __rcu *owner;

        /* store ref to file /proc/<pid>/exe symlink points to */
        struct file __rcu *exe_file;

        struct mmu_notifier_mm *mmu_notifier_mm;

        /*
         * An operation with batched TLB flushing is going on. Anything that
         * can move process memory needs to flush the TLB when moving a
         * PROT_NONE or PROT_NUMA mapped page.
         */
        bool tlb_flush_pending;

        struct uprobes_state uprobes_state;

        atomic_long_t hugetlb_usage;
};

Virtual Memory Areas

     |------------------|      |------------------|      |------------------|
     | struct mm_struct | .... | struct mm_struct | .... | struct mm_struct |
     |------------------|      |------------------|      |------------------|
                                  mmap | ^
                                       | |
                 /---------------------/ \-----------------------\
                 |                                               |
                 v                                               | vm_mm
     |-----------------------|       vm_next         |-----------------------| vm_next
     |                       | --------------------> |                       | - - ->
     | struct vm_area_struct |       vm_prev         | struct vm_area_struct | vm_prev
     |                       | <-------------------- |                       | <- - -
     |-----------------------|                       |-----------------------|
                 ^    | vm_file                                  | vm_ops
                 |    |                                          |
                 |    |                                          v
                 |    \-----------------\        |-----------------------------|
                 |                      |        | struct vm_operations_struct |
                 |                      v        |-----------------------------|
                 |               |-------------|
                 |               | struct file |
    Other VMAs   |               |-------------|
          .      |            f_mapping |
           .     |                      |
            .    |                      v
             \   |           |----------------------|
              \  |           | struct address_space |
               \ |           |----------------------|
                \|         i_mmap |     ^      | a_ops
(red-black tree) \----------------/     |      \---------------\
                                        |                      |
                                        |                      v
                                        |     |----------------------------------|
                                        |     |  struct address_space_operations |
                                        |     |----------------------------------|
                                        |
                       -  - - --/-------X-------\-- - -  -
                                |               |
                        mapping |               | mapping
                         |-------------| |-------------|
                         | struct page | | struct page |
                         |-------------| |-------------|
  • The above diagram shows the relationship between the various process address space structures (far from exhaustively), which we'll get into below:

  • Each process's virtual memory is additionally divided into non-overlapping regions (Virtual Memory Areas or 'VMA's) related by their purpose and protection state.

  • VMAs are represented by the struct vm_area_struct type.

  • A struct mm_struct's VMAs are stored both as a doubly-linked list and as a red-black tree. The head of the linked list is kept in the mm_struct's mmap field, with the previous/next nodes kept in the struct vm_area_struct's vm_prev and vm_next fields, sorted in address order. The red-black tree root is kept in the mm_struct's struct rb_root mm_rb field, with each node kept in the vm_area_struct's struct rb_node vm_rb field.

  • Looking at the struct vm_area_struct (again assuming x86-64 with a pretty standard configuration in order to strip irrelevant CONFIG_xxx fields):

struct vm_area_struct {
        /* The first cache line has the info for VMA tree walking. */

        unsigned long vm_start;         /* Our start address within vm_mm. */
        unsigned long vm_end;           /* The first byte after our end address
                                           within vm_mm. */

        /* linked list of VM areas per task, sorted by address */
        struct vm_area_struct *vm_next, *vm_prev;

        struct rb_node vm_rb;

        /*
         * Largest free memory gap in bytes to the left of this VMA.
         * Either between this VMA and vma->vm_prev, or between one of the
         * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
         * get_unmapped_area find a free area of the right size.
         */
        unsigned long rb_subtree_gap;

        /* Second cache line starts here. */

        struct mm_struct *vm_mm;        /* The address space we belong to. */
        pgprot_t vm_page_prot;          /* Access permissions of this VMA. */
        unsigned long vm_flags;         /* Flags, see mm.h. */

        /*
         * For areas with an address space and backing store,
         * linkage into the address_space->i_mmap interval tree.
         */
        struct {
                struct rb_node rb;
                unsigned long rb_subtree_last;
        } shared;

        /*
         * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
         * list, after a COW of one of the file pages.  A MAP_SHARED vma
         * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
         * or brk vma (with NULL file) can only be in an anon_vma list.
         */
        struct list_head anon_vma_chain; /* Serialized by mmap_sem &
                                          * page_table_lock */
        struct anon_vma *anon_vma;      /* Serialized by page_table_lock */

        /* Function pointers to deal with this struct. */
        const struct vm_operations_struct *vm_ops;

        /* Information about our backing store: */
        unsigned long vm_pgoff;         /* Offset (within vm_file) in PAGE_SIZE
                                           units */
        struct file * vm_file;          /* File we map to (can be NULL). */
        void * vm_private_data;         /* was vm_pte (shared mem) */

        struct mempolicy *vm_policy;    /* NUMA policy for the VMA */

        struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
};
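
Because the list is sorted by address, looking up the VMA for an address amounts to finding the first area that ends above it. The sketch below walks a toy list to show the idea. The toy_* names are hypothetical stand-ins, and it is assumed here that a real lookup would normally go through the red-black tree rather than scanning the list linearly.

```c
#include <stddef.h>

/* Toy stand-in for struct vm_area_struct: only the list-walking fields. */
struct toy_vma {
        unsigned long vm_start;         /* first address in the area */
        unsigned long vm_end;           /* first address after the area */
        struct toy_vma *vm_next;        /* next area, in address order */
        struct toy_vma *vm_prev;        /* previous area */
};

/*
 * Return the first area whose vm_end lies above addr, or NULL if none.
 * The result need not contain addr, since it may start above it.
 */
static struct toy_vma *toy_find_vma(struct toy_vma *head, unsigned long addr)
{
        struct toy_vma *vma;

        for (vma = head; vma != NULL; vma = vma->vm_next)
                if (addr < vma->vm_end)
                        return vma;
        return NULL;
}

/* True if vma is non-NULL and addr falls within [vm_start, vm_end). */
static int toy_vma_contains(const struct toy_vma *vma, unsigned long addr)
{
        return vma != NULL && vma->vm_start <= addr && addr < vma->vm_end;
}
```

A caller has to check both results: an address in the gap between two areas yields the higher area from toy_find_vma(), but toy_vma_contains() then reports that the address is not actually mapped.
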
struct vm_operations_struct {
        void (*open)(struct vm_area_struct * area);
        void (*close)(struct vm_area_struct * area);
        int (*mremap)(struct vm_area_struct * area);
        int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
        int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
                                                pmd_t *, unsigned int flags);
        void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);

        /* notification that a previously read-only page is about to become
         * writable, if an error is returned it will cause a SIGBUS */
        int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

        /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
        int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

        /* called by access_process_vm when get_user_pages() fails, typically
         * for use by special VMAs that can switch between memory and hardware
         */
        int (*access)(struct vm_area_struct *vma, unsigned long addr,
                      void *buf, int len, int write);

        /* Called by the /proc/PID/maps code to ask the vma whether it
         * has a special name.  Returning non-NULL will also cause this
         * vma to be dumped unconditionally. */
        const char *(*name)(struct vm_area_struct *vma);

        /*
         * set_policy() op must add a reference to any non-NULL @new mempolicy
         * to hold the policy upon return.  Caller should pass NULL @new to
         * remove a policy and fall back to surrounding context--i.e. do not
         * install a MPOL_DEFAULT policy, nor the task or system default
         * mempolicy.
         */
        int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);

        /*
         * get_policy() op must add reference [mpol_get()] to any policy at
         * (vma,addr) marked as MPOL_SHARED.  The shared policy infrastructure
         * in mm/mempolicy.c will do this automatically.
         * get_policy() must NOT add a ref if the policy at (vma,addr) is not
         * marked as MPOL_SHARED. vma policies are protected by the mmap_sem.
         * If no [shared/vma] mempolicy exists at the addr, get_policy() op
         * must return NULL--i.e., do not "fallback" to task or system default
         * policy.
         */
        struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
                                        unsigned long addr);
        /*
         * Called by vm_normal_page() for special PTEs to find the
         * page for @addr.  This is useful if the default behavior
         * (using pte_page()) would not find the correct page.
         */
        struct page *(*find_special_page)(struct vm_area_struct *vma,
                                          unsigned long addr);
};
  • Of most note here are fault(), which is called when a page fault occurs, and map_pages(), which maps a range of pages (from pgoff up to and including max_pgoff). Both take a struct vm_fault, which carries their input parameters and also holds the fields through which they return results:
struct vm_fault {
        unsigned int flags;             /* FAULT_FLAG_xxx flags */
        gfp_t gfp_mask;                 /* gfp mask to be used for allocations */
        pgoff_t pgoff;                  /* Logical page offset based on vma */
        void __user *virtual_address;   /* Faulting virtual address */

        struct page *cow_page;          /* Handler may choose to COW */
        struct page *page;              /* ->fault handlers should return a
                                         * page here, unless VM_FAULT_NOPAGE
                                         * is set (which is also implied by
                                         * VM_FAULT_ERROR).
                                         */
        /* for ->map_pages() only */
        pgoff_t max_pgoff;              /* map pages for offset from pgoff till
                                         * max_pgoff inclusive */
        pte_t *pte;                     /* pte entry associated with ->pgoff */
};
const struct vm_operations_struct generic_file_vm_ops = {
        .fault          = filemap_fault,
        .map_pages      = filemap_map_pages,
        .page_mkwrite   = filemap_page_mkwrite,
};
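
The pattern here is an operations table: core code calls through vm_ops without knowing which handler sits behind it, and the handler reports its result through an out field of the fault descriptor. A stripped-down userspace sketch of that pattern follows. All demo_* names are hypothetical and the structures are far simpler than the real ones.

```c
#include <stddef.h>

/* Toy stand-ins; the real structures carry far more state. */
struct demo_page { int id; };

struct demo_vm_fault {
        unsigned long pgoff;            /* in: page offset that faulted */
        struct demo_page *page;         /* out: page chosen by the handler */
};

struct demo_vma;

struct demo_vm_ops {
        int (*fault)(struct demo_vma *vma, struct demo_vm_fault *vmf);
};

struct demo_vma {
        const struct demo_vm_ops *vm_ops;
        struct demo_page *backing;      /* hypothetical array of backing pages */
        unsigned long nr_pages;
};

/* Hypothetical handler: look the page up in the area's backing array. */
static int demo_fault(struct demo_vma *vma, struct demo_vm_fault *vmf)
{
        if (vmf->pgoff >= vma->nr_pages)
                return -1;              /* stand-in for a fault error */
        vmf->page = &vma->backing[vmf->pgoff];
        return 0;
}

static const struct demo_vm_ops demo_ops = { .fault = demo_fault };

/* Core code only knows about the ops table, not the concrete handler. */
static struct demo_page *demo_handle_fault(struct demo_vma *vma,
                                           unsigned long pgoff)
{
        struct demo_vm_fault vmf = { .pgoff = pgoff, .page = NULL };

        if (vma->vm_ops == NULL || vma->vm_ops->fault == NULL)
                return NULL;
        if (vma->vm_ops->fault(vma, &vmf) != 0)
                return NULL;
        return vmf.page;
}
```
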
struct address_space {
        struct inode            *host;          /* owner: inode, block_device */
        struct radix_tree_root  page_tree;      /* radix tree of all pages */
        spinlock_t              tree_lock;      /* and lock protecting it */
        atomic_t                i_mmap_writable;/* count VM_SHARED mappings */
        struct rb_root          i_mmap;         /* tree of private and shared mappings */
        struct rw_semaphore     i_mmap_rwsem;   /* protect tree, count, list */
        /* Protected by tree_lock together with the radix tree */
        unsigned long           nrpages;        /* number of total pages */
        /* number of shadow or DAX exceptional entries */
        unsigned long           nrexceptional;
        pgoff_t                 writeback_index;/* writeback starts here */
        const struct address_space_operations *a_ops;   /* methods */
        unsigned long           flags;          /* error bits/gfp mask */
        spinlock_t              private_lock;   /* for use by the address_space */
        struct list_head        private_list;   /* ditto */
        void                    *private_data;  /* ditto */
};
  • Note the struct rb_root field i_mmap - this provides a red-black tree root listing private and shared VMAs which map the file, including the VMA that references the address space via vm_file->f_mapping.

  • The operations a struct address_space needs to perform are provided via its struct address_space_operations a_ops field, for example reading/writing pages from/to the file, etc.:

struct address_space_operations {
        int (*writepage)(struct page *page, struct writeback_control *wbc);
        int (*readpage)(struct file *, struct page *);

        /* Write back some dirty pages from this mapping. */
        int (*writepages)(struct address_space *, struct writeback_control *);

        /* Set a page dirty.  Return true if this dirtied it */
        int (*set_page_dirty)(struct page *page);

        int (*readpages)(struct file *filp, struct address_space *mapping,
                        struct list_head *pages, unsigned nr_pages);

        int (*write_begin)(struct file *, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned flags,
                                struct page **pagep, void **fsdata);
        int (*write_end)(struct file *, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned copied,
                                struct page *page, void *fsdata);

        /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
        sector_t (*bmap)(struct address_space *, sector_t);
        void (*invalidatepage) (struct page *, unsigned int, unsigned int);
        int (*releasepage) (struct page *, gfp_t);
        void (*freepage)(struct page *);
        ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
        /*
         * migrate the contents of a page to the specified target. If
         * migrate_mode is MIGRATE_ASYNC, it must not block.
         */
        int (*migratepage) (struct address_space *,
                        struct page *, struct page *, enum migrate_mode);
        int (*launder_page) (struct page *);
        int (*is_partially_uptodate) (struct page *, unsigned long,
                                        unsigned long);
        void (*is_dirty_writeback) (struct page *, bool *, bool *);
        int (*error_remove_page)(struct address_space *, struct page *);

        /* swapfile support */
        int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
                                sector_t *span);
        void (*swap_deactivate)(struct file *file);
};
  • A struct page that represents a physical page of memory mapped from a file refers back to the file's struct address_space via its mapping field. The field holds a struct address_space pointer if and only if its low bit is clear; a set low bit marks the page as anonymously mapped.
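
Storing a flag in the low bit of a pointer works because the pointed-to structures are aligned, so that bit is otherwise always zero. A minimal userspace sketch of the check is given below. The tagged_page, toy_address_space, TOY_MAPPING_ANON and toy_page_mapping() names are illustrative only and are not the kernel's own helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct address_space. */
struct toy_address_space { int dummy; };

/* Assumption for this sketch: bit 0 of mapping tags an anonymous page. */
#define TOY_MAPPING_ANON ((uintptr_t)1)

struct tagged_page {
        void *mapping;  /* address_space pointer, or a tagged anonymous one */
};

/* Return the address_space of a file-backed page, NULL for an anonymous one. */
static struct toy_address_space *toy_page_mapping(const struct tagged_page *page)
{
        uintptr_t raw = (uintptr_t)page->mapping;

        if (raw & TOY_MAPPING_ANON)
                return NULL;
        return (struct toy_address_space *)raw;
}
```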

Page Faulting

  • A page fault occurs when a virtual memory address is accessed and any of the following holds: the address is not mapped at all; it is mapped but no physical page is actually present (i.e. pte_present() returns 0 on the mapping's PTE); a write is attempted on a read-only mapping (i.e. pte_write() returns 0 on the mapping's PTE); or userland tries to access a kernel mapping.

  • This isn't necessarily an error; in fact modern operating systems (including Linux, naturally) use page faulting for a number of different mechanisms:

  1. Swap - 'Swapped out' memory is data that has been written to the disk so that RAM can be used for other purposes. The memory is marked non-present, then when a fault occurs the data can be read from disk and placed in memory for the referencing process to access.

  2. Demand paging - Under demand paging, physical pages are not actually mapped until they are used. This is far more efficient than fully allocating memory as soon as it is requested: only when the memory is first used does a page fault occur, at which point the kernel allocates the page.

  3. Copy-on-write semantics - Since a page fault occurs when a write is attempted on a read-only page, faults can be used to significantly improve the efficiency of a fork. Rather than copying the allocated pages of the parent process, the pages can be marked read-only in both parent and child, and as soon as one of them writes to a page the resulting fault triggers a copy of that page. Since most of a process's pages are only ever read, this hugely increases the efficiency of forks.
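
The faults themselves are invisible to a userland program, but the behaviour copy-on-write has to preserve is easy to observe: after a fork, a write in the child must not change what the parent sees. The following is a small sketch using standard POSIX calls; the function name is made up for illustration, and whether the underlying page really is shared until the write is a kernel detail the program cannot see.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork, let the child overwrite a value in private memory, and return
 * what the parent sees afterwards. Returns -1 if allocation, fork() or
 * waitpid() fails.
 */
static int parent_value_after_child_write(void)
{
        int *value = malloc(sizeof(*value));
        int seen;
        pid_t pid;

        if (value == NULL)
                return -1;
        *value = 1;

        pid = fork();
        if (pid < 0) {
                free(value);
                return -1;
        }
        if (pid == 0) {
                *value = 2;     /* the child writes to its own view */
                _exit(*value == 2 ? 0 : 1);
        }

        if (waitpid(pid, NULL, 0) < 0) {
                free(value);
                return -1;
        }
        seen = *value;          /* the parent's view is unchanged */
        free(value);
        return seen;
}
```

The parent still reads back the value it stored before the fork, even though the child wrote to the same virtual address.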

Handling Page Faults

  • Under x86, the kernel can specify an interrupt handler for page faults which is invoked by the CPU when one occurs. When the handler function is invoked, an error code describing the cause of the fault is pushed to the stack and the cr2 control register is set to the linear address that caused the fault.

  • The Linux page fault handler is do_page_fault(). It retrieves the error code and address before handing off the heavy lifting to __do_page_fault().

  • __do_page_fault() performs a number of checks to handle the error cases, before handing off to the non-arch specific handle_mm_fault() if the fault looks valid. Diagrammatically (simplifying the process somewhat, ignoring kmemcheck and kprobes, some special case fault support, and vsyscall emulation - better to look at the code for those cases):

                                       -------------------
                                       |   CPU invokes   |
                                       | do_page_fault() |
                                       -------------------
                                                |
                                                v
                                    ------------------------
                                    | Retrieve address and |
                                    | error code then call |
                                    |   __do_page_fault()  |
                                    ------------------------
                                                |
                                                v
                                   /------------------------\
                                  /  Did we fault in kernel  \---\
                                  \          space?          /   | Yes
                                   \------------------------/    |
                                      | No                       v
                                      v                 /-----------------\ Yes
   ---------               Yes /-------------\         /  vmalloc_fault()  \----\
   | OOPS! |<-----------------/ Reserved bits \        \      succeed?     /    |
   ---------                  \   modified?   /         \-----------------/     |
                               \-------------/                   | No           |
                                      | No                       v              |
                                      v                    /-----------\ Yes    |
                          Yes /--------------\            /  Was fault  \-------\
   /-------------------------/      SMAP      \           \  spurious?  /       |
   |                         \   violation?   /            \-----------/        |
   |                          \--------------/                   | No           |
   |                                  | No                       |              |
   /----------------------------------(--------------------------/              |
   |                                  v                                         |
   |                  Yes /-----------------------\                             |
   /---------------------/  Fault handler disabled \                            |
   |                     \    or !current->mm?     /                            |
   |                      \-----------------------/                             |
   |                                  | No                                      |
   |                                  v                                         |
   |                         --------------------                               |
   |                         | Find nearest VMA |<--------\                     |
   |                         --------------------         |                     |
   |                                  |                   |                     |
   |                                  v                   |                     |
   |  No /-------------\  No /----------------\           |                     |
   /----/  Is region a  \<--/     Does VMA     \          | No                  |
   |    \     stack?    /   \ contain address? /    /------------\ Yes          |
   |     \-------------/     \----------------/    / Fatal signal \--\          |
   |            | Yes                 | Yes        \   pending?   /  |          |
   |            v                     |             \------------/   |          |
   | Yes /--------------\             |                   ^          |          |
   /----/  Is address <  \            |                   |          |          |
   |    \ stack pointer? /            |                   |          |          |
   |     \--------------/             |                   |          |          |
   |            | No                  |                   |          |          |
   |            v                     v             --------------   |          |
   | No /---------------\ Yes   /-----------\       | Mark retry |   |          |
   /---/  expand_stack() \---> / Permissions \      | disallowed |   |          |
   |   \    succeed?     /     \     OK?     /      --------------   |          |
   |    \---------------/       \-----------/             ^          |          |
   |                         No   |   | Yes               |          |          |
   /------------------------------/   |                   |          |          |
   |                                  v                   |          |          |
   |                        ---------------------         | Yes      |          |
   |                        |        Call       |     /--------\ No  |          |
   |                        | handle_mm_fault() |    /  Retry   \----\          |
   |                        ---------------------    \ allowed? /    |          |
   |                                  |               \--------/     |          |
   |                                  v                   ^          |          |
   |                          /---------------\  Yes      |          v          |
   |                         / Return Value &  \----------/     /---------\ Yes |
   |                         \ VM_FAULT_RETRY? /               /   From    \----\
   |                          \---------------/                \ userland? /    |
   |                                  | No                      \---------/     |
   |                                  v                              | No       |
   |  ------------------  Yes /---------------\                      |          |
   |  | mm_fault_error |<----/ Return value &  \                     |          |
   |  ------------------     \ VM_FAULT_ERROR? /                     |          |
   |           |              \---------------/                      |          |
   |           |                      | No                           |          |
   |           |                      v                              |          |
   |           |              ******************                     |          |
   |           |              * FAULT COMPLETE *<--------------------)----------/
   |           |              ******************                     |
   |           |                                                     |
   |           \---------------\                                     |
   |                           |                                     |
   |                           v                                     |
   |                /--------------------\ Yes                       |
   |               / Fatal signal pending \--------------------------\
   |               \   in kernel mode?    /                          |
   |                \--------------------/                           |
   |                           | No                                  |
   |                           v                                     |
   |                   /--------------\ Yes     /---------\ Yes      |
   |                  /     fault &    \------>/ In kernel \---------\
   |                  \  VM_FAULT_OOM? /       \   mode?   /         |
   |                   \--------------/         \---------/          |
   |                           | No                  | No            |
   |                           v                     v               |
   |  ----------  Yes /-----------------\     --------------         |
   |  |  Send  |<----/ fault & VM_FAULT_ \    | Invoke OOM |         |
   |  | SIGBUS |     \ SIGBUS/HWPOISON*? /    |   killer   |         |
   |  ----------      \-----------------/     --------------         |
   |                           | No                                  |
   |                           v                                     |
   |                  /-----------------\ No  ----------             |
   |                 /      fault &      \--->| BUG()! |             |
   |                 \  VM_FAULT_SIGSEGV /    ----------             |
   |                  \-----------------/                            |
   |                           | Yes                                 |
   |                           v                                     v
   |             -----------------------------               -----------------
   \------------>|  __bad_area_nosemaphore() |       /------>|  no_context() |
                 -----------------------------       |       -----------------
                               |                     |               |
                               v                     |               v
                  /------------------------\ Yes     |     /------------------\ No
                 /  Did we fault in kernel  \--------/    / Is there a kernel  \--\
                 \          space?          /             \ exception handler? /  |
                  \------------------------/               \------------------/   |
                               | No                                  | Yes        |
                               v                                     v            |
                    /-------------------\ Yes               ------------------    |
                   / Attempted access of \-----------\      | Call exception |    |
                   \    kernel memory?   /           |      |     handler    |    |
                    \-------------------/            |      ------------------    |
                               | No                  |                            |
                               v                     v                            |
                       ----------------     -------------------   ---------       |
                       | Send SIGSEGV |<--- | Mark protection |   | OOPS! |<------/
                       ----------------     |      fault      |   ---------
                                            -------------------
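
The lower part of the flowchart, from the bad-area box down to SIGSEGV, the exception handler or an oops, can be sketched as a small pure function. This is only a sketch under stated assumptions: `resolve_bad_area()`, `enum bad_area_outcome` and `USER_SPACE_END` are hypothetical names, the boundary is taken from the address space diagram at the top of this document, and the bit values are those of the x86 page fault error code. The real kernel code does considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Error code bits used by this part of the flowchart. */
#define PF_PROT (1UL << 0)  /* protection fault */
#define PF_USER (1UL << 2)  /* access was made from user mode */

/* Assumption: first address above userland, per the layout diagram. */
#define USER_SPACE_END 0x0000800000000000ULL

/* Hypothetical summary of where the flowchart ends up. */
enum bad_area_outcome { SEND_SIGSEGV, CALL_EXCEPTION_HANDLER, OOPS };

/*
 * Hypothetical model of the bad-area path.
 * has_exception_handler stands in for "is there a kernel exception handler?".
 */
static enum bad_area_outcome resolve_bad_area(unsigned long *error_code,
                                              unsigned long long address,
                                              bool has_exception_handler)
{
        assert(error_code != NULL);

        /* Kernel-mode fault: the no_context() branch. */
        if (!(*error_code & PF_USER))
                return has_exception_handler ? CALL_EXCEPTION_HANDLER : OOPS;

        /* User-mode access to a kernel address: mark it a protection fault. */
        if (address >= USER_SPACE_END)
                *error_code |= PF_PROT;

        return SEND_SIGSEGV;
}
```

For example, a user-mode access to `ffff880000000000` ends in `SEND_SIGSEGV` with `PF_PROT` set in the error code, while a kernel-mode fault with no exception handler ends in `OOPS`.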

Kernel faults

  • Page faults in the kernel are not permitted, except for vmalloc memory (discussed in a later section), which is used to provide virtually contiguous memory.

Page Fault Error Code

  • The error code supplied by x86 is a bitfield whose bits are defined by enum x86_pf_error_code and describe the cause of the page fault:
/*
 * Page fault error code bits:
 *
 *   bit 0 ==    0: no page found       1: protection fault
 *   bit 1 ==    0: read access         1: write access
 *   bit 2 ==    0: kernel-mode access  1: user-mode access
 *   bit 3 ==                           1: use of reserved bit detected
 *   bit 4 ==                           1: fault was an instruction fetch
 *   bit 5 ==                           1: protection keys block access
 */
enum x86_pf_error_code {

        PF_PROT         =               1 << 0,
        PF_WRITE        =               1 << 1,
        PF_USER         =               1 << 2,
        PF_RSVD         =               1 << 3,
        PF_INSTR        =               1 << 4,
        PF_PK           =               1 << 5,
};
  • __do_page_fault() and the functions it invokes use this error code to handle each case correctly. In particular, PF_USER tells the code whether the access was made from userland or kernel mode, and since the faulting address is also known, the code can determine whether it is a kernel address or a userland one.
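
To make the bitfield concrete, here is a minimal userspace sketch of decoding an error code. The enum values are copied from the listing above; the helper names (`pf_from_user_mode()`, `pf_is_kernel_address()`, `pf_describe()`) and `USER_SPACE_END` are hypothetical and exist only for illustration, with the boundary taken from the address space diagram at the top of this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Bit values copied from the enum in the text. */
enum x86_pf_error_code {
        PF_PROT  = 1 << 0,
        PF_WRITE = 1 << 1,
        PF_USER  = 1 << 2,
        PF_RSVD  = 1 << 3,
        PF_INSTR = 1 << 4,
        PF_PK    = 1 << 5,
};

/* Assumption: first address above userland, per the layout diagram. */
#define USER_SPACE_END 0x0000800000000000ULL

/* Hypothetical helper: was the access made from user mode? */
static bool pf_from_user_mode(unsigned long error_code)
{
        return (error_code & PF_USER) != 0;
}

/* Hypothetical helper: is the faulting address outside userland? */
static bool pf_is_kernel_address(unsigned long long address)
{
        return address >= USER_SPACE_END;
}

/* Hypothetical helper: write a readable summary of the error code to buf. */
static int pf_describe(unsigned long error_code, char *buf, size_t len)
{
        assert(buf != NULL);
        return snprintf(buf, len, "%s, %s access, %s-mode%s%s%s",
                        (error_code & PF_PROT)  ? "protection fault" : "no page found",
                        (error_code & PF_WRITE) ? "write" : "read",
                        (error_code & PF_USER)  ? "user" : "kernel",
                        (error_code & PF_RSVD)  ? ", reserved bit set" : "",
                        (error_code & PF_INSTR) ? ", instruction fetch" : "",
                        (error_code & PF_PK)    ? ", protection key" : "");
}
```

Calling `pf_describe(PF_WRITE | PF_USER, buf, sizeof(buf))` fills `buf` with a summary of a user-mode write to a non-present page, and combining `pf_from_user_mode()` with `pf_is_kernel_address()` gives the mode/address distinction described above.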

  • Page faults are divided into 3 types: minor, major and error. The latter two are represented by VM_FAULT_MAJOR and VM_FAULT_ERROR respectively (VM_FAULT_ERROR is a bitmask of error states).
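
The three-way split can be sketched as a classification of the fault handler's return value. The flag values below are illustrative assumptions modelled on 4.x-era kernels and may differ between versions; `enum fault_kind` and `classify_fault()` are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>

/* Illustrative flag values (assumption: 4.x-era kernels, may vary). */
#define VM_FAULT_OOM            0x0001
#define VM_FAULT_SIGBUS         0x0002
#define VM_FAULT_MAJOR          0x0004
#define VM_FAULT_HWPOISON       0x0010
#define VM_FAULT_HWPOISON_LARGE 0x0020
#define VM_FAULT_SIGSEGV        0x0040

/* VM_FAULT_ERROR is a mask covering every error state. */
#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | \
                        VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE)

/* A major fault is not itself an error state. */
static_assert((VM_FAULT_ERROR & VM_FAULT_MAJOR) == 0,
              "VM_FAULT_MAJOR must not overlap VM_FAULT_ERROR");

/* Hypothetical enum naming the three fault types from the text. */
enum fault_kind { FAULT_MINOR, FAULT_MAJOR, FAULT_ERROR };

/* Hypothetical helper: classify a fault handler return value. */
static enum fault_kind classify_fault(unsigned int ret)
{
        if (ret & VM_FAULT_ERROR)
                return FAULT_ERROR;     /* any error bit wins */
        if (ret & VM_FAULT_MAJOR)
                return FAULT_MAJOR;
        return FAULT_MINOR;             /* neither flag set */
}
```

Checking the error mask first mirrors the flowchart, where the return value is tested against VM_FAULT_ERROR before the fault is considered complete.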