This repository has been archived by the owner on Dec 3, 2022. It is now read-only.

Process Address Space

64-bit Address Space

Userland (128TiB)

                        0000000000000000 -> |---------------| ^
                                            |    Process    | |
                                            |    address    | | 128 TiB
                                            |     space     | |
                        0000800000000000 -> |---------------| v

             .        ` .     -                 `-       ./   _
                      _    .`   -   The netherworld of  `/   `
            -     `  _        |  /      unavailable sign-extended -/ .
             ` -        .   `  48-bit address space  -     \  /    -
           \-                - . . . .             \      /       -

Kernel (128TiB)

                        ffff800000000000 -> |----------------| ^
                                            |   Hypervisor   | |
                                            |    reserved    | | 8 TiB
                                            |      space     | |
        __PAGE_OFFSET = ffff880000000000 -> |----------------| x
                                            | Direct mapping | |
                                            |  of all phys.  | | 64 TiB
                                            |     memory     | |
                        ffffc80000000000 -> |----------------| v
                                            /                /
                                            \      hole      \
                                            /                /
        VMALLOC_START = ffffc90000000000 -> |----------------| ^
                                            |    vmalloc/    | |
                                            |    ioremap     | | 32 TiB
                                            |     space      | |
      VMALLOC_END + 1 = ffffe90000000000 -> |----------------| v
                                            /                /
                                            \      hole      \
                                            /                /
        VMEMMAP_START = ffffea0000000000 -> |----------------| ^
                                            |     Virtual    | |
                                            |   memory map   | | 1 TiB
                                            |  (struct page  | |
                                            |     array)     | |
                        ffffeb0000000000 -> |----------------| v
                                            /                /
                                            \    'unused'    \
                                            /      hole      /
                                            \                \
                        ffffec0000000000 -> |----------------| ^
                                            |  Kasan shadow  | | 16 TiB
                                            |     memory     | |
                        fffffc0000000000 -> |----------------| v
                                            /                /
                                            \    'unused'    \
                                            /      hole      /
                                            \                \
     ESPFIX_BASE_ADDR = ffffff0000000000 -> |----------------| ^
                                            |   %esp fixup   | | 512 GiB
                                            |     stacks     | |
                        ffffff8000000000 -> |----------------| v
                                            /                /
                                            \    'unused'    \
                                            /      hole      /
                                            \                \
           EFI_VA_END = ffffffef00000000 -> |----------------| ^
                                            |   EFI region   | | 64 GiB
                                            | mapping space  | |
         EFI_VA_START = ffffffff00000000 -> |----------------| v
                                            /                /
                                            \    'unused'    \
                                            /      hole      /
                                            \                \
   __START_KERNEL_map = ffffffff80000000 -> |----------------| ^
                                            |  Kernel text   | | 512 MiB
                                            |    mapping     | |
        MODULES_VADDR = ffffffffa0000000 -> |----------------| x
                                            |     Module     | |
                                            |    mapping     | | 1.5 GiB
                                            |     space      | |
                        ffffffffff600000 -> |----------------| x
                                            |   vsyscalls    | | 8 MiB
                        ffffffffffe00000 -> |----------------| v
                                            /                /
                                            \    'unused'    \
                                            /      hole      /
                                            \                \
                                            ------------------
  • In current x86-64 implementations only the lower 48 bits of an address are used - the remaining higher-order bits must all be equal to the 48th bit (bit 47, counting from zero), i.e. addresses must be 'canonical'. The allowable addresses are therefore 128TiB (47 bits) in the 0000000000000000 - 00007fffffffffff range, and 128TiB (47 bits) in the ffff800000000000 - ffffffffffffffff range - the address space is divided into lower and upper portions.

  • In fact, reading the x86-64 memory map doc, only 46 bits (64TiB) of RAM are supported - this makes sense, as it allows for the entire physical address space to be mapped into the kernel while leaving another 64TiB free for other kernel data.

  • In Linux, like most (if not all?) other operating systems, this provides a nice separation between kernel and user address space for free - keep kernel addresses in the upper portion and user addresses in the lower portion.
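
  • The sign-extension rule above can be sketched as a small userspace check. This is only an illustration, not kernel code: VA_BITS, is_canonical() and is_kernel_half() are names made up for this sketch, and 48 implemented address bits are assumed as described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed number of implemented virtual address bits (hypothetical name). */
#define VA_BITS 48

/*
 * An address is canonical when bits 63..47 are all equal, i.e. the top
 * 17 bits are either all clear or all set.
 */
static bool is_canonical(uint64_t addr)
{
        uint64_t top = addr >> (VA_BITS - 1);   /* bits 63..47 */

        return top == 0 || top == 0x1ffff;
}

/* Canonical addresses with bit 47 set lie in the upper (kernel) portion. */
static bool is_kernel_half(uint64_t addr)
{
        return (addr >> (VA_BITS - 1)) == 0x1ffff;
}
```

  • For example, is_canonical(0x00007fffffffffff) holds, while 0000800000000000 (the first address of the 'netherworld' in the diagram) is rejected.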

Kernel Address Translation

  • We can see there's a gap reserved for the hypervisor between ffff800000000000 and ffff87ffffffffff, immediately after which the entire 64TiB physical address space is mapped.

  • This allows us to have a simple means of translating from a physical address to a virtual one within the kernel - simply offset the physical one by ffff880000000000. This constant is defined as PAGE_OFFSET (an alias for __PAGE_OFFSET.)

  • More generally to translate from virtual to physical addresses and vice-versa in the kernel, two functions are available - phys_to_virt() (a wrapper around __va()) and virt_to_phys() (a wrapper around __pa().)

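  • Given that we have PAGE_OFFSET, __va() is simple, while __pa() has to distinguish between direct mapping addresses and kernel text mapping addresses: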

#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
#define __pa(x) __phys_addr((unsigned long)(x))

...

#define __phys_addr(x) __phys_addr_nodebug(x)

...

static inline unsigned long __phys_addr_nodebug(unsigned long x)
{
        unsigned long y = x - __START_KERNEL_map;

        /* use the carry flag to determine if x was < __START_KERNEL_map */
        x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));

        return x;
}
  • What is __START_KERNEL_map (which is defined as ffffffff80000000)? We can see from the memory map that this is the virtual address above which the kernel is loaded:
...
ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
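
  • To make the arithmetic concrete, here is a userspace model of the two translations. It is a sketch only: the model_* and MODEL_* names are invented for illustration, and MODEL_PHYS_BASE is assumed to be 0 (the kernel text mapped 'from phys 0' as in the listing above), whereas the real phys_base is set at boot.

```c
#include <stdint.h>

/* Constants as quoted in the text above. */
#define MODEL_PAGE_OFFSET       0xffff880000000000ULL
#define MODEL_START_KERNEL_MAP  0xffffffff80000000ULL
/* Assumption for this sketch: kernel text is mapped from physical 0. */
#define MODEL_PHYS_BASE         0x0ULL

/* Model of __va(): physical address -> direct mapping virtual address. */
static uint64_t model_va(uint64_t phys)
{
        return phys + MODEL_PAGE_OFFSET;
}

/* Model of __phys_addr_nodebug(): virtual address -> physical address. */
static uint64_t model_pa(uint64_t x)
{
        uint64_t y = x - MODEL_START_KERNEL_MAP;

        /*
         * If x >= __START_KERNEL_map the subtraction does not wrap, so
         * x > y and x is a kernel text address: add the physical base.
         * Otherwise the subtraction wrapped, x is a direct mapping
         * address, and the sum below reduces to x - PAGE_OFFSET.
         */
        return y + ((x > y) ? MODEL_PHYS_BASE
                            : (MODEL_START_KERNEL_MAP - MODEL_PAGE_OFFSET));
}
```

  • Under these assumptions model_pa(0xffff880000001000) gives 0x1000 (direct mapping), and model_pa(0xffffffff81000000) gives 0x1000000 (kernel text mapping.)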

Memory Descriptors

  • Each process's memory state is represented by a 'memory descriptor', struct mm_struct, which specifies the process's PGD (see page tables), amongst a number of other fields.

  • Looking at the struct mm_struct, assuming x86-64 with a pretty standard configuration in order to strip irrelevant CONFIG_xxx fields:

struct mm_struct {
        struct vm_area_struct *mmap;            /* list of VMAs */
        struct rb_root mm_rb;
        u32 vmacache_seqnum;                   /* per-thread vmacache */

        unsigned long (*get_unmapped_area) (struct file *filp,
                                unsigned long addr, unsigned long len,
                                unsigned long pgoff, unsigned long flags);

        unsigned long mmap_base;                /* base of mmap area */
        unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
        unsigned long task_size;                /* size of task vm space */
        unsigned long highest_vm_end;           /* highest vma end address */
        pgd_t * pgd;
        atomic_t mm_users;                      /* How many users with user space? */
        atomic_t mm_count;                      /* How many references to "struct mm_struct" (users count as 1) */
        atomic_long_t nr_ptes;                  /* PTE page table pages */
        atomic_long_t nr_pmds;                  /* PMD page table pages */

        int map_count;                          /* number of VMAs */

        spinlock_t page_table_lock;             /* Protects page tables and some counters */
        struct rw_semaphore mmap_sem;

        struct list_head mmlist;                /* List of maybe swapped mm's.  These are globally strung
                                                 * together off init_mm.mmlist, and are protected
                                                 * by mmlist_lock
                                                 */

        unsigned long hiwater_rss;      /* High-watermark of RSS usage */
        unsigned long hiwater_vm;       /* High-water virtual memory usage */

        unsigned long total_vm;         /* Total pages mapped */
        unsigned long locked_vm;        /* Pages that have PG_mlocked set */
        unsigned long pinned_vm;        /* Refcount permanently increased */
        unsigned long data_vm;          /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
        unsigned long exec_vm;          /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
        unsigned long stack_vm;         /* VM_STACK */
        unsigned long def_flags;
        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack;
        unsigned long arg_start, arg_end, env_start, env_end;

        unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

        /*
         * Special counters, in some configurations protected by the
         * page_table_lock, in other configurations by being atomic.
         */
        struct mm_rss_stat rss_stat;

        struct linux_binfmt *binfmt;

        cpumask_var_t cpu_vm_mask_var;

        /* Architecture-specific MM context */
        mm_context_t context;

        unsigned long flags; /* Must use atomic bitops to access the bits */

        struct core_state *core_state; /* coredumping support */

        spinlock_t                      ioctx_lock;
        struct kioctx_table __rcu       *ioctx_table;

        /*
         * "owner" points to a task that is regarded as the canonical
         * user/owner of this mm. All of the following must be true in
         * order for it to be changed:
         *
         * current == mm->owner
         * current->mm != mm
         * new_owner->mm == mm
         * new_owner->alloc_lock is held
         */
        struct task_struct __rcu *owner;

        /* store ref to file /proc/<pid>/exe symlink points to */
        struct file __rcu *exe_file;

        struct mmu_notifier_mm *mmu_notifier_mm;

        /*
         * An operation with batched TLB flushing is going on. Anything that
         * can move process memory needs to flush the TLB when moving a
         * PROT_NONE or PROT_NUMA mapped page.
         */
        bool tlb_flush_pending;

        struct uprobes_state uprobes_state;

        atomic_long_t hugetlb_usage;
};

Fields

  • struct vm_area_struct *mmap - The first VMA in a linked-list of all the memory descriptor's VMAs (see below diagram for more details.)

  • struct rb_root mm_rb - Root of the red-black tree containing VMAs for fast lookup.

  • u32 vmacache_seqnum - The VMA cache uses sequence numbers stored in both the struct mm_struct and the struct task_struct to ensure that cached VMAs have not been invalidated for the running threads - if the current task's sequence number does not match the memory descriptor's, then the cache is considered invalid and flushed. Changes to the address space increment the sequence number in the struct mm_struct, and thus trigger this cache invalidation.

  • unsigned long (*get_unmapped_area)(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) - The function used to find an unmapped area with the specified parameters within the address space controlled by the descriptor. The actual function used is assigned by arch_pick_mmap_layout() - arch_get_unmapped_area_topdown() is used if there is no reason to avoid assigning the top-most available address (as decided by mmap_is_legacy()), otherwise an unmapped area is determined bottom-up via arch_get_unmapped_area(). The bottom-up search is also used as a fallback if the top-down lookup fails.

  • unsigned long mmap_legacy_base - The minimum address from which bottom-up mmap allocations are assigned, determined in arch_pick_mmap_layout(). It consists of TASK_UNMAPPED_BASE plus a random ASLR offset obtained via arch_mmap_rnd() (if randomisation is enabled for this descriptor.) TASK_UNMAPPED_BASE is set at (page-aligned) 1/3 of the maximum userland address. All bottom-up mmap allocations begin from this base.

  • unsigned long mmap_base - In legacy mmap mode this is the same as mmap_legacy_base. Otherwise it is set to the (page-aligned) maximum userland address minus an ASLR random factor (as above) and a gap equal to the stack size limit, clamped to fit inside the range MIN_GAP <= gap <= MAX_GAP. In non-legacy mode this base is used as the uppermost limit to assign from. MIN_GAP is at least around 128MiB.

  • unsigned long task_size - Set to the size of user-space address space, TASK_SIZE (128TiB - 1 page on x86-64.)

  • unsigned long highest_vm_end - The highest vm_end of any VMA in the descriptor, i.e. 1 + the maximum address covered by a VMA in the descriptor (it's an exclusive bound.)

  • pgd_t *pgd - The PGD for this process.

  • atomic_t mm_users - The reference count of processes that are using the userspace portion of memory referenced by the descriptor, typically new threads (in Linux, threads are processes that share a memory descriptor), or logic that needs to avoid teardown. It is decremented by mmput(), which, when the count reaches 0, invokes exit_mmap() (and subsequently unmap_vmas()) to free all userspace mappings. Finally mmput() decrements mm_count via mmdrop(). Incrementing this count is simply performed via atomic_inc().

  • atomic_t mm_count - The ultimate reference count for the descriptor, with all of the mm_users counting for 1 here. The key reason there's a separate count from mm_users is that kernel threads 'borrow' the memory descriptor of a userland process (see lazy TLB in the page tables section), and don't care whether userland has been torn down once mm_users has reached 0, but obviously need to keep the descriptor around for the kernel mappings. mm_count is decremented via mmdrop(), and as with mm_users it is simply incremented via atomic_inc().

  • atomic_long_t nr_ptes, nr_pmds - A count of the PTE and PMD page table pages associated with the descriptor.

  • int map_count - The number of VMAs in use in the descriptor.

  • spinlock_t page_table_lock - A general lock for page tables (though split page table locks mean PTEs and PMDs have finer-grained locking.)

  • struct rw_semaphore mmap_sem - Protects the VMA list.

  • struct list_head mmlist - Entry for list which is strung off init_mm, protected by mmlist_lock.

  • unsigned long hiwater_rss - TBD

  • unsigned long hiwater_vm - TBD

  • unsigned long total_vm - The total number of pages used by VMAs in the memory descriptor.

  • unsigned long locked_vm - The number of locked pages referenced by the memory descriptor. These pages are never swapped out.

  • unsigned long pinned_vm - The number of 'pinned' pages referenced by the memory descriptor. These are pages which have had their refcount incremented such that they are never moved in physical or virtual memory, or swapped out. 'Pinned' isn't a formal definition but rather a convention of raising the refcount permanently (see this discussion for possible future options.) This counter was introduced to prevent double-counting.

  • unsigned long data_vm - TBD

  • unsigned long exec_vm - TBD

  • unsigned long stack_vm - TBD

  • unsigned long def_flags - A bit field which can contain only the VM_LOCKED and VM_LOCKONFAULT flags. If one or both are set, then all new VMAs will default to having these flags set. The former ensures mappings are not evictable (i.e. never swapped out) and so are by necessity pre-faulted in, the latter allows the pages to be faulted in as normal, but locks them once they are.

  • unsigned long start_code, end_code, start_data, end_data - The start address and exclusive end of the code and data sections of the process, i.e. start_* is the address of the first byte of these sections, and end_* is 1 byte past the last byte of these sections.

  • unsigned long start_brk, brk - The start address and exclusive end, brk, of the heap.

  • unsigned long start_stack - The start address of the stack.

  • unsigned long arg_start, arg_end, env_start, env_end - The start address and exclusive end of the process's command-line arguments and environment variables.

  • unsigned long saved_auxv[AT_VECTOR_SIZE] - TBD

  • struct mm_rss_stat rss_stat - TBD

  • struct linux_binfmt *binfmt - TBD

  • cpumask_var_t cpu_vm_mask_var - TBD

  • mm_context_t context - Architecture-specific MMU context. x86-64 doesn't need to store MMU context data here, so the field is instead used to store various x86-64 specific data - mm_context_t contains the VDSO, x86-64 RDPMC performance counters, a flag indicating whether 32-bit emulation mode is available, and a per-process local descriptor table for running 16-bit segmented code in e.g. DOSBox or Wine.

  • unsigned long flags - TBD

  • struct core_state *core_state - TBD

  • spinlock_t ioctx_lock (only if CONFIG_AIO) - TBD

  • struct kioctx_table *ioctx_table (only if CONFIG_AIO) - TBD

  • struct task_struct *owner (only if CONFIG_MEMCG) - The canonical owner of this memory descriptor.

  • struct file *exe_file - A reference to the file that started this process, for use in the /proc/<pid>/exe symlink.

  • struct mmu_notifier_mm *mmu_notifier_mm (only if CONFIG_MMU_NOTIFIER) - TBD

  • bool tlb_flush_pending - TBD

  • struct uprobes_state uprobes_state - TBD

  • atomic_long_t hugetlb_usage (only if CONFIG_HUGETLB_PAGE) - TBD
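
  • The mmap_legacy_base and mmap_base descriptions above boil down to a little arithmetic, sketched below as a userspace model. All model_* and MODEL_* names are invented for this sketch. MODEL_MIN_GAP is simplified to exactly 128MiB, and MODEL_MAX_GAP (5/6 of the task size) is an assumption about the kernel's value rather than something stated in these notes. The rnd parameter stands in for the ASLR offset from arch_mmap_rnd().

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL
/* 128TiB - 1 page, as given for task_size above. */
#define MODEL_TASK_SIZE ((1ULL << 47) - MODEL_PAGE_SIZE)
/* Simplified: the text only says MIN_GAP is 'at least around 128MiB'. */
#define MODEL_MIN_GAP   (128ULL << 20)
/* Assumed value for this sketch. */
#define MODEL_MAX_GAP   (MODEL_TASK_SIZE / 6 * 5)

/* Round x up to the next page boundary. */
static uint64_t model_page_align(uint64_t x)
{
        return (x + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
}

/* Bottom-up base: page-aligned 1/3 of the task size plus the ASLR offset. */
static uint64_t model_mmap_legacy_base(uint64_t rnd)
{
        return model_page_align(MODEL_TASK_SIZE / 3) + rnd;
}

/* Top-down base: task size minus the clamped stack gap and ASLR offset. */
static uint64_t model_mmap_base(uint64_t stack_limit, uint64_t rnd)
{
        uint64_t gap = stack_limit;

        if (gap < MODEL_MIN_GAP)
                gap = MODEL_MIN_GAP;
        else if (gap > MODEL_MAX_GAP)
                gap = MODEL_MAX_GAP;

        return model_page_align(MODEL_TASK_SIZE - gap - rnd);
}
```

  • With an 8MiB stack limit and no randomisation the gap is raised to the 128MiB minimum, so model_mmap_base(8 << 20, 0) yields 7ffff7fff000, while model_mmap_legacy_base(0) yields 2aaaaaaab000.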

Initialisation

Freeing

  • A struct mm_struct is ultimately freed via free_mm(), which returns the object to the slab allocator.

  • free_mm() is invoked either when mm_init() fails or in __mmdrop() which is called by mmdrop() when the descriptor's reference count, mm_count, is reduced to zero.
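
  • The interplay between mm_users and mm_count described above can be modelled in userspace with C11 atomics. This is a heavily simplified sketch, not the kernel implementation: struct model_mm and the model_* functions are hypothetical, and the two bool fields merely record where exit_mmap() and free_mm() would be invoked.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mm_struct. */
struct model_mm {
        atomic_int mm_users;            /* users of the userspace mappings */
        atomic_int mm_count;            /* references to the descriptor */
        bool user_mappings_torn_down;   /* where exit_mmap() would run */
        bool freed;                     /* where free_mm() would run */
};

static void model_mm_init(struct model_mm *mm)
{
        atomic_init(&mm->mm_users, 1);
        /* All mm_users together account for a single mm_count reference. */
        atomic_init(&mm->mm_count, 1);
        mm->user_mappings_torn_down = false;
        mm->freed = false;
}

/* Model of mmdrop(): free the descriptor when the last reference goes. */
static void model_mmdrop(struct model_mm *mm)
{
        /* atomic_fetch_sub() returns the previous value. */
        if (atomic_fetch_sub(&mm->mm_count, 1) == 1)
                mm->freed = true;
}

/* Model of mmput(): tear down userspace when the last user goes. */
static void model_mmput(struct model_mm *mm)
{
        if (atomic_fetch_sub(&mm->mm_users, 1) == 1) {
                mm->user_mappings_torn_down = true;
                model_mmdrop(mm);       /* drop the reference held by the users */
        }
}
```

  • If something takes an extra mm_count reference (as a kernel thread borrowing the descriptor would) before the last model_mmput(), the userspace mappings are torn down but the descriptor itself survives until the matching model_mmdrop().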

Virtual Memory Areas

     |------------------|      |------------------|      |------------------|
     | struct mm_struct | .... | struct mm_struct | .... | struct mm_struct |
     |------------------|      |------------------|      |------------------|
                                  mmap | ^
                                       | |
                 /---------------------/ \-----------------------\
                 |                                               |
                 v                                               | vm_mm
     |-----------------------|       vm_next         |-----------------------| vm_next
     |                       | --------------------> |                       | - - ->
     | struct vm_area_struct |       vm_prev         | struct vm_area_struct | vm_prev
     |                       | <-------------------- |                       | <- - -
     |-----------------------|                       |-----------------------|
                 ^    | vm_file                                  | vm_ops
                 |    |                                          |
                 |    |                                          v
                 |    \-----------------\        |-----------------------------|
                 |                      |        | struct vm_operations_struct |
                 |                      v        |-----------------------------|
                 |               |-------------|
                 |               | struct file |
    Other VMAs   |               |-------------|
          .      |            f_mapping |
           .     |                      |
            .    |                      v
             \   |           |----------------------|
              \  |           | struct address_space |
               \ |           |----------------------|
                \|         i_mmap |     ^      | a_ops
(red-black tree) \----------------/     |      \---------------\
                                        |                      |
                                        |                      v
                                        |     |----------------------------------|
                                        |     |  struct address_space_operations |
                                        |     |----------------------------------|
                                        |
                       -  - - --/-------X-------\-- - -  -
                                |               |
                        mapping |               | mapping
                         |-------------| |-------------|
                         | struct page | | struct page |
                         |-------------| |-------------|
  • The above diagram shows the relationship between the various process address space structures (far from exhaustively), which we'll get into below:

  • Each process's virtual memory is additionally divided into non-overlapping regions (Virtual Memory Areas or 'VMA's) related by their purpose and protection state.

  • This division exists because a process's virtual address space is necessarily sparse - only portions of the space are allocated at any one time, so it makes sense to track these regions.

  • VMAs are represented by the struct vm_area_struct type.

  • A struct mm_struct's VMAs are stored both as a doubly-linked list and as a red-black tree. The head of the linked list is kept in the mm_struct's mmap field, with the previous/next nodes kept in each struct vm_area_struct's vm_prev and vm_next fields, sorted in address order. The root of the red-black tree is kept in the mm_struct's struct rb_root mm_rb field, with each node kept in the vm_area_struct's struct rb_node vm_rb field.

  • Looking at the struct vm_area_struct (again assuming x86-64 with a pretty standard configuration in order to strip irrelevant CONFIG_xxx fields):

struct vm_area_struct {
        /* The first cache line has the info for VMA tree walking. */

        unsigned long vm_start;         /* Our start address within vm_mm. */
        unsigned long vm_end;           /* The first byte after our end address
                                           within vm_mm. */

        /* linked list of VM areas per task, sorted by address */
        struct vm_area_struct *vm_next, *vm_prev;

        struct rb_node vm_rb;

        /*
         * Largest free memory gap in bytes to the left of this VMA.
         * Either between this VMA and vma->vm_prev, or between one of the
         * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
         * get_unmapped_area find a free area of the right size.
         */
        unsigned long rb_subtree_gap;

        /* Second cache line starts here. */

        struct mm_struct *vm_mm;        /* The address space we belong to. */
        pgprot_t vm_page_prot;          /* Access permissions of this VMA. */
        unsigned long vm_flags;         /* Flags, see mm.h. */

        /*
         * For areas with an address space and backing store,
         * linkage into the address_space->i_mmap interval tree.
         */
        struct {
                struct rb_node rb;
                unsigned long rb_subtree_last;
        } shared;

        /*
         * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
         * list, after a COW of one of the file pages.  A MAP_SHARED vma
         * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
         * or brk vma (with NULL file) can only be in an anon_vma list.
         */
        struct list_head anon_vma_chain; /* Serialized by mmap_sem &
                                          * page_table_lock */
        struct anon_vma *anon_vma;      /* Serialized by page_table_lock */

        /* Function pointers to deal with this struct. */
        const struct vm_operations_struct *vm_ops;

        /* Information about our backing store: */
        unsigned long vm_pgoff;         /* Offset (within vm_file) in PAGE_SIZE
                                           units */
        struct file * vm_file;          /* File we map to (can be NULL). */
        void * vm_private_data;         /* was vm_pte (shared mem) */

        struct mempolicy *vm_policy;    /* NUMA policy for the VMA */

        struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
};
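
  • Since the VMA list is sorted by address and vm_end is an exclusive bound, looking up the VMA for an address can be sketched as a simple list walk. This is an illustrative userspace model with a hypothetical, cut-down struct model_vma. The kernel's own lookup uses the red-black tree and VMA cache for speed rather than a linear walk.

```c
#include <stddef.h>

/* Hypothetical cut-down VMA holding only the fields needed here. */
struct model_vma {
        unsigned long vm_start;         /* first address covered */
        unsigned long vm_end;           /* first address after the VMA */
        struct model_vma *vm_next;      /* next VMA in address order */
};

/*
 * Return the first VMA whose vm_end lies above addr, or NULL if there
 * is none. Note that addr may fall in a hole below the returned VMA.
 */
static struct model_vma *model_find_vma(struct model_vma *head,
                                        unsigned long addr)
{
        for (struct model_vma *vma = head; vma; vma = vma->vm_next)
                if (addr < vma->vm_end)
                        return vma;

        return NULL;
}

/* Does vma actually cover addr? Handles a NULL vma for convenience. */
static int model_vma_contains(const struct model_vma *vma,
                              unsigned long addr)
{
        return vma && vma->vm_start <= addr && addr < vma->vm_end;
}
```

  • Callers in this sketch need both steps: model_find_vma() to locate a candidate, then model_vma_contains() to confirm the address is not in a hole.
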
struct vm_operations_struct {
        void (*open)(struct vm_area_struct * area);
        void (*close)(struct vm_area_struct * area);
        int (*mremap)(struct vm_area_struct * area);
        int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
        int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
                                                pmd_t *, unsigned int flags);
        void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);

        /* notification that a previously read-only page is about to become
         * writable, if an error is returned it will cause a SIGBUS */
        int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

        /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
        int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

        /* called by access_process_vm when get_user_pages() fails, typically
         * for use by special VMAs that can switch between memory and hardware
         */
        int (*access)(struct vm_area_struct *vma, unsigned long addr,
                      void *buf, int len, int write);

        /* Called by the /proc/PID/maps code to ask the vma whether it
         * has a special name.  Returning non-NULL will also cause this
         * vma to be dumped unconditionally. */
        const char *(*name)(struct vm_area_struct *vma);

        /*
         * set_policy() op must add a reference to any non-NULL @new mempolicy
         * to hold the policy upon return.  Caller should pass NULL @new to
         * remove a policy and fall back to surrounding context--i.e. do not
         * install a MPOL_DEFAULT policy, nor the task or system default
         * mempolicy.
         */
        int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);

        /*
         * get_policy() op must add reference [mpol_get()] to any policy at
         * (vma,addr) marked as MPOL_SHARED.  The shared policy infrastructure
         * in mm/mempolicy.c will do this automatically.
         * get_policy() must NOT add a ref if the policy at (vma,addr) is not
         * marked as MPOL_SHARED. vma policies are protected by the mmap_sem.
         * If no [shared/vma] mempolicy exists at the addr, get_policy() op
         * must return NULL--i.e., do not "fallback" to task or system default
         * policy.
         */
        struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
                                        unsigned long addr);
        /*
         * Called by vm_normal_page() for special PTEs to find the
         * page for @addr.  This is useful if the default behavior
         * (using pte_page()) would not find the correct page.
         */
        struct page *(*find_special_page)(struct vm_area_struct *vma,
                                          unsigned long addr);
};
  • Of most note here are fault(), which is called when a page fault occurs, and map_pages(), which maps a specified range of addresses. Both use struct vm_fault to parameterise their operation and to return certain output fields:
struct vm_fault {
        unsigned int flags;             /* FAULT_FLAG_xxx flags */
        gfp_t gfp_mask;                 /* gfp mask to be used for allocations */
        pgoff_t pgoff;                  /* Logical page offset based on vma */
        void __user *virtual_address;   /* Faulting virtual address */

        struct page *cow_page;          /* Handler may choose to COW */
        struct page *page;              /* ->fault handlers should return a
                                         * page here, unless VM_FAULT_NOPAGE
                                         * is set (which is also implied by
                                         * VM_FAULT_ERROR).
                                         */
        /* for ->map_pages() only */
        pgoff_t max_pgoff;              /* map pages for offset from pgoff till
                                         * max_pgoff inclusive */
        pte_t *pte;                     /* pte entry associated with ->pgoff */
};
const struct vm_operations_struct generic_file_vm_ops = {
        .fault          = filemap_fault,
        .map_pages      = filemap_map_pages,
        .page_mkwrite   = filemap_page_mkwrite,
};
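
  • The way an operations table such as generic_file_vm_ops is used can be sketched with a toy userspace model. Everything here is hypothetical (the model_* types, the string-per-page backing store and the MODEL_FAULT_* codes) - it only illustrates a handler being reached through vm_ops, reading the pgoff input field and filling in the page output field.

```c
#include <stddef.h>

#define MODEL_FAULT_OK     0
#define MODEL_FAULT_SIGBUS 1    /* hypothetical error code for this sketch */

struct model_file_vma;

/* Cut-down fault descriptor: one input field and one output field. */
struct model_vm_fault {
        unsigned long pgoff;    /* in: page offset being faulted */
        const char *page;       /* out: set by the handler on success */
};

struct model_vm_ops {
        int (*fault)(struct model_file_vma *vma, struct model_vm_fault *vmf);
};

struct model_file_vma {
        const struct model_vm_ops *vm_ops;
        const char **backing;   /* toy backing store: one string per page */
        unsigned long nr_pages;
};

/* Toy handler: look the page up in the backing store. */
static int model_file_fault(struct model_file_vma *vma,
                            struct model_vm_fault *vmf)
{
        if (vmf->pgoff >= vma->nr_pages)
                return MODEL_FAULT_SIGBUS;

        vmf->page = vma->backing[vmf->pgoff];
        return MODEL_FAULT_OK;
}

static const struct model_vm_ops model_file_vm_ops = {
        .fault = model_file_fault,
};

/* Dispatch through the VMA's operations table, if it has one. */
static int model_handle_fault(struct model_file_vma *vma,
                              struct model_vm_fault *vmf)
{
        if (!vma->vm_ops || !vma->vm_ops->fault)
                return MODEL_FAULT_SIGBUS;

        return vma->vm_ops->fault(vma, vmf);
}
```
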
struct address_space {
        struct inode            *host;          /* owner: inode, block_device */
        struct radix_tree_root  page_tree;      /* radix tree of all pages */
        spinlock_t              tree_lock;      /* and lock protecting it */
        atomic_t                i_mmap_writable;/* count VM_SHARED mappings */
        struct rb_root          i_mmap;         /* tree of private and shared mappings */
        struct rw_semaphore     i_mmap_rwsem;   /* protect tree, count, list */
        /* Protected by tree_lock together with the radix tree */
        unsigned long           nrpages;        /* number of total pages */
        /* number of shadow or DAX exceptional entries */
        unsigned long           nrexceptional;
        pgoff_t                 writeback_index;/* writeback starts here */
        const struct address_space_operations *a_ops;   /* methods */
        unsigned long           flags;          /* error bits/gfp mask */
        spinlock_t              private_lock;   /* for use by the address_space */
        struct list_head        private_list;   /* ditto */
        void                    *private_data;  /* ditto */
};
  • Note the struct rb_root field i_mmap - this provides a red-black tree root listing private and shared VMAs which map the file, including the VMA that references the address space via vm_file->f_mapping.

  • The operations a struct address_space needs to perform are provided via its struct address_space_operations a_ops field, for example reading/writing pages from/to the file, etc.:

struct address_space_operations {
        int (*writepage)(struct page *page, struct writeback_control *wbc);
        int (*readpage)(struct file *, struct page *);

        /* Write back some dirty pages from this mapping. */
        int (*writepages)(struct address_space *, struct writeback_control *);

        /* Set a page dirty.  Return true if this dirtied it */
        int (*set_page_dirty)(struct page *page);

        int (*readpages)(struct file *filp, struct address_space *mapping,
                        struct list_head *pages, unsigned nr_pages);

        int (*write_begin)(struct file *, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned flags,
                                struct page **pagep, void **fsdata);
        int (*write_end)(struct file *, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned copied,
                                struct page *page, void *fsdata);

        /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
        sector_t (*bmap)(struct address_space *, sector_t);
        void (*invalidatepage) (struct page *, unsigned int, unsigned int);
        int (*releasepage) (struct page *, gfp_t);
        void (*freepage)(struct page *);
        ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
        /*
         * migrate the contents of a page to the specified target. If
         * migrate_mode is MIGRATE_ASYNC, it must not block.
         */
        int (*migratepage) (struct address_space *,
                        struct page *, struct page *, enum migrate_mode);
        int (*launder_page) (struct page *);
        int (*is_partially_uptodate) (struct page *, unsigned long,
                                        unsigned long);
        void (*is_dirty_writeback) (struct page *, bool *, bool *);
        int (*error_remove_page)(struct address_space *, struct page *);

        /* swapfile support */
        int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
                                sector_t *span);
        void (*swap_deactivate)(struct file *file);
};
  • Non-anonymously mapped struct pages, which represent the underlying physical pages of memory mapped to a struct address_space, reference it via their mapping field (if and only if the low bit of the mapping field is clear.)
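
  • The low-bit test can be modelled in userland. The following is a minimal sketch only: struct page_model, struct address_space_model and the helper functions are hypothetical stand-ins rather than kernel definitions, and PAGE_MAPPING_ANON simply mirrors the kernel flag of the same name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures; illustration only. */
struct address_space_model { unsigned long nrpages; };
struct page_model { void *mapping; };

/* Low bit of ->mapping set: the pointer is not a struct address_space. */
#define PAGE_MAPPING_ANON 1UL

static bool page_model_is_anon(const struct page_model *page)
{
        return ((uintptr_t)page->mapping & PAGE_MAPPING_ANON) != 0;
}

/* Return the file's address space, or NULL if the low bit is set. */
static struct address_space_model *
page_model_file_mapping(const struct page_model *page)
{
        if (page_model_is_anon(page))
                return NULL;
        return page->mapping;
}

/* Tag a pointer the way an anonymous mapping is tagged (sketch only). */
static void *tag_anon(void *ptr)
{
        /* The trick relies on the pointer being at least 2-byte aligned. */
        assert(((uintptr_t)ptr & PAGE_MAPPING_ANON) == 0);
        return (void *)((uintptr_t)ptr | PAGE_MAPPING_ANON);
}
```

  • The trick works because the pointed-to structures are word-aligned, so the low bit of a genuine pointer is always clear and is free to carry a flag.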

Page Faulting

  • A page fault occurs when a virtual memory address is not mapped at all, when it is mapped but no physical page is present (i.e. pte_present() returns 0 on the mapping's PTE), when a write is attempted on a read-only mapping (i.e. pte_write() returns 0 on the mapping's PTE), or when userland tries to access a kernel mapping.

  • This isn't necessarily an error; in fact modern operating systems (including linux, naturally) use page faulting to implement a number of different mechanisms:

  1. Swap - 'Swapped out' memory is data that has been written to the disk so that RAM can be used for other purposes. The memory is marked non-present, then when a fault occurs the data can be read from disk and placed in memory for the referencing process to access.

  2. Demand paging - Under demand paging, physical pages are not actually mapped until they are used. This is far more efficient than fully allocating memory as soon as it is requested - only when the memory is actually used in some way does a page fault occur, at which point the kernel allocates it.

  3. Copy-on-write semantics - Since a page fault occurs when a write is attempted on a read-only page of memory, this can be used to significantly improve the efficiency of a fork. Rather than copying the allocated pages of the parent process, the pages can be marked read-only in both parent and child, and as soon as one writes to a page the arising fault triggers a copy of that page. Since most of a process's pages are only ever read, this hugely increases the efficiency of forks.
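
  • The visible effect of copy-on-write can be observed from userland. The following is a minimal sketch using standard POSIX/Linux calls (mmap(), fork(), waitpid()); the function name parent_value_after_child_write() is made up for illustration. It shows only the semantics (the parent's data is unaffected by the child's write), not the kernel's internal mechanism:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Store parent_val in a private anonymous page, fork, let the child
 * overwrite it with child_val, then return the value the parent still
 * sees. Returns -1 on failure, so pass non-negative values.
 */
static int parent_value_after_child_write(int parent_val, int child_val)
{
        long page_size = sysconf(_SC_PAGESIZE);
        int status, seen = -1;
        int *page;
        pid_t pid;

        assert(parent_val >= 0 && child_val >= 0);
        if (page_size <= 0)
                return -1;

        page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
                return -1;

        *page = parent_val;     /* first touch: the page is faulted in */

        pid = fork();
        if (pid == 0) {
                *page = child_val;      /* write lands in the child's own copy */
                _exit(*page == child_val ? 0 : 1);
        }

        if (pid > 0 && waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
                seen = *page;   /* the parent's data is unchanged */

        munmap(page, (size_t)page_size);
        return seen;
}
```

  • Calling parent_value_after_child_write(1, 2) returns the parent's original value rather than the child's, since the child's write is applied to a private copy of the page.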

Handling Page Faults

  • Under x86, the kernel can specify an interrupt handler for page faults which is invoked by the CPU when one occurs. When the handler function is invoked, an error code describing the cause of the fault is pushed to the stack and the cr2 control register is set to the linear address that caused the fault.

  • The linux page fault handler is do_page_fault(). It retrieves the error code and address before handing off the heavy lifting to __do_page_fault().

  • __do_page_fault() performs a number of checks to handle the error cases, before handing off to the non-arch specific handle_mm_fault() if the fault looks valid. Diagrammatically (simplifying the process somewhat, ignoring kmemcheck and kprobes, some special case fault support, and vsyscall emulation - better to look at the code for those cases):

                                       -------------------
                                       |   CPU invokes   |
                                       | do_page_fault() |
                                       -------------------
                                                |
                                                v
                                    ------------------------
                                    | Retrieve address and |
                                    | error code then call |
                                    |   __do_page_fault()  |
                                    ------------------------
                                                |
                                                v
                                   /------------------------\
                                  /  Did we fault in kernel  \---\
                                  \          space?          /   | Yes
                                   \------------------------/    |
                                      | No                       v
                                      v                 /-----------------\ Yes
   ---------               Yes /-------------\         /  vmalloc_fault()  \----\
   | OOPS! |<-----------------/ Reserved bits \        \      succeed?     /    |
   ---------                  \   modified?   /         \-----------------/     |
                               \-------------/                   | No           |
                                      | No                       v              |
                                      v                    /-----------\ Yes    |
                          Yes /--------------\            /  Was fault  \-------\
   /-------------------------/      SMAP      \           \  spurious?  /       |
   |                         \   violation?   /            \-----------/        |
   |                          \--------------/                   | No           |
   |                                  | No                       |              |
   /----------------------------------(--------------------------/              |
   |                                  v                                         |
   |                  Yes /-----------------------\                             |
   /---------------------/  Fault handler disabled \                            |
   |                     \    or !current->mm?     /                            |
   |                      \-----------------------/                             |
   |                                  | No                                      |
   |                                  v                                         |
   |                         --------------------                               |
   |                         | Find nearest VMA |<--------\                     |
   |                         --------------------         |                     |
   |                                  |                   |                     |
   |                                  v                   |                     |
   |  No /-------------\  No /----------------\           |                     |
   /----/  Is region a  \<--/     Does VMA     \          | No                  |
   |    \     stack?    /   \ contain address? /    /------------\ Yes          |
   |     \-------------/     \----------------/    / Fatal signal \--\          |
   |            | Yes                 | Yes        \   pending?   /  |          |
   |            v                     |             \------------/   |          |
   | Yes /--------------\             |                   ^          |          |
   /----/  Is address <  \            |                   |          |          |
   |    \ stack pointer? /            |                   |          |          |
   |     \--------------/             |                   |          |          |
   |            | No                  |                   |          |          |
   |            v                     v             --------------   |          |
   | No /---------------\ Yes   /-----------\       | Mark retry |   |          |
   /---/  expand_stack() \---> / Permissions \      | disallowed |   |          |
   |   \    succeed?     /     \     OK?     /      --------------   |          |
   |    \---------------/       \-----------/             ^          |          |
   |                         No   |   | Yes               |          |          |
   /------------------------------/   |                   |          |          |
   |                                  v                   |          |          |
   |                        ---------------------         | Yes      |          |
   |                        |        Call       |     /--------\ No  |          |
   |                        | handle_mm_fault() |    /  Retry   \----\          |
   |                        ---------------------    \ allowed? /    |          |
   |                                  |               \--------/     |          |
   |                                  v                   ^          |          |
   |                          /---------------\  Yes      |          v          |
   |                         / Return Value &  \----------/     /---------\ Yes |
   |                         \ VM_FAULT_RETRY? /               /   From    \----\
   |                          \---------------/                \ userland? /    |
   |                                  | No                      \---------/     |
   |                                  v                              | No       |
   |  ------------------  Yes /---------------\                      |          |
   |  | mm_fault_error |<----/ Return value &  \                     |          |
   |  ------------------     \ VM_FAULT_ERROR? /                     |          |
   |           |              \---------------/                      |          |
   |           |                      | No                           |          |
   |           |                      v                              |          |
   |           |              ******************                     |          |
   |           |              * FAULT COMPLETE *<--------------------)----------/
   |           |              ******************                     |
   |           |                                                     |
   |           \---------------\                                     |
   |                           |                                     |
   |                           v                                     |
   |                /--------------------\ Yes                       |
   |               / Fatal signal pending \--------------------------\
   |               \   in kernel mode?    /                          |
   |                \--------------------/                           |
   |                           | No                                  |
   |                           v                                     |
   |                   /--------------\ Yes     /---------\ Yes      |
   |                  /     fault &    \------>/ In kernel \---------\
   |                  \  VM_FAULT_OOM? /       \   mode?   /         |
   |                   \--------------/         \---------/          |
   |                           | No                  | No            |
   |                           v                     v               |
   |  ----------  Yes /-----------------\     --------------         |
   |  |  Send  |<----/ fault & VM_FAULT_ \    | Invoke OOM |         |
   |  | SIGBUS |     \ SIGBUS/HWPOISON*? /    |   killer   |         |
   |  ----------      \-----------------/     --------------         |
   |                           | No                                  |
   |                           v                                     |
   |                  /-----------------\ No  ----------             |
   |                 /      fault &      \--->| BUG()! |             |
   |                 \ VM_FAULT_SIGSEGV? /    ----------             |
   |                  \-----------------/                            |
   |                           | Yes                                 |
   |                           v                                     v
   |             -----------------------------               -----------------
   \------------>| __bad_area_no_semaphore() |       /------>|  no_context() |
                 -----------------------------       |       -----------------
                               |                     |               |
                               v                     |               v
                  /------------------------\ Yes     |     /------------------\ No
                 /  Did we fault in kernel  \--------/    / Is there a kernel  \--\
                 \          space?          /             \ exception handler? /  |
                  \------------------------/               \------------------/   |
                               | No                                  | Yes        |
                               v                                     v            |
                    /-------------------\ Yes               ------------------    |
                   / Attempted access of \-----------\      | Call exception |    |
                   \    kernel memory?   /           |      |     handler    |    |
                    \-------------------/            |      ------------------    |
                               | No                  |                            |
                               v                     v                            |
                       ----------------     -------------------   ---------       |
                       | Send SIGSEGV |<--- | Mark protection |   | OOPS! |<------/
                       ----------------     |      fault      |   ---------
                                            -------------------
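
  • The 'Find nearest VMA' and 'Does VMA contain address?' steps of the chart can be sketched in userland as follows. This is a simplified model under stated assumptions: struct vma_model, VM_GROWSDOWN_MODEL and both functions are hypothetical, a sorted array stands in for the kernel's VMA data structures, and the stack pointer comparison and the actual stack expansion are not modelled:

```c
#include <assert.h>
#include <stddef.h>

#define VM_GROWSDOWN_MODEL 0x1UL        /* hypothetical flag value */

struct vma_model {
        unsigned long vm_start;         /* inclusive */
        unsigned long vm_end;           /* exclusive */
        unsigned long vm_flags;
};

enum area_result { AREA_GOOD, AREA_TRY_EXPAND_STACK, AREA_BAD };

/*
 * Return the first region whose end lies above addr, or NULL if there is
 * none. 'vmas' must be sorted by address and non-overlapping.
 */
static const struct vma_model *
find_vma_model(const struct vma_model *vmas, size_t n, unsigned long addr)
{
        assert(vmas != NULL || n == 0);
        for (size_t i = 0; i < n; i++)
                if (addr < vmas[i].vm_end)
                        return &vmas[i];
        return NULL;
}

static enum area_result
classify_address(const struct vma_model *vmas, size_t n, unsigned long addr)
{
        const struct vma_model *vma = find_vma_model(vmas, n, addr);

        if (!vma)
                return AREA_BAD;                /* above every mapping */
        if (vma->vm_start <= addr)
                return AREA_GOOD;               /* VMA contains address */
        if (!(vma->vm_flags & VM_GROWSDOWN_MODEL))
                return AREA_BAD;                /* in a gap, not below a stack */
        return AREA_TRY_EXPAND_STACK;           /* gap just below a stack */
}
```

  • Note that the nearest region found does not necessarily contain the address - it is only guaranteed to end above it, which is why the containment check is a separate step.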

Kernel faults

  • Page faults in kernel space are not permitted, except for the case of vmalloc memory (discussed in a later section), which is used to allow for virtually contiguous memory.

Page Fault Error Code

  • The error code specified by x86 is a bitfield of enum x86_pf_error_code values providing information on the cause of the page fault:
/*
 * Page fault error code bits:
 *
 *   bit 0 ==    0: no page found       1: protection fault
 *   bit 1 ==    0: read access         1: write access
 *   bit 2 ==    0: kernel-mode access  1: user-mode access
 *   bit 3 ==                           1: use of reserved bit detected
 *   bit 4 ==                           1: fault was an instruction fetch
 *   bit 5 ==                           1: protection keys block access
 */
enum x86_pf_error_code {

        PF_PROT         =               1 << 0,
        PF_WRITE        =               1 << 1,
        PF_USER         =               1 << 2,
        PF_RSVD         =               1 << 3,
        PF_INSTR        =               1 << 4,
        PF_PK           =               1 << 5,
};
  • This is used by __do_page_fault() and the functions it invokes to handle each case correctly. Note that it enables the code to determine whether the access was made from userland or kernel mode, and since we know the faulting address we can also determine whether it is a kernel or userland address.
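
  • As an illustration of how these bits are interpreted, here is a small userland sketch that decodes an error code. The enum values are those listed above; struct pf_cause, decode_pf_error() and print_pf_cause() are hypothetical helpers that do not exist in the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum x86_pf_error_code {
        PF_PROT  = 1 << 0,
        PF_WRITE = 1 << 1,
        PF_USER  = 1 << 2,
        PF_RSVD  = 1 << 3,
        PF_INSTR = 1 << 4,
        PF_PK    = 1 << 5,
};

/* Hypothetical decoded form of the error code, one field per bit. */
struct pf_cause {
        bool protection;        /* page present, but access not permitted */
        bool write;             /* write access (otherwise read) */
        bool user;              /* user-mode access (otherwise kernel) */
        bool reserved_bit;      /* reserved bit set in a paging entry */
        bool instr_fetch;       /* fault was an instruction fetch */
        bool pkey;              /* blocked by protection keys */
};

static struct pf_cause decode_pf_error(unsigned long error_code)
{
        struct pf_cause c = {
                .protection   = (error_code & PF_PROT) != 0,
                .write        = (error_code & PF_WRITE) != 0,
                .user         = (error_code & PF_USER) != 0,
                .reserved_bit = (error_code & PF_RSVD) != 0,
                .instr_fetch  = (error_code & PF_INSTR) != 0,
                .pkey         = (error_code & PF_PK) != 0,
        };
        return c;
}

/* Print a one-line summary; returns the fprintf() result. */
static int print_pf_cause(FILE *out, const struct pf_cause *c)
{
        assert(out != NULL && c != NULL);
        return fprintf(out, "%s, %s access, %s mode%s%s%s\n",
                       c->protection ? "protection fault" : "no page found",
                       c->write ? "write" : "read",
                       c->user ? "user" : "kernel",
                       c->reserved_bit ? ", reserved bit" : "",
                       c->instr_fetch ? ", instruction fetch" : "",
                       c->pkey ? ", protection key" : "");
}
```

  • For example, an error code of PF_WRITE | PF_USER describes a userland write to an address with no page present, which is the common case for demand paging.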

Generic Page Table Fault Handling

  • As shown in the flow chart above, an initial check is made against a number of error states, before the generic (i.e. non-arch specific) handle_mm_fault() is called.

  • handle_mm_fault() in turn calls __handle_mm_fault(), which does the heavy lifting, allocating page tables as necessary.

  • It's interesting to note here that in fact page table mappings might not exist for valid addresses. Looking back at the flow chart, we can see that we don't differentiate between a page fault caused by an invalid virtual address (i.e. no page table mappings for that address) and one where the present bit is not set in the PTE.

  • Instead, we use the VMA to determine if the address is valid (with a special case for stacks.) This means that for large blocks of memory, not only the physical pages but also the page tables required to map those pages can be faulted in on demand.

  • In practice, it seems that multi-page allocations result in the first and last pages being mapped and those in the middle not being. Experiment with the multi-page alloc sample code and the pagetables hack, both from the sister repo linux-vm-hacks to explore this on a local linux system.

  • Physical page allocation is handled via handle_pte_fault(). Firstly, the function needs to handle cases where the PTE is not present:

  1. If the PTE is empty and the VMA is anonymous i.e. not mapping a file, do_anonymous_page() handles allocating an anonymous page.

  2. If the PTE is empty and the VMA is not anonymous, i.e. mapping a file, do_fault() handles allocating the page.

  3. Finally, if the PTE is not present but also not empty, then do_swap_page() is invoked to handle the swapping.

  • In each of the above cases, the return value of the relevant function is returned directly and the rest of handle_pte_fault() does not run.
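
  • The three not-present cases above can be sketched as a small decision function. This is a userland model under stated assumptions, not the kernel code: pte_model_t, PTE_MODEL_PRESENT, enum fault_path and the functions are hypothetical, with a PTE modelled as an integer whose zero value means 'empty' and whose low bit means 'present':

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a PTE: 0 means empty, bit 0 means present. */
typedef unsigned long pte_model_t;
#define PTE_MODEL_PRESENT 0x1UL

enum fault_path { PATH_ANONYMOUS, PATH_FILE, PATH_SWAP, PATH_PRESENT };

static bool pte_model_none(pte_model_t pte)
{
        return pte == 0;
}

static bool pte_model_present(pte_model_t pte)
{
        return (pte & PTE_MODEL_PRESENT) != 0;
}

/* Choose which path handles the fault, mirroring the list above. */
static enum fault_path pick_fault_path(pte_model_t pte, bool vma_is_anon)
{
        if (!pte_model_present(pte)) {
                if (pte_model_none(pte))
                        return vma_is_anon ? PATH_ANONYMOUS : PATH_FILE;
                return PATH_SWAP;       /* not present, but not empty */
        }
        return PATH_PRESENT;            /* handled by the rest of the function */
}

/* Name of the function the list above associates with each path. */
static const char *fault_path_handler_name(enum fault_path path)
{
        switch (path) {
        case PATH_ANONYMOUS: return "do_anonymous_page";
        case PATH_FILE:      return "do_fault";
        case PATH_SWAP:      return "do_swap_page";
        case PATH_PRESENT:   return "(rest of handle_pte_fault)";
        }
        assert(!"unknown fault path");
        return "";
}
```

  • The ordering matters: the 'empty' test is only meaningful once we know the PTE is not present, since a non-empty, non-present PTE is how swapped-out pages are described.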

Page Fault Types

  • Page faults are divided into 3 types - minor/soft, major/hard and error, the latter two cases represented by VM_FAULT_MAJOR and VM_FAULT_ERROR respectively (VM_FAULT_ERROR is a bitmask of error states.)

  • Minor (or soft) page faults are those where either memory simply needs to be faulted in or a CoW copy needs to be made.

  • Major (or hard) page faults are those that require I/O - i.e. a page needs to be swapped back into memory, or a file mapping needs to be flushed or read from.
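
  • The minor/major split is visible from userland via getrusage(), which reports per-process counts in ru_minflt and ru_majflt. The following is a minimal sketch; struct fault_counts, read_fault_counts() and minor_faults_touching() are hypothetical helper names. On a typical linux system one would expect roughly one minor fault per page touched, but the exact count depends on the kernel configuration, so treat the result as indicative only:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

struct fault_counts {
        long minor;     /* ru_minflt: faults serviced without I/O */
        long major;     /* ru_majflt: faults that required I/O */
};

/* Snapshot this process's fault counters; returns -1 on failure. */
static int read_fault_counts(struct fault_counts *out)
{
        struct rusage ru;

        assert(out != NULL);
        if (getrusage(RUSAGE_SELF, &ru) != 0)
                return -1;
        out->minor = ru.ru_minflt;
        out->major = ru.ru_majflt;
        return 0;
}

/*
 * Map npages of anonymous memory, write one byte to each page, and return
 * how many minor faults were counted while doing so (-1 on failure).
 */
static long minor_faults_touching(size_t npages)
{
        long page_size = sysconf(_SC_PAGESIZE);
        struct fault_counts before, after;
        volatile char *mem;
        size_t len, i;
        int ok;

        if (page_size <= 0 || npages == 0)
                return -1;
        len = npages * (size_t)page_size;

        mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED)
                return -1;

        ok = read_fault_counts(&before) == 0;
        for (i = 0; ok && i < npages; i++)
                mem[i * (size_t)page_size] = 1;  /* first touch of each page */
        ok = ok && read_fault_counts(&after) == 0;

        munmap((void *)mem, len);
        return ok ? after.minor - before.minor : -1;
}
```

  • Anonymous memory that has never been swapped out should not need any I/O to fault in, so the touches here are expected to be counted as minor rather than major faults.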