
General Functions

Contents

Address Translation

  • phys_to_virt() - Translates a physical address to a kernel-mapped virtual one.
  • virt_to_phys() - Translates a kernel-mapped virtual address to a physical one.
  • __va() - Does the heavy lifting for phys_to_virt().
  • __pa() - Does the heavy lifting for virt_to_phys().
  • page_address() - Determines the virtual address of the specified physical struct page.
  • pmd_to_page() - Returns the struct page describing the PMD page a PMD entry resides in.
  • virt_to_page() - Returns the struct page describing the physical page a kernel virtual address resides in.
  • page_to_pfn() - Returns the Page Frame Number (PFN) associated with the specified struct page.
  • page_to_phys() - Returns the physical address of the page described by the specified struct page.
  • pfn_to_page() - Returns the struct page associated with the specified Page Frame Number (PFN).

Utility Functions

  • virt_addr_valid() - Determines if a specified virtual address is a valid kernel virtual address.
  • pfn_valid() - Determines if a specified Page Frame Number (PFN) represents a valid physical address.
  • pte_same() - Determines whether the two specified PTE entries refer to the same physical page and flags.

TLB

  • flush_tlb() - Flushes the current struct mm_struct's TLB mappings.
  • flush_tlb_all() - Flushes all TLB entries on all CPUs, including global mappings.
  • flush_tlb_mm() - Flushes the TLB mappings of an entire struct mm_struct address space.
  • flush_tlb_page() - Flushes a single page's TLB mapping.
  • flush_tlb_range() - Flushes the TLB mappings for a range of addresses in a VMA.
  • flush_tlb_mm_range() - Flushes the TLB mappings for a range of addresses in a struct mm_struct.
  • flush_tlb_kernel_range() - Flushes the TLB mappings for a range of kernel virtual addresses.
  • flush_tlb_others() - Flushes the TLB on CPUs other than the one invoking the call.


Function Descriptions

phys_to_virt()

void *phys_to_virt(phys_addr_t address)

phys_to_virt() translates the physical address address to a kernel-mapped virtual one.

Wrapper around __va().

Arguments

  • address - Physical address to be translated.

Returns

Kernel-mapped virtual address.


virt_to_phys()

phys_addr_t virt_to_phys(volatile void *address)

virt_to_phys() translates the kernel-mapped virtual address address to a physical one.

Wrapper around __pa().

Arguments

  • address - Kernel-mapped virtual address to be translated.

Returns

Physical address associated with specified virtual address.


__va()

void *__va(phys_addr_t address)

__va() does the heavy lifting for phys_to_virt(), converting a physical memory address to a kernel-valid virtual one.

This function simply adds the kernel memory offset PAGE_OFFSET, 0xffff880000000000 for x86-64. Looking at the x86-64 memory map, you can see this simply provides a virtual address that is part of the kernel's complete direct mapping of all physical memory between 0xffff880000000000 and 0xffffc7ffffffffff.

The function performs no checks and assumes the supplied physical memory address is valid and in the 64TiB of allowed memory in current x86-64 linux kernel implementations.

NOTE: Macro, inferring function signature.
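
For reference, on x86-64 the definition amounts to a one-line macro applying the offset described above (shown here slightly simplified):

#define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))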


__pa()

phys_addr_t __pa(volatile void *address)

__pa() does the heavy lifting for virt_to_phys(), converting a kernel virtual memory address to a physical one. The function is a wrapper around __phys_addr().

The function isn't quite as simple as __va() as it has to determine whether the supplied virtual address is part of the kernel's direct mapping of all of the physical memory between 0xffff880000000000 and 0xffffc7ffffffffff, or whether it's part of the kernel's 'text' from __START_KERNEL_map on (0xffffffff80000000 for x86-64), and offsets the supplied virtual address accordingly.

In the case of the address originating from the kernel's 'text', the physical offset of the kernel, phys_base is taken into account, which allows for the kernel to be loaded in a different physical location e.g. when kdumping.

NOTE: Macro, inferring function signature.
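
A simplified sketch of the logic __phys_addr() applies (the helper name here is hypothetical and the debug checks are omitted):

static inline unsigned long __phys_addr_sketch(unsigned long x)
{
        if (x >= __START_KERNEL_map)
                /* Kernel 'text' mapping - account for relocation via phys_base. */
                return x - __START_KERNEL_map + phys_base;

        /* Direct mapping of all physical memory. */
        return x - PAGE_OFFSET;
}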


page_address()

void *page_address(const struct page *page)

page_address() determines the virtual address of the specified physical struct page.

In x86-64 the implementation is straightforward and provided by lowmem_page_address() (as we have no high memory to worry about.) We simply obtain the PFN of the page via page_to_pfn(), translate it to a physical address via PFN_PHYS() and return the kernel-mapped virtual address via __va().
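
This therefore boils down to something like the following (a lightly simplified sketch of lowmem_page_address()):

static inline void *lowmem_page_address(const struct page *page)
{
        /* PFN -> physical address -> direct-mapped virtual address. */
        return __va(PFN_PHYS(page_to_pfn(page)));
}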

Arguments

  • page - The physical page whose virtual address we desire.

Returns

The virtual address mapped to the specified physical struct page.


pmd_to_page()

struct page *pmd_to_page(pmd_t *pmd)

pmd_to_page() returns the struct page describing the PMD page (i.e. the physical page containing PMD entries) that the PMD entry pmd resides in.

The function works by masking out the offset of the entry within the page using the mask ~(PTRS_PER_PMD * sizeof(pmd_t) - 1) which ultimately page-aligns the virtual address.

Note that we're dealing with the address of the entry, utterly ignoring the entry's contents, so the virtual address refers to the PMD page rather than what the PMD entry is pointing at.

Finally, the function returns the struct page associated with the PMD page via virt_to_page().

The function is used by the split page table lock functionality in the kernel.
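
A sketch of the masking described above (close to, though not necessarily identical to, the kernel's code):

struct page *pmd_to_page(pmd_t *pmd)
{
        unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);

        /* Page-align the address of the entry, then look up its struct page. */
        return virt_to_page((void *)((unsigned long)pmd & mask));
}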

Arguments

  • pmd - The PMD entry whose containing PMD page we want to obtain the struct page for.

Returns

The struct page describing the PMD page the specified PMD entry belongs to.


virt_to_page()

struct page *virt_to_page(unsigned long kaddr)

virt_to_page() determines the physical address of the specified kernel virtual address, then the Page Frame Number (PFN) of the physical page that contains it, and finally passes this to pfn_to_page() to retrieve the struct page which describes the physical page.

Important: As per the code comment above the function, the returned pointer is valid if and only if virt_addr_valid(kaddr) returns true.

NOTE: Macro, inferring function signature.
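
The macro is approximately defined as:

#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)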

Arguments

  • kaddr - The virtual kernel address whose struct page we desire.

Returns

The struct page describing the physical page the specified virtual address resides in.


page_to_pfn()

unsigned long page_to_pfn(struct page *page)

page_to_pfn() returns the Page Frame Number (PFN) that is associated with the specified struct page.

The PFN of a physical address is simply the (masked) address's value shifted right by the number of bits of the page size, so in a standard x86-64 configuration, 12 bits (equivalent to the default 4KiB page size), and pfn = masked_phys_addr >> 12.

How the PFN is determined varies depending on the memory model, in x86-64 UMA this is __page_to_pfn() under CONFIG_SPARSEMEM_VMEMMAP - the memory map is virtually contiguous at vmemmap (0xffffea0000000000, see the x86-64 memory map.)

This makes the implementation of the function straightforward - simply subtract vmemmap from the page pointer (being careful with typing to have pointer arithmetic take into account sizeof(struct page) for you.)

NOTE: Macro, inferring function signature.
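
Under CONFIG_SPARSEMEM_VMEMMAP the definition amounts to (roughly):

#define __page_to_pfn(page) ((unsigned long)((page) - vmemmap))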

Arguments

  • page - The struct page whose corresponding Page Frame Number (PFN) is desired.

Returns

The PFN of the specified struct page.


page_to_phys()

dma_addr_t page_to_phys(struct page *page)

page_to_phys() returns a physical address for the page described by the specified struct page.

Oddly it seems to return a dma_addr_t, possibly due to use for device I/O (it is declared in arch/x86/include/asm/io.h), however in x86-64 it makes no difference as dma_addr_t and phys_addr_t are the same - unsigned long (64-bit.)

NOTE: Macro, inferring function signature.
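
The definition amounts to shifting the page's PFN left by PAGE_SHIFT (roughly):

#define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)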

Arguments

  • page - The struct page whose physical start address we desire.

Returns

The physical address of the start of the page which the specified struct page describes.


pfn_to_page()

struct page *pfn_to_page(unsigned long pfn)

pfn_to_page() returns the struct page that is associated with the specified Page Frame Number (PFN.)

The PFN of a physical address is simply the (masked) address's value shifted right by the number of bits of the page size, so in a standard x86-64 configuration, 12 bits (equivalent to the default 4KiB page size), and pfn = masked_phys_addr >> 12.

How the struct page is located varies depending on the memory model, in x86-64 UMA this is __pfn_to_page() under CONFIG_SPARSEMEM_VMEMMAP - the memory map is virtually contiguous at vmemmap (0xffffea0000000000, see the x86-64 memory map.)

This makes the implementation of the function straightforward - simply offset the PFN by vmemmap (being careful with typing to have pointer arithmetic take into account sizeof(struct page) for you.)

NOTE: Macro, inferring function signature.
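
Under CONFIG_SPARSEMEM_VMEMMAP the definition amounts to (roughly):

#define __pfn_to_page(pfn) (vmemmap + (pfn))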

Arguments

  • pfn - The Page Frame Number (PFN) whose corresponding struct page is desired.

Returns

The struct page that describes the physical page with specified PFN.

virt_addr_valid()

bool virt_addr_valid(unsigned long kaddr)

virt_addr_valid() determines if the specified virtual address kaddr is actually a valid, non-vmalloc'd kernel address.

The function is a wrapper for __virt_addr_valid() which, once it has checked that the virtual address is in a valid range, checks that it has a valid corresponding physical PFN via pfn_valid().

NOTE: Macro, inferring function signature.
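
A very rough sketch of the two-stage check (hypothetical helper, heavily simplified - the real __virt_addr_valid() also handles kernel 'text' addresses and relocation via phys_base):

static bool virt_addr_valid_sketch(unsigned long kaddr)
{
        /* Reject addresses below the direct mapping of physical memory. */
        if (kaddr < PAGE_OFFSET)
                return false;

        /* Check a real page frame backs the corresponding physical address. */
        return pfn_valid((kaddr - PAGE_OFFSET) >> PAGE_SHIFT);
}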

Arguments

  • kaddr - The virtual address we wish to check for being a valid, non-vmalloc'd kernel address.

Returns

true if the address is valid, false if not.


pfn_valid()

int pfn_valid(unsigned long pfn)

pfn_valid() determines whether the specified Page Frame Number (PFN) is valid, i.e. in x86-64 whether it refers to a valid 46-bit address, and whether there is actually physical memory mapped to that physical location.

NOTE: Macro, inferring function signature.

Arguments

  • pfn - PFN whose validity we wish to determine.

Returns

Truthy (non-zero) if the PFN is valid, 0 if not.


pte_same()

int pte_same(pte_t a, pte_t b)

pte_same() determines whether the two specified PTE entries refer to the same physical page AND share the same flags.

On x86-64 it's as simple as a.pte == b.pte.
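
In other words, something along the lines of:

static inline int pte_same(pte_t a, pte_t b)
{
        return a.pte == b.pte;
}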

Arguments

  • a - The first PTE entry whose physical page address and flags we want to compare.

  • b - The second PTE entry whose physical page address and flags we want to compare.

Returns

1 if the PTE entries' physical address and flags are the same, 0 if not.


flush_tlb()

void flush_tlb(void)

flush_tlb() is a wrapper around flush_tlb_current_task(), which flushes the current struct mm_struct's TLB mappings.

flush_tlb_current_task() checks whether any other CPUs use the current struct mm_struct, and if so invokes flush_tlb_others() to flush the TLB entries for those CPUs too.

Ultimately the flushing is performed by local_flush_tlb() which is a wrapper around __flush_tlb() which itself wraps __native_flush_tlb() which flushes the CPU's TLB by simply reading and writing back the contents of the cr3 register.

NOTE: Macro, inferring function signature.
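
The final step amounts to (a sketch of __native_flush_tlb()):

static inline void __native_flush_tlb(void)
{
        /* Re-writing cr3 flushes all non-global TLB entries for this CPU. */
        native_write_cr3(native_read_cr3());
}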

Arguments

N/A

Returns

N/A


flush_tlb_all()

void flush_tlb_all(void)

flush_tlb_all() flushes the TLB entries for all processes on all CPUs.

Note that this function causes mappings that have _PAGE_GLOBAL to be evicted also, using the invpcid instruction on modern CPUs (via __flush_tlb_all(), __flush_tlb_global() and subsequently invpcid_flush_all().)

Arguments

N/A

Returns

N/A


flush_tlb_mm()

void flush_tlb_mm(struct mm_struct *mm)

flush_tlb_mm() is simply a wrapper around flush_tlb_mm_range() specifying that the whole memory address space mm describes is to be flushed, with no flags specified.

See the description of flush_tlb_mm_range() below for a description of the implementation of the flush.

NOTE: Macro, inferring function signature.
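
The wrapper is approximately:

#define flush_tlb_mm(mm) flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL)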

Arguments

  • mm - The struct mm_struct describing the address space whose TLB mappings we wish to flush.

Returns

N/A


flush_tlb_page()

void flush_tlb_page(struct vm_area_struct *vma, unsigned long start)

flush_tlb_page() flushes a single page's TLB mapping at the specified address.

If struct vm_area_struct specified by vma does not refer to the active struct mm_struct, then the operation is a no-op on this CPU, as there's nothing to flush.

If the current process is a kernel thread (i.e. current->mm == NULL), which indicates the current CPU is in a lazy TLB mode, we invoke leave_mm() to switch out the struct mm_struct we are 'borrowing' and we're done.

Otherwise, ultimately the invlpg instruction is invoked (ignoring the paravirtualised case where a hypervisor function is called directly) via __flush_tlb_one(), __flush_tlb_single(), and __native_flush_tlb_single().

Intel's documentation on the invlpg instruction indicates that the specified address does not need to be page aligned, and in the case of pages larger than 4KiB in size with multiple TLBs for that page, all will be safely flushed. Additionally, under certain circumstances more or even all TLB entries may be flushed, however this is presumably unlikely.
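
The single-page eviction ultimately amounts to (a sketch of __native_flush_tlb_single()):

static inline void __native_flush_tlb_single(unsigned long addr)
{
        asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}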

Arguments

  • vma - The struct vm_area_struct which contains the struct mm_struct which in turn describes the page whose TLB entry we want to flush.

  • start - A virtual address contained in the page we want to flush.

Returns

N/A


flush_tlb_range()

void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
                     unsigned long end)

flush_tlb_range() is a wrapper around flush_tlb_mm_range(), it uses the struct mm_struct belonging to the specified struct vm_area_struct vma, i.e. vma->vm_mm, as well as the vma's flags, i.e. vma->vm_flags and simply passes these on:

#define flush_tlb_range(vma, start, end)        \
                flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)

See flush_tlb_mm_range() below for more details as to how the range flush is achieved.

NOTE: Macro, inferring function signature.

Arguments

  • vma - The struct vm_area_struct which contains the range of addresses we wish to TLB flush.

  • start - The start of the range of virtual addresses we want to TLB flush.

  • end - The exclusive upper bound of the range of virtual addresses we wish to TLB flush.

Returns

N/A


flush_tlb_mm_range()

void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
                        unsigned long end, unsigned long vmflag)

flush_tlb_mm_range() causes a TLB flush to occur in the virtual address range specified (note that end is an exclusive bound.)

If the current process's struct task_struct's active_mm field doesn't match the specified mm argument, there's nothing to do so we exit.

If the current process is a kernel thread (i.e. current->mm == NULL), which indicates the current CPU is in a lazy TLB mode, we invoke leave_mm() to switch out the struct mm_struct we are 'borrowing' and we're done.

If the vmflag flag field has its VM_HUGETLB bit set, i.e. the page range includes huge pages, or the range is specified to be a full flush range (i.e. end == TLB_FLUSH_ALL), then a full flush is performed.

Otherwise, the number of pages to flush is compared to the tlb_single_page_flush_ceiling variable, which is set at 33 by default and tunable via /sys/kernel/debug/x86/tlb_single_page_flush_ceiling. If this value is set to 0, page-granularity flushes are not performed at all.

If the number of pages to flush exceeds this value, then a full flush is performed instead. The Documentation/x86/tlb.txt documentation goes into more detail as to the trade off between a full flush and individual flushes, and the code comment above this value explains:

/*
 * See Documentation/x86/tlb.txt for details.  We choose 33
 * because it is large enough to cover the vast majority (at
 * least 95%) of allocations, and is small enough that we are
 * confident it will not cause too much overhead.  Each single
 * flush is about 100 ns, so this caps the maximum overhead at
 * _about_ 3,000 ns.
 *
 * This is in units of pages.
 */

If any of the conditions for a full flush are met, local_flush_tlb() is called.

Otherwise, each page is flushed via __flush_tlb_single(), and ultimately the x86 invlpg instruction is invoked to perform the page-level flushing, unless paravirtualisation (e.g. xen) is in place in which case a hypervisor function is called directly.

If the struct mm_struct is in use by other CPUs, flush_tlb_others() is invoked to perform a TLB flush for those CPUs also.
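
A sketch of the flush-strategy decision described above (a hypothetical helper, not the kernel's exact code):

static bool want_full_flush(unsigned long start, unsigned long end,
                            unsigned long vmflag)
{
        /* Huge pages or an explicit 'flush everything' request force a full flush. */
        if ((vmflag & VM_HUGETLB) || end == TLB_FLUSH_ALL)
                return true;

        /* Otherwise flush page-by-page only if the range is small enough. */
        return ((end - start) >> PAGE_SHIFT) > tlb_single_page_flush_ceiling;
}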

Arguments

  • mm - The struct mm_struct which contains the range of addresses we wish to TLB flush.

  • start - The start of the range of virtual addresses we want to TLB flush.

  • end - The exclusive upper bound of the range of virtual addresses we wish to TLB flush.

  • vmflag - The struct vm_area_struct flags associated with this region of memory, used only to determine if VM_HUGETLB is set.

Returns

N/A


flush_tlb_kernel_range()

void flush_tlb_kernel_range(unsigned long start, unsigned long end)

flush_tlb_kernel_range() TLB flushes the specified range of kernel virtual addresses.

If the end argument is set to TLB_FLUSH_ALL, or the specified address range spans more pages than the tlb_single_page_flush_ceiling variable allows (see the description of flush_tlb_mm_range() above for more details on this), a global flush is performed on each CPU via __flush_tlb_all(). This is necessary as kernel mappings are marked _PAGE_GLOBAL.

Otherwise, individual pages are flushed one-by-one via __flush_tlb_single().
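
A sketch of the logic (a hypothetical simplification - in the kernel the flush is dispatched to every CPU):

static void flush_tlb_kernel_range_sketch(unsigned long start, unsigned long end)
{
        unsigned long addr;

        if (end == TLB_FLUSH_ALL ||
            (end - start) > (tlb_single_page_flush_ceiling << PAGE_SHIFT)) {
                /* Global flush - kernel mappings are marked _PAGE_GLOBAL. */
                __flush_tlb_all();
        } else {
                for (addr = start; addr < end; addr += PAGE_SIZE)
                        __flush_tlb_single(addr);
        }
}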

Arguments

  • start - The start of the range of virtual kernel addresses we want to TLB flush.

  • end - The exclusive upper bound of the range of virtual kernel addresses we wish to TLB flush.

Returns

N/A


flush_tlb_others()

void flush_tlb_others(const struct cpumask *cpumask,
                      struct mm_struct *mm, unsigned long start,
                      unsigned long end)

flush_tlb_others() flushes the TLB for CPUs other than the one invoking the call. It's used by the other TLB flush functions to ensure that TLB flushes are performed across all pertinent CPUs (i.e. each CPU which references the struct mm_struct or, in the case of full flushes, all CPUs.)

Other CPUs are made to perform a flush via Inter-Processor Interrupts (IPIs) using smp_call_function_many(), each of which invokes flush_tlb_func().

flush_tlb_func() operates as follows:

If the current->active_mm struct mm_struct is not the same as the one requested to be flushed, the function exits. Additionally, if the CPU TLB state is set to lazy TLB, leave_mm() is invoked to switch out the struct mm_struct we are 'borrowing' and we're done.

Otherwise, if a full flush is requested it is performed via local_flush_tlb(); if not, individual pages are evicted via __flush_tlb_single(). Interestingly, no check against tlb_single_page_flush_ceiling is performed, presumably because flush_tlb_others() is only used by other TLB flushing functions which will already have performed this check.

flush_tlb_func() has some tricky timing concerns because of the invocation of an IPI and the possibility of a struct mm_struct being switched out from under the function via switch_mm(), as described by its comment:

/*
 * The flush IPI assumes that a thread switch happens in this order:
 * [cpu0: the cpu that switches]
 * 1) switch_mm() either 1a) or 1b)
 * 1a) thread switch to a different mm
 * 1a1) set cpu_tlbstate to TLBSTATE_OK
 *      Now the tlb flush NMI handler flush_tlb_func won't call leave_mm
 *      if cpu0 was in lazy tlb mode.
 * 1a2) update cpu active_mm
 *      Now cpu0 accepts tlb flushes for the new mm.
 * 1a3) cpu_set(cpu, new_mm->cpu_vm_mask);
 *      Now the other cpus will send tlb flush ipis.
 * 1a4) change cr3.
 * 1a5) cpu_clear(cpu, old_mm->cpu_vm_mask);
 *      Stop ipi delivery for the old mm. This is not synchronized with
 *      the other cpus, but flush_tlb_func ignore flush ipis for the wrong
 *      mm, and in the worst case we perform a superfluous tlb flush.
 * 1b) thread switch without mm change
 *      cpu active_mm is correct, cpu0 already handles flush ipis.
 * 1b1) set cpu_tlbstate to TLBSTATE_OK
 * 1b2) test_and_set the cpu bit in cpu_vm_mask.
 *      Atomically set the bit [other cpus will start sending flush ipis],
 *      and test the bit.
 * 1b3) if the bit was 0: leave_mm was called, flush the tlb.
 * 2) switch %%esp, ie current
 *
 * The interrupt must handle 2 special cases:
 * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm.
 * - the cpu performs speculative tlb reads, i.e. even if the cpu only
 *   runs in kernel space, the cpu could load tlb entries for user space
 *   pages.
 *
 * The good news is that cpu_tlbstate is local to each cpu, no
 * write/read ordering problems.
 */

/*
 * TLB flush funcation:
 * 1) Flush the tlb entries if the cpu uses the mm that's being flushed.
 * 2) Leave the mm if we are in the lazy tlb mode.
 */

NOTE: Macro, inferring function signature.

Arguments

  • cpumask - The mask representing the CPUs which are to be TLB flushed.

  • mm - The struct mm_struct which contains the range of addresses we wish to TLB flush.

  • start - The start of the range of virtual addresses we want to TLB flush.

  • end - The exclusive upper bound of the range of virtual addresses we wish to TLB flush.

Returns

N/A