- The kernel can only directly address memory for which it has set up a page table entry. In the common case, the user/kernel address space split of 3GiB/1GiB implies that, at best, only 896MiB of memory may be directly accessed at any given time on a 32-bit machine (see 4.1.)
- On 64-bit hardware this is not an issue as there are vast amounts of virtual address space, however 32-bit is another story.
- On 32-bit hardware Linux temporarily maps pages from high memory into the lower page tables (discussed further in 9.2.)
- In I/O operations, not all devices are able to address high memory or even all the memory available to the CPU (consider PAE.) Asking the device to write to memory it cannot address will fail at best and corrupt kernel memory at worst. To work around this issue we can use a 'bounce buffer', which is discussed in 9.5.
- The 'Persistent Kernel Map' (PKMap) address space is reserved at the top of the kernel page tables from `PKMAP_BASE` to `FIXADDR_START`.
- On i386, `PKMAP_BASE` is `0xfe000000`, and `FIXADDR_START` varies depending on configuration options but is typically only a few pages from the end of the linear address space - this leaves us around 32MiB of page table space for mapping pages from high memory into usable space.
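- As a rough sketch of where that ~32MiB figure comes from (the `EXAMPLE_` values below are illustrative only - in particular the `FIXADDR_START` value varies with configuration):

```c
/* Illustrative i386 numbers only - not the kernel's definitions. */
#define EXAMPLE_PKMAP_BASE     0xfe000000UL
#define EXAMPLE_FIXADDR_START  0xffffe000UL   /* a few pages below 4GiB */

/* ~0x01ffe000 bytes, i.e. just under 32MiB of virtual space for the PKMap */
#define EXAMPLE_PKMAP_WINDOW   (EXAMPLE_FIXADDR_START - EXAMPLE_PKMAP_BASE)
```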
- For mapping pages, a single page's worth of PTEs is stored at the beginning of the PKMap area to allow 1,024 high pages to be mapped into low memory for short periods via `kmap()` and to be unmapped with `kunmap()`.
- The pool of 1,024 pages (4MiB) seems very small, however pages mapped by `kmap()` are only mapped for a very short period of time.
- Comments in the code suggest that there was a plan to allocate contiguous page table entries to expand this area, however this hasn't been done as of 2.4.22, so a large portion of the PKMap is unused.
- The page table entry for use with `kmap()` is `pkmap_page_table`, which is located at `PKMAP_BASE` and set up during system initialisation (on i386 this takes place at the end of `pagetable_init()`.)
- The pages for the PGD and PMD entries are allocated by the boot memory allocator to ensure they exist.
- The state of these page table entries is managed by `pkmap_count`, which has `LAST_PKMAP` entries in it - on i386 without PAE this is 1,024 and with PAE it is 512. More accurately, though the code doesn't make it clear, `LAST_PKMAP` is equivalent to `PTRS_PER_PTE`.
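- As a sketch, the arithmetic behind those numbers (the `EXAMPLE_` names are made up; only the standard i386 page and PTE sizes are assumed):

```c
/* One page of PTEs covers the whole PKMap window:
 *   no PAE: 4096-byte page / 4-byte PTE = 1024 slots -> 4MiB of mappings
 *   PAE:    4096-byte page / 8-byte PTE =  512 slots -> 2MiB of mappings
 * which is why LAST_PKMAP always equals PTRS_PER_PTE. */
#define EXAMPLE_LAST_PKMAP  (4096 / sizeof(pte_t))         /* 1024, or 512 with PAE */
#define EXAMPLE_PKMAP_SIZE  (EXAMPLE_LAST_PKMAP * 4096UL)  /* 4MiB, or 2MiB with PAE */
```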
- Each element in `pkmap_count` is not exactly a reference count, but it's very close - if the entry is 0, the page is free and has not been used since the last TLB flush. If it's 1, the slot is unused, but a page is still mapped there waiting for a TLB flush. Finally, if it's any higher value `n`, the slot has `n - 1` users.
- Flushes are delayed until every slot has been used at least once, because a global flush is required for all CPUs when the global page tables are modified and this is extremely expensive.
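- To make that convention concrete, here is a pair of purely illustrative helpers (the kernel open-codes these checks rather than defining functions like this):

```c
/* Sketch of the pkmap_count convention - illustrative only. */
static int pkmap_slot_users(int nr)
{
        /* n > 1 means n - 1 users; 0 and 1 both mean "no current users". */
        return pkmap_count[nr] > 1 ? pkmap_count[nr] - 1 : 0;
}

static int pkmap_slot_awaiting_flush(int nr)
{
        /* 1 == unused, but still mapped until flush_all_zero_pkmaps() runs. */
        return pkmap_count[nr] == 1;
}
```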
- Let's take a look at the API for mapping pages from high memory:
- `kmap()` - Takes a `struct page` from high memory and maps it into low memory, returning a virtual address of the mapping.
- `kmap_nonblock()` - Same as `kmap()`, only it won't block if slots are not available and instead returns `NULL`. This isn't the same as `kmap_atomic()`, which uses specially reserved slots.
- `kmap_atomic()` - This uses slots maintained in the map for atomic use by interrupts (see 9.4 for more details), however its use is highly discouraged and callers may not sleep or schedule. You really have to require atomic high memory mapping to use this.
- The kmap pool is rather small so it's vital that callers call `kunmap()` as quickly as possible, as the pressure on this small window grows incrementally worse as the size of high memory grows in comparison to low memory.
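- A minimal usage sketch of the API (the helper is hypothetical; the page is assumed to possibly live in high memory):

```c
#include <linux/highmem.h>
#include <linux/string.h>

/* Hypothetical helper: zero a page that may live in high memory. */
static void zero_possibly_high_page(struct page *page)
{
        /* kmap() may sleep, so never call this from interrupt context. */
        char *vaddr = kmap(page);

        memset(vaddr, 0, PAGE_SIZE);

        /* Release the PKMap slot as soon as we're done - the pool is tiny. */
        kunmap(page);
}
```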
- `kmap()` itself is fairly simple:
- Check to ensure an interrupt is not calling the function (as it may sleep), calling `out_of_line_bug()` if so - we call this rather than `BUG()` because the latter would panic the system if called from an interrupt handler (LS - it seems `out_of_line_bug()` calls `BUG()` though??)
- Check whether the page is below `highmem_start_page`, because pages below this level are already visible and do not need to be mapped.
- If the page is already in low memory, simply return its address. This step is important - it means users can call `kmap()` unconditionally, knowing that if it is already a low memory page the function can still be used safely.
- Finally, call `kmap_high()` to do the real work. `kmap_high()` (and subsequently `map_new_virtual()`) does the following:
- Check the `page->virtual` field, which is set if the page is already mapped. If it's `NULL`, `map_new_virtual()` provides a mapping for the page.
- `map_new_virtual()` simply linearly scans `pkmap_count`, starting at `last_pkmap_nr` to avoid searching the same areas repeatedly between `kmap()`s.
- When `last_pkmap_nr` wraps around to 0, `flush_all_zero_pkmaps()` sets all entries from 1 to 0 before flushing the TLB.
- If, after another scan, an entry is still not found, the process sleeps on the `pkmap_map_wait` wait queue until it is woken up after the next `kunmap()`.
- After a mapping has been created, the corresponding entry in the `pkmap_count` array is incremented and the virtual address in low memory is returned.
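- A heavily simplified sketch of that scan (the real `map_new_virtual()` also sets the PTE, sleeps on `pkmap_map_wait` when nothing is free, and rechecks `page->virtual` after waking):

```c
/* Sketch only - not the kernel's map_new_virtual(). */
static unsigned long find_free_pkmap_slot(void)
{
        int count;

        for (count = LAST_PKMAP; count > 0; count--) {
                last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
                if (!last_pkmap_nr) {
                        /* Wrapped around: reclaim every stale (count == 1)
                         * slot with a single, expensive global TLB flush. */
                        flush_all_zero_pkmaps();
                }
                if (!pkmap_count[last_pkmap_nr])
                        return PKMAP_ADDR(last_pkmap_nr);
        }
        return 0;       /* nothing free - caller must sleep and retry */
}
```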
- Taking a look at the API for unmapping pages from high memory:
- `kunmap()` - Unmaps a `struct page` from low memory and frees up the page table entry that was mapping it.
- `kunmap_atomic()` - Unmaps a page that was mapped atomically via `kmap_atomic()`.
- `kunmap()`, similar to `kmap()`, checks that the caller is not in interrupt context and that the page is not below `highmem_start_page` (in which case there is nothing to unmap), before calling `kunmap_high()` to do the real work of unmapping the page.
- `kunmap_high()` decrements the corresponding element for the page in `pkmap_count`. If it reaches 1 (no users but a TLB flush required), any process waiting on the `pkmap_map_wait` wait queue is woken up because a slot is now available.
- `kunmap_high()` does not unmap the page from the page tables, because that would require an expensive TLB flush; this is delayed until `flush_all_zero_pkmaps()` is called.
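- A simplified sketch of that logic (the real `kunmap_high()` holds `kmap_lock` around all of this and checks for a missing mapping):

```c
/* Sketch only - locking and error checks trimmed. */
static void kunmap_high_sketch(struct page *page)
{
        unsigned long nr = PKMAP_NR((unsigned long)page->virtual);

        switch (--pkmap_count[nr]) {
        case 0:
                BUG();          /* unbalanced kunmap() */
        case 1:
                /* Last user gone: the stale PTE waits for
                 * flush_all_zero_pkmaps(), but a sleeper may now
                 * claim the slot. */
                wake_up(&pkmap_map_wait);
        }
}
```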
- The use of `kmap_atomic()` is discouraged, however slots are reserved for each CPU for when they are necessary, such as when 'bounce buffers' are used by devices from interrupt context.
- There are a number of different requirements an architecture has for atomic high memory mappings, and these are enumerated by `enum km_type`:
```c
enum km_type {
        KM_BOUNCE_READ,
        KM_SKB_SUNRPC_DATA,
        KM_SKB_DATA_SOFTIRQ,
        KM_USER0,
        KM_USER1,
        KM_BH_IRQ,
        KM_SOFTIRQ0,
        KM_SOFTIRQ1,
        KM_TYPE_NR
};
```
- As declared here there are `KM_TYPE_NR` total uses, which on i386 is 8 (LS - the book says 6?!)
- `KM_TYPE_NR` entries per processor are reserved at boot time for atomic mappings, starting at the location `FIX_KMAP_BEGIN` and ending at `FIX_KMAP_END`. These are members of `enum fixed_addresses` and can be converted to a virtual address via `__fix_to_virt()`.
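- A sketch of how a per-CPU atomic slot translates to a virtual address, following the i386 arrangement described above (the helper name is made up):

```c
#include <asm/fixmap.h>
#include <asm/kmap_types.h>
#include <linux/smp.h>

/* Hypothetical helper mirroring the 2.4 i386 scheme. */
static unsigned long atomic_kmap_vaddr(enum km_type type)
{
        unsigned int idx = type + KM_TYPE_NR * smp_processor_id();

        return __fix_to_virt(FIX_KMAP_BEGIN + idx);
}
```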
- A user of an atomic kmap, as you might imagine, may not sleep or exit before calling `kunmap_atomic()`, because the next process on the processor may try to use the same entry and fail.
- `kmap_atomic()` has the simple task of mapping the requested page into the slot set aside in the page tables for the requested type of operation and processor.
- `kunmap_atomic()` on the other hand has an interesting quirk - it will only clear the PTE via `pte_clear()` if debugging is enabled. This is because unmapping atomic pages is considered unnecessary: the next call to `kmap_atomic()` will simply replace the mapping, making TLB flushes unnecessary.
- Bounce buffers are required for devices that cannot access the full range of memory available to the CPU.
- An obvious example is when the device cannot address as many bits as the CPU, e.g. a 32-bit device on a 64-bit architecture, or a 32-bit device on a 32-bit PAE-enabled system.
- The basic concept is very simple - a bounce buffer resides in memory low enough for a device to copy data from and write data to (i.e. to allow DMA to occur.) Once the I/O completes, the kernel copies the data to the buffer page in high memory, making the bounce buffer a type of bridge.
- There is significant overhead to this operation because, at the very least, it involves copying a full page, but this performance penalty is insignificant in comparison to swapping out pages in low memory.
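- A conceptual sketch of the 'bridge' for the write case (the real code works on `buffer_head`s via `copy_from_high_bh()`; this helper is made up):

```c
#include <linux/highmem.h>
#include <linux/string.h>

/* Stage a high-memory page into a low-memory bounce page so that a
 * DMA-limited device can write it out. Conceptual sketch only. */
static void stage_for_write(struct page *high_page, struct page *bounce_page)
{
        char *from = kmap_atomic(high_page, KM_USER0);
        char *to   = page_address(bounce_page); /* low memory, always mapped */

        memcpy(to, from, PAGE_SIZE);
        kunmap_atomic(from, KM_USER0);

        /* The device then DMAs from bounce_page; for reads the copy runs
         * the other way in the I/O completion callback. */
}
```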
- Blocks, typically around 1KiB in size, are packed into pages and managed via a `struct buffer_head` allocated by the slab allocator.
- Users of buffer heads have the option of registering a callback function, which is stored in `buffer_head->b_end_io()` and called when I/O completes. Bounce buffers use this mechanism to copy data out of the bounce buffers via the registered callback `bounce_end_io_write()` (and subsequently `bounce_end_io()`.)
- More details on buffer heads/the block layer/etc. are rather beyond the scope of the book.
- Creating a bounce buffer is fairly straightforward and done via create_bounce() - it creates a new buffer using the specified buffer head as a template. Let's take a look at what it does in detail:
- A page is allocated for the buffer itself via `alloc_bounce_page()`, which is a wrapper around `alloc_page()` with one important addition - if the allocation is unsuccessful, there is an emergency pool of pages and buffer heads available for bounce buffers (see 9.6.)
- The buffer head is allocated via `alloc_bounce_bh()` which, similar to `alloc_bounce_page()`, calls the slab allocator for a `struct buffer_head` and uses the emergency pool if one cannot be allocated. Additionally, `bdflush` is woken up via `wakeup_bdflush()` to start flushing dirty buffers out to disk, so that buffers are more likely to be freed soon.
After the page and
buffer_head
have been allocated, data is copied from the templatebuffer_head
into the new one. Because part of this operation may use kmap_atomic(), bounce buffers are only created with the IRQ-safe io_request_lock held. -
- The I/O completion callbacks are changed to be either `bounce_end_io_write()` or `bounce_end_io_read()` so data will be copied to/from high memory.
- GFP flags are set such that no I/O operations involving high memory may be performed - this is specified via `SLAB_NOHIGHIO` to the slab allocator and `GFP_NOHIGHIO` to the buddy allocator. This matters because bounce buffers are used for I/O operations with high memory - if the allocator tries to perform high memory I/O, it will recurse and eventually cause a crash.
- Data is copied via the bounce buffer differently depending on whether it's a read or a write buffer. If writing to the device, the buffer is populated with data from high memory during bounce buffer creation via `copy_from_high_bh()`, and the callback function `bounce_end_io_write()` will complete the I/O later when the device is ready for the data.
- When reading from the device, no data transfer may take place until the device is ready. When it is, the interrupt handler for the device calls the callback function `bounce_end_io_read()`, which copies the data to high memory via `copy_to_high_bh_irq()`.
- In either case, the buffer head and page may be reclaimed via `bounce_end_io()` after the I/O has completed, and the I/O completion function for the template `struct buffer_head` is called.
- After the operation is complete, if the emergency pools are not full, the resources are added there, otherwise they are freed via their respective allocators.
- Two emergency pools, one of `struct buffer_head`s and one of pages, are maintained for bounce buffers alone. The reason for this is that if memory is too tight for allocations, failing to complete I/O requests only compounds the problem - buffers from high memory cannot be freed until low memory is available, so processes halt and are unable to free up their own memory.
- The pools are initialised via `init_emergency_pool()` to contain `POOL_SIZE` entries each (currently hard-coded to 32.) The pages are linked via the `page->list` field on the `emergency_pages` list.
- Looking at how pages are acquired from emergency pools diagrammatically:
```
 alloc_bounce_page()
         |
         v
 alloc_page(GFP_NOHIGHIO)
         |
         v
  page returned? ----yes----> return page
         |
         no
         |
         v
 acquire the emergency_lock spinlock and take a page from the head of
 the emergency_pages list - a list of at most POOL_SIZE struct pages,
 linked via page->list and filled by init_emergency_pool()
         |
         v
    return page
```
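- A simplified code sketch of the same flow (the real `alloc_bounce_page()` also wakes `bdflush` and retries rather than ever returning `NULL`):

```c
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/* Sketch only - not the kernel's exact alloc_bounce_page(). */
static struct page *alloc_bounce_page_sketch(void)
{
        struct page *page;
        unsigned long flags;

        /* First try the buddy allocator, forbidding high memory I/O. */
        page = alloc_page(GFP_NOHIGHIO);
        if (page)
                return page;

        /* Fall back to the emergency pool. */
        spin_lock_irqsave(&emergency_lock, flags);
        if (!list_empty(&emergency_pages)) {
                page = list_entry(emergency_pages.next, struct page, list);
                list_del(&page->list);
                nr_emergency_pages--;
        }
        spin_unlock_irqrestore(&emergency_lock, flags);

        return page;    /* still NULL here if the pool was exhausted too */
}
```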
- Similarly, `buffer_head`s are linked via `buffer_head->inode_buffers` on the `emergency_bhs` list.
- Both `emergency_pages` and `emergency_bhs` are protected by the `emergency_lock` spinlock, with emergency pages counted via `nr_emergency_pages` and emergency buffer heads counted via `nr_emergency_bhs`.
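- A sketch of the other direction - returning a bounce page to the pool once I/O has completed, simplified from the behaviour of `bounce_end_io()` described above (the helper name is made up):

```c
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/* Sketch only - simplified pool-return logic. */
static void return_bounce_page(struct page *page)
{
        unsigned long flags;

        spin_lock_irqsave(&emergency_lock, flags);
        if (nr_emergency_pages < POOL_SIZE) {
                /* Pool not yet full: keep the page for future emergencies. */
                list_add(&page->list, &emergency_pages);
                nr_emergency_pages++;
                page = NULL;
        }
        spin_unlock_irqrestore(&emergency_lock, flags);

        if (page)
                __free_page(page);      /* pool already full - give it back */
}
```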