-
A running system will eventually use all available page frames for disk buffers, struct dentry and struct inode objects, process pages, and so on.
-
Linux needs to select old pages that can be freed and invalidated for new uses before physical memory is exhausted - this chapter focuses on how Linux implements its 'page replacement policy' and how different types of pages are invalidated.
-
The way linux selects pages is rather empirical and contains a mix of ideas moderated by user feedback and benchmarks.
-
The page cache is a cache of all data that is read from disk to reduce the amount of disk I/O required. Strictly speaking this is not directly related to page frame reclamation, however the LRU lists and page cache are closely related.
-
With the exception of the slab allocator, all pages in use by the system are stored on LRU lists and linked together via struct page->lru so they can be easily scanned for replacement. -
The slab pages are not stored on the LRU lists as it's considerably harder to age a page based on the objects used by the slab.
-
Process-mapped pages are not easily swappable because there is no easy way to map struct pages to PTEs except by searching every page table which is very expensive.
-
If the page cache has a large number of process-mapped pages in it, process page tables will be walked and pages will be swapped out via swap_out() until enough pages have been freed.
-
Regardless, swap_out() will have trouble dealing with shared pages - if a page is shared, a swap entry is allocated, the PTE is filled with the necessary information to find the page in swap again and the reference count is decremented. Only when the count reaches 0 will it be freed - pages like this are considered to be in the 'swap cache'. -
Page replacement is performed via the kswapd daemon.
-
The page replacement policy is said to be an LRU-based algorithm, but technically this isn't quite true because the lists are not strictly maintained in LRU order.
-
The LRU in Linux consists of two lists - active_list and inactive_list. The idea is for the active_list to contain the working set of all processes and the inactive_list to contain reclaim candidates. -
Since we store all reclaimable pages in two lists and pages belonging to any process may be reclaimed rather than just those belonging to a faulting process, the replacement policy is global.
-
The lists resemble a simplified LRU 2Q, in which two lists are maintained - Am and A1. A1 is a FIFO queue, and if a page is referenced while on that queue it is placed in Am, which is a normal LRU-managed list. -
The 2Q approach is roughly analogous to using lru_cache_add() to place pages on the inactive_list (A1 in the analogy) and using mark_page_accessed() to move them to the active_list (Am in the analogy.) -
The 2Q algorithm specifies how the sizes of the two lists have to be tuned, but Linux takes a simpler approach by using refill_inactive() to move pages from the bottom of the active_list to the inactive_list to keep the active_list about two-thirds the size of the total page cache. Diagrammatically:
------------------- -------------------
| activate_page() | | lru_cache_add() |
------------------- -------------------
| |
| |
v v
----------------------------- -------------------------------
| add_page_to_active_list() | /->| add_page_to_inactive_list() |
----------------------------- -------------------------------
| | |
| |
v | v
--------------- -----------------
| list head |<------\ | | list head |
--------------------- |-------------| | |---------------|
| refill_inactive() |<-\ Page | | | | | |
--------------------- | removed by | active_list | | | inactive_list |
| refill_ | | | | | |
| | inactive() |-------------| | |---------------|
v \ - - - - - -| list tail | | | | list tail |
----------------------- --------------- | -----------------
| Move nr_pages pages | | | |
| from active_list to | | |
| inactive_list | | | v
----------------------- | ----------------
| active_list | | | page reclaim |
/------------>| "rotates" | ----------------
| v | |
| /---------------------\ no /-------------------\ |
| / nr_pages != 0 && \------->/ Clear reference bit \---/ |
| \ active_list has pages / \ Was it set? / yes
| \---------------------/ \-------------------/ |
| | yes | no
| | V |
| v --------------
| ------------------ | nr_pages-- | |
| | nr_pages moved | --------------
| | Job complete | | |
| ------------------ |
| v |
| ------------------------------- page moves to
| | del_page_from_active_list() | | inactive_list
\-------------------------------| Set referenced bit |-/
| add_page_to_inactive_list() |
-------------------------------
-
The list described for 2Q presumes Am is an LRU list, but the Linux equivalent (active_list) is more like a clock algorithm in which the 'handspread' is the size of the active list. -
When pages reach the bottom of the active_list the referenced flag is checked. If it's set, the page is moved back to the top of the list and the next page is examined. If it's not set, the page is moved to the inactive_list. In either case, the flag is cleared. -
The 'move-to-front' heuristic means that the lists behave in an LRU-like manner, but there are just too many differences between the linux replacement policy and LRU to consider it a 'stack' algorithm.
-
Other differences are that the lists are not strictly ordered by priority, because that would require a list update with every reference, and the lists are all but ignored when paging out from processes, because page-out is decided based on the location of pages in the virtual address space of the process rather than their location within the page lists.
-
To summarise, the algorithm exhibits LRU-like behaviour and has been shown by benchmarks to perform well in practice.
-
There are two cases where the algorithm performs badly:
-
When candidates for reclamation are mostly anonymous pages - in this case linux will keep on examining a large number of pages before linearly scanning process page tables searching for the pages to reclaim. Thankfully this is rare.
-
When a single process has a large number of file-backed resident pages in the inactive_list that are being written to frequently. In this case processes and kswapd may go into a loop of constantly 'laundering' these pages and placing them at the top of the inactive_list without freeing anything - in this case few pages are moved from the active_list to the inactive_list.
- The page cache is a set of data structures that contain pages that are backed by ordinary files, block devices, or swap. There are basically 4 types of page that exist in the cache:
-
Pages that were 'faulted in' as a result of reading a memory-mapped file.
-
Pages read from a block device or filesystem - these are special pages known as 'buffer pages'. The number of blocks that may fit there depends on the size of the block and the page size of the architecture.
-
Anonymous pages that live in the 'swap cache', a special part of the page cache, when slots are allocated in the backing storage for page-out. Discussed further in chapter 11.
-
Pages belonging to shared memory regions are treated similarly to anonymous pages - the only difference is that shared pages are added to the swap cache and space is reserved in backing storage immediately after the first write to the page.
-
The main reason for the existence of the page cache is to eliminate unnecessary disk reads. Pages read from disk are stored in a page hash table, which is hashed on the struct address_space and the offset. The hash is always searched before resorting to a disk read.
-
Let's take a look at the page cache API:
-
add_to_page_cache() - Adds a page to the LRU via lru_cache_add() in addition to adding it to the inode queue and page hash tables.
-
add_to_page_cache_unique() - Same as add_to_page_cache() except it checks to make sure the page isn't already there. Required when the caller doesn't hold the pagecache_lock. -
remove_inode_page() - Removes a page from the inode and hash queues via remove_page_from_inode_queue() and remove_page_from_hash_queue() which effectively removes the page from the page cache altogether.
-
page_cache_alloc() - Wrapper around alloc_pages() that uses the struct address_space's gfp_mask field as the GFP mask. -
page_cache_get() - Increments struct page->count, i.e. the page's reference count. Wrapper around get_page(). -
page_cache_read() - Adds a page corresponding to a specified offset and file to the page cache if it's not already there. If necessary, the page will be read from disk via struct address_space_operations. -
page_cache_release() - Alias for __free_page() - reference count is decremented and if it drops to 0, the page is freed.
-
Pages in the page cache need to be located quickly. Pages are inserted into the page_hash_table hash table, and struct page->next_hash and ->pprev_hash are used to handle collisions. -
The table is declared as follows in mm/filemap.c:
atomic_t page_cache_size = ATOMIC_INIT(0);
unsigned int page_hash_bits;
struct page **page_hash_table;
- The table is allocated during system initialisation via page_cache_init():
void __init page_cache_init(unsigned long mempages)
{
        unsigned long htable_size, order;

        htable_size = mempages;
        htable_size *= sizeof(struct page *);
        for(order = 0; (PAGE_SIZE << order) < htable_size; order++)
                ;

        do {
                unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *);

                page_hash_bits = 0;
                while((tmp >>= 1UL) != 0UL)
                        page_hash_bits++;

                page_hash_table = (struct page **)
                        __get_free_pages(GFP_ATOMIC, order);
        } while(page_hash_table == NULL && --order > 0);

        printk("Page-cache hash table entries: %d (order: %ld, %ld bytes)\n",
               (1 << page_hash_bits), order, (PAGE_SIZE << order));

        if (!page_hash_table)
                panic("Failed to allocate page hash table\n");

        memset((void *)page_hash_table, 0, PAGE_HASH_SIZE * sizeof(struct page *));
}
- This takes mempages, the number of physical pages in the system, as a parameter and uses it to determine the hash table's size, htable_size:
htable_size = mempages;
htable_size *= sizeof(struct page *);
-
This is sufficient to hold pointers to every struct page in the system.
-
The system then determines the smallest order such that PAGE_SIZE * 2^order >= htable_size (roughly equivalent to order = ceil(lg(htable_size / PAGE_SIZE))):
for(order = 0; (PAGE_SIZE << order) < htable_size; order++)
;
-
The pages are allocated via __get_free_pages() if possible, trying lower orders if not and panicking if unable to allocate anything.
-
Next the function determines page_hash_bits, the number of bits to use in the hashing function _page_hashfn():
unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *);
page_hash_bits = 0;
while((tmp >>= 1UL) != 0UL)
page_hash_bits++;
-
This is equivalent to page_hash_bits = lg(PAGE_SIZE * 2^order / sizeof(struct page *)), which renders the table a power-of-two hash table, negating the need to use a modulus (a common choice for hashing functions.) -
Finally, the hash table is zeroed.
-
The 'inode queue' is part of the struct address_space introduced in 4.4.2.
-
struct address_space contains 3 lists associated with the inode - clean_pages, dirty_pages, and locked_pages - dirty pages are ones that have been changed since the last sync to disk and locked pages are unsurprisingly ones that are locked :) -
These three lists in combination are considered to be the inode queue for a given mapping, and the multi-purpose struct page->list field is used to link pages on it. -
Pages are added to the inode queue via add_page_to_inode_queue() and removed via remove_page_from_inode_queue().
-
Pages read from a file or block device are generally added to the page cache to avoid further disk I/O. Most filesystems use the high-level generic_file_read() as their file_operations->read(). -
generic_file_read() does the following:
-
Performs a few sanity checks then, if direct I/O, hands over to generic_file_direct_IO(), otherwise calls do_generic_file_read() to do the heavy lifting (we don't consider direct I/O here.)
-
Searches the page cache by calling __find_page_nolock() with the pagecache_lock held to see if the page already exists in it.
-
If the page does not already exist, a new page is allocated via page_cache_alloc() and added to the page cache via __add_to_page_cache().
-
After a page frame is present in the page cache, generic_file_readahead() is called, which uses page_cache_read() to read pages from disk via mapping->a_ops->readpage(), where mapping is the struct address_space managing the file.
-
Anonymous pages are added to the swap cache when they are unmapped from a process (see 11.4). They have no struct address_space acting as a mapping, nor any offset within a file, which leaves nothing to hash them into the page cache with. -
These anonymous pages will still exist on the LRU lists, however. Once in the swap cache, the only real difference between anonymous pages and file-backed pages is that anonymous pages use swapper_space as their struct address_space. -
Shared memory pages are added when:
-
shmem_getpage() is called when a page has to be either fetched from swap or allocated because it is the first reference.
-
The swapout code calls shmem_unuse() when a swap area is being deactivated and a page backed by swap space is found that does not appear to belong to any process. In this case the inodes related to shared memory are exhaustively searched until the correct page is found.
- In both cases the page is added with add_to_page_cache().
-
As discussed in 10.1, active_list and inactive_list are declared in mm/page_alloc.c and are protected by pagemap_lru_lock (a wrapper around pagemap_lru_lock_cacheline.) The active and inactive lists roughly represent 'hot' and 'cold' pages in the system, respectively. -
In other words, active_list contains the working sets of all processes and inactive_list contains reclaim candidates. -
Taking a look at the LRU list API:
-
lru_cache_add() - Adds a 'cold' page to the inactive_list. It will be moved to the active_list with a call to mark_page_accessed() if the page is known to be 'hot', such as when a page is faulted in. -
lru_cache_del() - Removes a page from the LRU lists by calling one of either del_page_from_active_list() or del_page_from_inactive_list() as appropriate.
-
mark_page_accessed() - Marks that the specified page has been accessed. If it was not recently referenced (i.e. it is in the inactive_list and the PG_referenced flag is not set), the referenced flag is set. If the page is referenced a second time, activate_page() is called, which marks the page 'hot', and the referenced flag is cleared. -
activate_page() - Removes a page from the inactive_list and places it on the active_list. It's rarely called directly because the caller has to know the page is on the inactive_list. In most cases mark_page_accessed() should be used instead.
-
When caches are shrunk, pages are moved from the active_list to the inactive_list via refill_inactive(). -
refill_inactive() is parameterised by the number of pages to move (nr_pages), which is calculated in shrink_caches() as:
pages = nr_pages * nr_active_pages / (2 * (nr_inactive_pages + 1))
-
This keeps the active_list about two-thirds the size of the total page cache, and the number of pages to move is determined as a ratio scaled by the number of pages we want to swap out (nr_pages.) -
Pages are taken from the end of the active_list, and if the PG_referenced flag is set it is cleared and the page is put back at the top of the active_list because it has been recently used and is still 'hot' (sometimes referred to as 'rotating the list'.) -
If the PG_referenced flag is not set when checked, the page is moved to the inactive_list and the PG_referenced flag is set so it will be quickly promoted back to the active_list if necessary.
-
shrink_cache() is the part of the replacement algorithm that takes pages from the inactive_list and determines how they should be swapped out. -
shrink_cache() is parameterised by nr_pages and priority, with nr_pages starting as SWAP_CLUSTER_MAX (set at 32) and priority starting as DEF_PRIORITY (set at 6.) -
The local variables max_scan and max_mapped determine how much work the function will do and are affected by the priority parameter. -
Each time shrink_caches() is called and enough pages are not freed, the priority is decreased (i.e. made higher priority) until priority 1 is reached.
-
max_scan simply represents the maximum number of pages that will be scanned by the function and is set as max_scan = nr_inactive_pages/priority. This means that at the lowest priority of 6, at most one-sixth of the pages in the inactive_list will be scanned, and at the highest priority all of them will be. -
max_mapped determines how many process pages are allowed to exist in the page cache before whole processes will be swapped out. It is calculated as the minimum of either 0.1 * max_scan or nr_pages * 2^(10 - priority). -
This means that at the lowest priority the maximum number of mapped pages allowed is either one-tenth of max_scan or 16 (16 = 2^(10-6) = 2^4) times the number of pages to swap out, whichever is smaller. -
At the highest priority the maximum number of mapped pages is either one-tenth of max_scan or 512 (512 = 2^(10-1) = 2^9) times the number of pages to swap out, whichever is smaller. -
From this point on, shrink_cache() is essentially a large for-loop that scans at most max_scan pages to free up nr_pages pages from the end of the inactive_list, or until the inactive_list is empty. -
After each page is released, shrink_cache() checks whether it should reschedule itself so the CPU is not monopolised. -
For each type of page found on the list, shrink_cache() behaves differently depending on the state of the page:
-
Page is mapped by a process - Jump to the page_mapped label; decrement max_mapped unless it's reached 0, at which point the page tables of processes are linearly scanned and pages swapped out via swap_out(). -
Page is locked and the PG_launder bit is set - The page is locked for I/O so it could be skipped; however, PG_launder implies this is the 2nd time the page has been found locked, so it's better to wait until the I/O completes and get rid of it. A reference to the page is taken via page_cache_get() so the page won't be freed prematurely, and wait_on_page() is called, which sleeps until the I/O is complete. Once it's completed, the reference count is decremented via page_cache_release(), and when this count reaches 0 the page will be reclaimed. -
Page is dirty, unmapped by all processes, has no buffers and belongs to a device or file mapping - Since the page belongs to a file or device mapping, it has a page->mapping->a_ops->writepage() function. PG_dirty is cleared and PG_launder is set because I/O is about to begin. A reference to the page is taken via page_cache_get() before calling its writepage() function to synchronise it with the backing file, then dropped via page_cache_release(). Note that anonymous pages that are part of the swap cache will also get synchronised because they use swapper_space as their page->mapping. The page remains on the LRU - when it's found again it'll be freed if the I/O has completed; if not, the kernel will wait for it to complete (see case 2.) -
Page has buffers associated with data on disk - A reference is taken to the page and an attempt is made to free it via try_to_release_page(). If this succeeds and the page is anonymous (i.e. page->mapping is NULL), the page is removed from the LRU and page_cache_release() is called to decrement the usage count. There is one case where an anonymous page has associated buffers - when it's backed by a swap file, because the page needs to be written out in block-sized chunks. In other cases, where the page is backed by a file or device, the reference is simply dropped and the page will be freed as usual when its reference count reaches 0. -
Page is anonymous and is mapped by more than one process - The LRU lock is released and the page is unlocked before dropping into the same page_mapped label as case 1. Hence max_mapped is decremented and swap_out() is called if and when max_mapped reaches 0. -
Page has no process referencing it - This is the fall-through case - If the page is in the swap cache, it is removed because it is now synchronised with the backing storage and has no process referencing it. If it was part of a file, it's removed from the inode queue, deleted from the page cache and freed.
- shrink_caches() is a simple function which takes a few steps to free up some memory. It is invoked by try_to_free_pages_zone():
int try_to_free_pages_zone(zone_t *classzone, unsigned int gfp_mask)
{
        int priority = DEF_PRIORITY;
        int nr_pages = SWAP_CLUSTER_MAX;

        gfp_mask = pf_gfp_mask(gfp_mask);
        do {
                nr_pages = shrink_caches(classzone, priority, gfp_mask, nr_pages);
                if (nr_pages <= 0)
                        return 1;
        } while (--priority);

        /*
         * Hmm.. Cache shrink failed - time to kill something?
         * Mhwahahhaha! This is the part I really like. Giggle.
         */
        out_of_memory();
        return 0;
}
-
nr_pages is initialised to SWAP_CLUSTER_MAX (32) as an upper bound. This restriction is in place so that if kswapd schedules a large number of pages to be written to disk it'll sleep occasionally to allow the I/O to take place. -
As pages are freed, nr_pages is decremented to keep count. -
The amount of work that will be performed also depends on the priority, which is initialised here to DEF_PRIORITY (6) and is decremented after each pass that does not free up enough pages, down to a maximum of priority 1. -
Taking a look at shrink_caches():
static int shrink_caches(zone_t * classzone, int priority, unsigned int gfp_mask, int nr_pages)
{
        int chunk_size = nr_pages;
        unsigned long ratio;

        nr_pages -= kmem_cache_reap(gfp_mask);
        if (nr_pages <= 0)
                return 0;

        nr_pages = chunk_size;
        /* try to keep the active list 2/3 of the size of the cache */
        ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
        refill_inactive(ratio);

        nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
        if (nr_pages <= 0)
                return 0;

        shrink_dcache_memory(priority, gfp_mask);
        shrink_icache_memory(priority, gfp_mask);
#ifdef CONFIG_QUOTA
        shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
#endif

        return nr_pages;
}
-
This first calls kmem_cache_reap() (see 8.1.7), which selects a slab cache to shrink. If nr_pages pages are freed, the function is done. Otherwise, it tries to free pages from other caches. -
If other caches are affected, refill_inactive() will move pages from the active_list to the inactive_list before shrinking the page cache by reclaiming pages at the end of the inactive_list with shrink_cache(). -
Finally if more pages need to be freed it will shrink 3 special caches - the dcache, icache and dqcache via shrink_dcache_memory(), shrink_icache_memory() and shrink_dqcache_memory() respectively.
-
The objects in these special caches are relatively small themselves, but a cascading effect allows a lot more pages to be freed in terms of buffer and disk caches.
-
When max_mapped pages have been found in the page cache by shrink_cache(), swap_out() is called to swap out process pages. Starting from the struct mm_struct pointed to by swap_mm and the address mm->swap_address, page tables are searched forward until nr_pages (a local variable defaulting to SWAP_CLUSTER_MAX) pages have been freed. -
All process-mapped pages are examined regardless of where they are in the lists or when they were last referenced, but pages that are part of the active_list or have recently been referenced will be skipped.
-
The examination of 'hot' pages is costly, but insignificant compared to linearly searching all processes for the PTEs that reference a particular struct page.
-
After it's been decided to swap out pages from a process, an attempt will be made to swap out at least SWAP_CLUSTER_MAX pages, examining the list of struct mm_structs once only to avoid constant looping when no pages are available.
-
Additionally, writing out the pages in bulk increases the chances that pages close together in the process address space will be written out to adjacent slots on the disk.
-
swap_mm is initialised to point to init_mm and its swap_address field is set to 0 the first time it's used. -
A task has been fully searched when its struct mm_struct->swap_address is equal to TASK_SIZE. -
After a task has been selected to swap pages from, the reference count of its mm_struct is incremented so that it will not be freed early, and swap_out_mm() is called with the chosen mm_struct as a parameter. -
swap_out_mm() walks each of the process's VMAs and calls swap_out_vma() for each. This is done to avoid having to walk the entire page table, which will be largely sparse. -
swap_out_vma() calls swap_out_pgd(), which calls swap_out_pmd(), which calls try_to_swap_out() in turn. -
try_to_swap_out() does the following:
-
Ensures the page is not part of the active_list, has not been recently referenced, and does not belong to a zone we aren't interested in.
-
Removes the page from the process page tables.
-
Examines the newly removed PTE to see whether it's dirty - if so, the struct page flags will be updated such that it will get synchronised with the backing storage.
-
If the page is already part of the swap cache, the RSS is simply updated and the reference to the page is dropped. Otherwise it is added to the swap cache (chapter 11 goes into more detail about how pages are added to the swap cache and synchronised with backing storage.)
-
During system startup, a kernel thread called kswapd is started via kswapd_init(), which continuously executes kswapd(), which is usually asleep like a cheeky kitteh cat. -
The kswapd daemon is responsible for reclaiming pages when memory is running low. Historically it was woken every 10 seconds, but now it is woken by the physical page allocator when a zone's free pages count reaches its pages_low field (see 2.2.1.) -
It is the kswapd daemon that performs most of the tasks needed to maintain the page cache correctly - shrinking slab caches and swapping out processes when necessary. -
To avoid being woken up a great deal under high memory pressure, kswapd keeps freeing pages until the zone's pages_high watermark is reached. -
Under extreme memory pressure, processes will do the work of kswapd synchronously by calling balance_classzone(), which calls try_to_free_pages_zone(), which does kswapd's work. -
When kswapd is woken up, it does the following:
-
Calls kswapd_can_sleep(), which checks all zones' need_balance fields - if any are set, it cannot sleep. -
If it can't sleep, it's removed from the kswapd_wait wait queue.
-
Calls kswapd_balance() to cycle through all zones and free pages via try_to_free_pages_zone() if need_balance is set, freeing until the pages_high watermark is reached. -
The task queue for tq_disk is run so that queued pages will be written out.
-
Adds kswapd back to the kswapd_wait wait queue and loops back to step 1.