-
A running system will eventually use all available page frames for disk buffers, struct dentry and struct inode objects, process pages, and so on.
-
Linux needs to select old pages that can be freed and invalidated for new uses before physical memory is exhausted - this chapter focuses on how Linux implements its 'page replacement policy' and how different types of pages are invalidated.
-
The way linux selects pages is rather empirical and contains a mix of ideas moderated by user feedback and benchmarks.
-
The page cache is a cache of all data that is read from disk to reduce the amount of disk I/O required. Strictly speaking this is not directly related to page frame reclamation, however the LRU lists and page cache are closely related.
-
With the exception of the slab allocator, all pages in use by the system are stored on LRU lists and linked together via struct page->lru so they can be easily scanned for replacement. -
The slab pages are not stored on the LRU lists as it's considerably harder to age a page based on the objects used by the slab.
-
Process-mapped pages are not easily swappable because there is no easy way to map struct pages to PTEs except by searching every page table which is very expensive.
-
If the page cache has a large number of process-mapped pages in it, process page tables will be walked and pages will be swapped out via swap_out() until enough pages have been freed.
-
Regardless, swap_out() will have trouble dealing with shared pages - if a page is shared, a swap entry is allocated, the PTE is filled with the necessary information to find the page in swap again and the reference count is decremented. Only when the count reaches 0 will it be freed - pages like this are considered to be in the 'swap cache'. -
Page replacement is performed via the kswapd daemon.
-
The page replacement policy is said to be an LRU-based algorithm, but technically this isn't quite true because the lists are not strictly maintained in LRU order.
-
The LRU in Linux consists of two lists - active_list and inactive_list. The idea is for the active_list to contain the working set of all processes and the inactive_list to contain reclaim candidates. -
Since we store all reclaimable pages in two lists and pages belonging to any process may be reclaimed rather than just those belonging to a faulting process, the replacement policy is global.
-
The lists resemble a simplified LRU 2Q, in which two lists are maintained - Am and A1. A1 is a FIFO queue, and if a page is referenced while on that queue it is placed in Am, which is a normal LRU-managed list. -
The 2Q approach is roughly analogous to using lru_cache_add() to place pages on the inactive_list (A1 in the analogy) and using mark_page_accessed() to move them to the active_list (Am in the analogy.) -
The 2Q algorithm specifies how the sizes of the two lists have to be tuned, but Linux takes a simpler approach by using refill_inactive() to move pages from the bottom of the active_list to the inactive_list to keep the active_list about two-thirds the size of the total page cache. Diagrammatically:
------------------- -------------------
| activate_page() | | lru_cache_add() |
------------------- -------------------
| |
| |
v v
----------------------------- -------------------------------
| add_page_to_active_list() | /->| add_page_to_inactive_list() |
----------------------------- -------------------------------
| | |
| |
v | v
--------------- -----------------
| list head |<------\ | | list head |
--------------------- |-------------| | |---------------|
| refill_inactive() |<-\ Page | | | | | |
--------------------- | removed by | active_list | | | inactive_list |
| refill_ | | | | | |
| | inactive() |-------------| | |---------------|
v \ - - - - - -| list tail | | | | list tail |
----------------------- --------------- | -----------------
| Move nr_pages pages | | | |
| from active_list to | | |
| inactive_list | | | v
----------------------- | ----------------
| active_list | | | page reclaim |
/------------>| "rotates" | ----------------
| v | |
| /---------------------\ no /-------------------\ |
| / nr_pages != 0 && \------->/ Clear reference bit \---/ |
| \ active_list has pages / \ Was it set? / yes
| \---------------------/ \-------------------/ |
| | yes | no
| | V |
| v --------------
| ------------------ | nr_pages-- | |
| | nr_pages moved | --------------
| | Job complete | | |
| ------------------ |
| v |
| ------------------------------- page moves to
| | del_page_from_active_list() | | inactive_list
\-------------------------------| Set referenced bit |-/
| add_page_to_inactive_list() |
-------------------------------
-
The list described for 2Q presumes Am is an LRU list, but the Linux equivalent (active_list) is more like a clock algorithm in which the 'handspread' is the size of the active list. -
When pages reach the bottom of the active_list the referenced flag is checked. If it's set, the page is moved back to the top of the list and the next page is examined. If it's not set, the page is moved to the inactive_list. In either case, the flag is cleared. -
The 'move-to-front' heuristic means that the lists behave in an LRU-like manner, but there are just too many differences between the linux replacement policy and LRU to consider it a 'stack' algorithm.
-
Other differences are that the lists are not strictly ordered by priority, because that would require a list update with every reference, and the lists are all but ignored when paging out from processes, because page-out is decided based on the location of pages in the virtual address space of the process rather than their location within the page lists.
-
To summarise, the algorithm exhibits LRU-like behaviour and has been shown by benchmarks to perform well in practice.
-
There are two cases where the algorithm performs badly:
-
When candidates for reclamation are mostly anonymous pages - in this case linux will keep on examining a large number of pages before linearly scanning process page tables searching for the pages to reclaim. Thankfully this is rare.
-
When a single process has a large number of file-backed resident pages in the inactive_list that are being written to frequently. In this case processes and kswapd may go into a loop of constantly 'laundering' these pages and placing them at the top of the inactive_list without freeing anything - in this case few pages are moved from the active_list to the inactive_list.
- The page cache is a set of data structures that contain pages that are backed by ordinary files, block devices, or swap. There are basically 4 types of page that exist in the cache:
-
Pages that were 'faulted in' as a result of reading a memory-mapped file.
-
Pages read from a block device or filesystem - these are special pages known as 'buffer pages'. The number of blocks that may fit there depends on the size of the block and the page size of the architecture.
-
Anonymous pages that live in the 'swap cache', a special part of the page cache, when slots are allocated in the backing storage for page-out. Discussed further in chapter 11.
-
Pages belonging to shared memory regions are treated similarly to anonymous pages - the only difference is that shared pages are added to the swap cache and space is reserved in backing storage immediately after the first write to the page.
-
The main reason for the existence of the page cache is to eliminate unnecessary disk reads. Pages read from disk are stored in a page hash table, which is hashed on the struct address_space and the offset. The hash is always searched before resorting to a disk read.
-
Let's take a look at the page cache API:
-
add_to_page_cache() - Adds a page to the LRU via lru_cache_add() in addition to adding it to the inode queue and page hash tables.
-
add_to_page_cache_unique() - Same as add_to_page_cache() except it checks to make sure the page isn't already there. Required when the caller doesn't hold the pagecache_lock. -
remove_inode_page() - Removes a page from the inode and hash queues via remove_page_from_inode_queue() and remove_page_from_hash_queue() which effectively removes the page from the page cache altogether.
-
page_cache_alloc() - Wrapper around alloc_pages() that uses the struct address_space's gfp_mask field as the GFP mask. -
page_cache_get() - Increments struct page->count, i.e. the page's reference count. Wrapper around get_page(). -
page_cache_read() - Adds a page corresponding to a specified offset and file to the page cache if it's not already there. If necessary, the page will be read from disk via struct address_space_operations. -
page_cache_release() - Alias for __free_page() - reference count is decremented and if it drops to 0, the page is freed.
-
Pages in the page cache need to be located quickly. Pages are inserted into the page_hash_table hash table, and struct page->next_hash and ->pprev_hash are used to handle collisions. -
The table is declared as follows in mm/filemap.c:
atomic_t page_cache_size = ATOMIC_INIT(0);
unsigned int page_hash_bits;
struct page **page_hash_table;
- The table is allocated during system initialisation via page_cache_init():
void __init page_cache_init(unsigned long mempages)
{
        unsigned long htable_size, order;

        htable_size = mempages;
        htable_size *= sizeof(struct page *);
        for(order = 0; (PAGE_SIZE << order) < htable_size; order++)
                ;

        do {
                unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *);

                page_hash_bits = 0;
                while((tmp >>= 1UL) != 0UL)
                        page_hash_bits++;

                page_hash_table = (struct page **)
                        __get_free_pages(GFP_ATOMIC, order);
        } while(page_hash_table == NULL && --order > 0);

        printk("Page-cache hash table entries: %d (order: %ld, %ld bytes)\n",
               (1 << page_hash_bits), order, (PAGE_SIZE << order));

        if (!page_hash_table)
                panic("Failed to allocate page hash table\n");

        memset((void *)page_hash_table, 0, PAGE_HASH_SIZE * sizeof(struct page *));
}
- This takes mempages, the number of physical pages in the system, as a parameter and uses it to determine the hash table's size, htable_size:
htable_size = mempages;
htable_size *= sizeof(struct page *);
-
This is sufficient to hold pointers to every struct page in the system.
-
The system then determines the smallest order such that PAGE_SIZE * 2^order >= htable_size (roughly equivalent to order = ceil(lg(htable_size / PAGE_SIZE))):
for(order = 0; (PAGE_SIZE << order) < htable_size; order++)
;
-
The pages are allocated via __get_free_pages() if possible, trying lower orders if not and panicking if unable to allocate anything.
-
Next the function determines page_hash_bits, the number of bits to use in the hashing function _page_hashfn():
unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *);
page_hash_bits = 0;
while((tmp >>= 1UL) != 0UL)
page_hash_bits++;
-
This is equivalent to page_hash_bits = lg(PAGE_SIZE * 2^order / sizeof(struct page *)), which renders the table a power-of-two hash table, negating the need to use a modulus (a common choice for hashing functions.) -
Finally, the hash table is zeroed.
-
The 'inode queue' is part of the struct address_space introduced in 4.4.2.
-
struct address_space contains 3 lists associated with the inode - clean_pages, dirty_pages, and locked_pages - dirty pages are ones that have been changed since the last sync to disk and locked pages are unsurprisingly ones that are locked :) -
These three lists in combination are considered to be the inode queue for a given mapping, and the multi-purpose struct page->list field is used to link pages on it. -
Pages are added to the inode queue via add_page_to_inode_queue() and removed via remove_page_from_inode_queue().
-
Pages read from a file or block device are generally added to the page cache to avoid further disk I/O. Most filesystems use the high-level generic_file_read() as their file_operations->read(). -
generic_file_read() does the following:
-
Performs a few sanity checks then, if direct I/O, hands over to generic_file_direct_IO(), otherwise calls do_generic_file_read() to do the heavy lifting (we don't consider direct I/O here.)
-
Searches the page cache by calling __find_page_nolock() with the pagecache_lock held to see if the page already exists in it.
-
If the page does not already exist, a new page is allocated via page_cache_alloc() and added to the page cache via __add_to_page_cache().
-
After a page frame is present in the page cache, generic_file_readahead() is called, which uses page_cache_read() to read pages from disk via mapping->a_ops->readpage(), where mapping is the struct address_space managing the file.
-
Anonymous pages are added to the swap cache when they are unmapped from a process (see 11.4). They have no struct address_space acting as a mapping, nor any offset within a file, which leaves nothing to hash them into the page cache with. -
These anonymous pages will still exist on the LRU lists, however. Once in the swap cache, the only real difference between anonymous pages and file-backed pages is that anonymous pages use swapper_space as their struct address_space. -
Shared memory pages are added when:
-
shmem_getpage() is called when a page has to be either fetched from swap or allocated because it is the first reference.
-
The swapout code calls shmem_unuse() when a swap area is being deactivated and a page backed by swap space is found that does not appear to belong to any process. In this case the inodes related to shared memory are exhaustively searched until the correct page is found.
- In both cases the page is added with add_to_page_cache().
-
As discussed in 10.1, active_list and inactive_list are declared in mm/page_alloc.c and are protected by pagemap_lru_lock (a wrapper around pagemap_lru_lock_cacheline.) The active and inactive lists roughly represent 'hot' and 'cold' pages in the system, respectively. -
In other words, active_list contains the working sets of all processes and inactive_list contains reclaim candidates. -
Taking a look at the LRU list API:
-
lru_cache_add() - Adds a 'cold' page to the inactive_list. It will be moved to the active_list with a call to mark_page_accessed() if the page is known to be 'hot', such as when a page is faulted in. -
lru_cache_del() - Removes a page from the LRU lists by calling one of either del_page_from_active_list() or del_page_from_inactive_list() as appropriate.
-
mark_page_accessed() - Marks that the specified page has been accessed. If it was not recently referenced (i.e. it is in the inactive_list and the PG_referenced flag is not set), the referenced flag is set. If the page is referenced a second time, activate_page() is called, which marks the page 'hot', and the referenced flag is cleared. -
activate_page() - Removes a page from the inactive_list and places it on the active_list. It's rarely called directly because the caller has to know the page is on the inactive_list. In most cases mark_page_accessed() should be used instead.
-
When caches are shrunk, pages are moved from the active_list to the inactive_list via refill_inactive(). -
refill_inactive() is parameterised by the number of pages to move (nr_pages), which is calculated in shrink_caches() as:
pages = nr_pages * nr_active_pages / (2 * (nr_inactive_pages + 1))
-
This keeps the active_list about two-thirds the size of the total page cache, and the number of pages to move is determined as a ratio scaled by the number of pages we want to swap out (nr_pages.) -
Pages are taken from the end of the active_list, and if the PG_referenced flag is set it is cleared and the page is put back at the top of the active_list because it has been recently used and is still 'hot' (sometimes referred to as 'rotating the list'.) -
If the PG_referenced flag is not set when checked, the page is moved to the inactive_list and the PG_referenced flag is set so it will be quickly promoted back to the active_list if necessary.
-
shrink_cache() is the part of the replacement algorithm that takes pages from the inactive_list and determines how they should be swapped out. -
shrink_cache() is parameterised by nr_pages and priority, with nr_pages starting as SWAP_CLUSTER_MAX (set at 32) and priority starting as DEF_PRIORITY (set at 6.) -
The local variables max_scan and max_mapped determine how much work the function will do and are affected by the priority parameter. -
Each time shrink_caches() is called and enough pages are not freed, the priority is decreased (i.e. made higher priority) until priority 1 is reached.
-
max_scan simply represents the maximum number of pages that will be scanned by the function and is set as max_scan = nr_inactive_pages/priority. This means that at the lowest priority of 6, at most one-sixth of the pages in the inactive_list will be scanned, and at the highest priority all of them will be. -
max_mapped determines how many process pages are allowed to exist in the page cache before whole processes will be swapped out. It is calculated as the minimum of either 0.1 * max_scan or nr_pages * 2^(10 - priority). -
This means that at the lowest priority the maximum number of mapped pages allowed is either one-tenth of max_scan or 16 (16 = 2^(10-6) = 2^4) times the number of pages to swap out, whichever is smaller. -
At the highest priority the maximum number of mapped pages is either one-tenth of max_scan or 512 (512 = 2^(10-1) = 2^9) times the number of pages to swap out, whichever is smaller. -
From this point on, shrink_cache() is essentially a large for-loop that scans at most max_scan pages to free up nr_pages pages from the end of the inactive_list, or until the inactive_list is empty. -
After each page is released, shrink_cache() checks whether it should reschedule itself so the CPU is not monopolised. -
For each type of page found on the list, shrink_cache() behaves differently depending on the state of the page:
-
Page is mapped by a process - Jump to the page_mapped label; decrement max_mapped unless it's reached 0, at which point the page tables of processes are linearly scanned and pages swapped out via swap_out(). -
Page is locked and the PG_launder bit is set - The page is locked for I/O so it could be skipped; however, PG_launder implies this is the 2nd time the page has been found locked, so it's better to wait until the I/O completes and get rid of it. A reference to the page is taken via page_cache_get() so the page won't be freed prematurely, and wait_on_page() is called, which sleeps until the I/O is complete. Once it's completed, the reference count is decremented via page_cache_release(), and when this count reaches 0 the page will be reclaimed. -
Page is dirty, unmapped by all processes, has no buffers and belongs to a device or file mapping - Since the page belongs to a file or device mapping, it has a page->mapping->a_ops->writepage() function. PG_dirty is cleared and PG_launder is set because I/O is about to begin. A reference to the page is taken via page_cache_get() before calling its writepage() function to synchronise it with the backing file, then dropped via page_cache_release(). Note that anonymous pages that are part of the swap cache will also get synchronised because they use swapper_space as their page->mapping. The page remains on the LRU - when it's found again it'll be freed if the I/O has completed; if not, the kernel will wait for it to complete (see case 2.) -
Page has buffers associated with data on disk - A reference is taken to the page and an attempt is made to free it via try_to_release_page(). If this succeeds and the page is anonymous (i.e. page->mapping is NULL), the page is removed from the LRU and page_cache_release() is called to decrement the usage count. There is one case where an anonymous page has associated buffers - when it's backed by a swap file, because the page needs to be written out in block-sized chunks. In other cases, where the page is backed by a file or device, the reference is simply dropped and the page will be freed as usual when its reference count reaches 0. -
Page is anonymous and is mapped by more than one process - The LRU lock is released and the page is unlocked before dropping into the same page_mapped label as case 1. Hence max_mapped is decremented and swap_out() is called if and when max_mapped reaches 0. -
Page has no process referencing it - This is the fall-through case - If the page is in the swap cache, it is removed because it is now synchronised with the backing storage and has no process referencing it. If it was part of a file, it's removed from the inode queue, deleted from the page cache and freed.
- shrink_caches() is a simple function which takes a few steps to free up some memory. It is invoked by try_to_free_pages_zone():
int try_to_free_pages_zone(zone_t *classzone, unsigned int gfp_mask)
{
        int priority = DEF_PRIORITY;
        int nr_pages = SWAP_CLUSTER_MAX;

        gfp_mask = pf_gfp_mask(gfp_mask);
        do {
                nr_pages = shrink_caches(classzone, priority, gfp_mask, nr_pages);
                if (nr_pages <= 0)
                        return 1;
        } while (--priority);

        /*
         * Hmm.. Cache shrink failed - time to kill something?
         * Mhwahahhaha! This is the part I really like. Giggle.
         */
        out_of_memory();
        return 0;
}
-
nr_pages is initialised to SWAP_CLUSTER_MAX (32) as an upper bound. This restriction is in place so that if kswapd schedules a large number of pages to be written to disk it'll sleep occasionally to allow the I/O to take place. -
As pages are freed, nr_pages is decremented to keep count. -
The amount of work that will be performed also depends on the priority, which is initialised here to DEF_PRIORITY (6) and is decremented after each pass that does not free up enough pages, down to a maximum of priority 1. -
Taking a look at shrink_caches():
static int shrink_caches(zone_t * classzone, int priority, unsigned int gfp_mask, int nr_pages)
{
        int chunk_size = nr_pages;
        unsigned long ratio;

        nr_pages -= kmem_cache_reap(gfp_mask);
        if (nr_pages <= 0)
                return 0;

        nr_pages = chunk_size;
        /* try to keep the active list 2/3 of the size of the cache */
        ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
        refill_inactive(ratio);

        nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
        if (nr_pages <= 0)
                return 0;

        shrink_dcache_memory(priority, gfp_mask);
        shrink_icache_memory(priority, gfp_mask);
#ifdef CONFIG_QUOTA
        shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
#endif

        return nr_pages;
}
-
This first calls kmem_cache_reap() (see 8.1.7), which selects a slab cache to shrink. If nr_pages pages are freed, the function is done. Otherwise, it tries to free pages from other caches. -
If other caches are affected, refill_inactive() will move pages from the active_list to the inactive_list before shrinking the page cache by reclaiming pages at the end of the inactive_list with shrink_cache(). -
Finally if more pages need to be freed it will shrink 3 special caches - the dcache, icache and dqcache via shrink_dcache_memory(), shrink_icache_memory() and shrink_dqcache_memory() respectively.
-
The objects in these special caches are relatively small themselves, but a cascading effect allows a lot more pages to be freed in terms of buffer and disk caches.
-
When max_mapped pages have been found in the page cache by shrink_cache(), swap_out() is called to swap out process pages. Starting from the struct mm_struct pointed to by swap_mm and the address mm->swap_address, page tables are searched forward until nr_pages (a local variable defaulting to SWAP_CLUSTER_MAX) pages have been freed. -
All process-mapped pages are examined regardless of where they are in the lists or when they were last referenced, but pages that are part of the active_list or have recently been referenced will be skipped.
-
The examination of 'hot' pages is costly, but insignificant compared to linearly searching all processes for the PTEs that reference a particular struct page.
-
After it's been decided to swap out pages from a process, an attempt will be made to swap out at least SWAP_CLUSTER_MAX pages, examining the list of struct mm_structs once only to avoid constant looping when no pages are available.
-
Additionally, writing out the pages in bulk increases the chances that pages close together in the process address space will be written out to adjacent slots on the disk.
-
swap_mm is initialised to point to init_mm and its swap_address field is set to 0 the first time it's used. -
A task has been fully searched when its struct mm_struct->swap_address is equal to TASK_SIZE. -
After a task has been selected to swap pages from, the reference count of its mm_struct is incremented so that it will not be freed early, and swap_out_mm() is called with the chosen mm_struct as a parameter. -
swap_out_mm() walks each of the process's VMAs and calls swap_out_vma() for each. This is done to avoid having to walk the entire page table, which will be largely sparse. -
swap_out_vma() calls swap_out_pgd(), which calls swap_out_pmd(), which calls try_to_swap_out() in turn. -
try_to_swap_out() does the following:
-
Ensures the page is not part of the active_list, has not been recently referenced, and does not belong to a zone we aren't interested in.
-
Removes the page from the process page tables.
-
Examines the newly removed PTE to see whether it's dirty - if so, the struct page flags will be updated such that it will get synchronised with the backing storage.
-
If the page is already part of the swap cache, the RSS is simply updated and the reference to the page is dropped. Otherwise it is added to the swap cache (chapter 11 goes into more detail about how pages are added to the swap cache and synchronised with backing storage.)
-
During system startup, a kernel thread called kswapd is started via kswapd_init(), which continuously executes kswapd(), which is usually asleep like a cheeky kitteh cat. -
The kswapd daemon is responsible for reclaiming pages when memory is running low. Historically it was woken every 10 seconds, but now it is woken by the physical page allocator when a zone's free pages count reaches its pages_low field (see 2.2.1.) -
It is the kswapd daemon that performs most of the tasks needed to maintain the page cache correctly - shrinking slab caches and swapping out processes when necessary. -
To avoid being woken up a great deal under high memory pressure, kswapd keeps freeing pages until the zone's pages_high watermark is reached. -
Under extreme memory pressure, processes will do the work of kswapd synchronously by calling balance_classzone(), which calls try_to_free_pages_zone(), which does kswapd's work. -
When kswapd is woken up, it does the following:
-
Calls kswapd_can_sleep(), which checks all zones' need_balance fields - if any are set, it cannot sleep. -
If it can't sleep, it's removed from the kswapd_wait wait queue.
-
Calls kswapd_balance() to cycle through all zones and free pages via try_to_free_pages_zone() if need_balance is set, freeing until the pages_high watermark is reached. -
The task queue for tq_disk is run so that queued pages will be written out.
-
Adds kswapd back to the kswapd_wait wait queue and loops back to step 1.