-
Just as linux uses free memory for buffering data from a disk, the reverse is true as well - eventually there's a need to free up private or anonymous pages used by a process. The pages are copied to backing storage, sometimes called the 'swap area'.
-
Strictly speaking, linux doesn't swap because 'swapping' refers to copying an entire process address space to disk and 'paging' to copying out individual pages, however it's referred to as 'swapping' in discussion and documentation so we'll call it swapping regardless.
-
There are 2 principal reasons for the existence of swap space:
-
It expands the amount of memory a process may use - virtual memory and swap space allows a large process to run even if the process is only partially resident. Because old pages may be swapped out, the amount of memory addressed may easily exceed RAM because demand paging will ensure the pages are reloaded if necessary.
-
Even if there is sufficient memory, swap is useful - a significant number of pages referenced by a process early in its life may only be used for initialisation and then never again. It's better to swap out those pages and use the freed page frames for disk buffers than to leave them resident and unused.
-
Swap is slow. Very slow, as disks are slow relative to memory - if processes frequently address a large amount of memory, no amount of swap or fast disks will make them run within a reasonable time; only more RAM can help.
-
It's very important that the correct page be swapped out (as discussed in chapter 10), and also that related pages be stored close together in the swap space so they are likely to be swapped in at the same time while reading ahead.
- Each active swap area has a struct swap_info_struct describing the area:
/*
* The in-memory structure used to track swap areas.
*/
struct swap_info_struct {
unsigned int flags;
kdev_t swap_device;
spinlock_t sdev_lock;
struct dentry * swap_file;
struct vfsmount *swap_vfsmnt;
unsigned short * swap_map;
unsigned int lowest_bit;
unsigned int highest_bit;
unsigned int cluster_next;
unsigned int cluster_nr;
int prio; /* swap priority */
int pages;
unsigned long max;
int next; /* next entry on swap list */
};
-
All the swap_info_structs in the running system are stored in a statically declared array, swap_info, which holds MAX_SWAPFILES (defined as 32) entries - this means that at most 32 swap areas can exist on a running system.
-
Looking at each field:
-
flags - A bit field with 2 possible values - SWP_USED (0b01) and SWP_WRITEOK (0b11). SWP_USED indicates the swap area is currently active, and SWP_WRITEOK that linux is ready to write to the area. SWP_WRITEOK incorporates SWP_USED since being writable implies being active.
-
swap_device - The device corresponding to the partition for this swap area. If the swap area is a file, this is set to NULL.
-
sdev_lock - Spinlock protecting the struct, most pertinently swap_map. It's locked and unlocked via swap_device_lock() and swap_device_unlock().
-
swap_file - The struct dentry for the actual special file that is mounted as a swap area, for example this file may exist in /dev in the case that a partition is mounted. This field is needed to identify the correct swap_info_struct when deactivating a swap area.
-
swap_vfsmnt - The struct vfsmount object corresponding to where the device or file for this swap area is located.
-
swap_map - A large array containing one entry for every page-sized slot in the area. Each entry is a reference count of the number of users of the slot, with the swap cache counting as one user and every PTE that has been paged out to the slot counting as another. If an entry is equal to SWAP_MAP_MAX, the slot is allocated permanently. If it's equal to SWAP_MAP_BAD, the slot will never be used.
-
lowest_bit - Lowest possible free slot available in the swap area, used as the starting point when linearly scanning to reduce the search space - there are definitely no free slots below this mark.
-
highest_bit - Highest possible free slot available. Similar to lowest_bit, there are definitely no free slots above this mark.
-
cluster_next - Offset of the next cluster of blocks to use. The swap area tries to have pages allocated in cluster blocks to increase the chance related pages will be stored together.
-
cluster_nr - Number of pages left to allocate in this cluster.
-
prio - The 'priority' of the swap area - this determines how likely the area is to be used. By default the priorities are arranged in order of activation, but the sysadmin may also specify one using the -p flag of swapon.
-
pages - Because some slots on the swap file may be unusable, this field stores the number of usable pages in the swap area. This differs from max in that slots marked SWAP_MAP_BAD are not counted.
-
max - Total number of slots in this swap area.
-
next - The index in the swap_info array of the next swap area in the system.
- The areas are not only stored in an array, they are also kept in a struct swap_list_t 'pseudolist', swap_list. swap_list_t is a simple type:
struct swap_list_t {
int head; /* head of priority-ordered swapfile list */
int next; /* swapfile to be used next */
};
-
head is the index of the highest priority swap area in use, and next is the index of the next swap area that should be used.
-
This list enables areas to be looked up in order of priority when necessary while still remaining easily indexable within the swap_info array.
-
Each swap area is divided into a number of page-sized slots on disk (e.g. 4KiB each on i386.)
-
The first slot is always reserved because it contains information about the swap area that must not be overwritten, including the first 1KiB which stores a disk label for the partition that can be retrieved via userspace tools.
-
The remaining space in this initial slot is used for information about the swap area which is filled in when the swap area is created with mkswap. This is represented by union swap_header:
/*
* Magic header for a swap area. The first part of the union is
* what the swap magic looks like for the old (limited to 128MB)
* swap area format, the second part of the union adds - in the
* old reserved area - some extra information. Note that the first
* kilobyte is reserved for boot loader or disk label stuff...
*
* Having the magic at the end of the PAGE_SIZE makes detecting swap
* areas somewhat tricky on machines that support multiple page sizes.
* For 2.5 we'll probably want to move the magic to just beyond the
* bootbits...
*/
union swap_header {
struct
{
char reserved[PAGE_SIZE - 10];
char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */
} magic;
struct
{
char bootbits[1024]; /* Space for disklabel etc. */
unsigned int version;
unsigned int last_page;
unsigned int nr_badpages;
unsigned int padding[125];
unsigned int badpages[1];
} info;
};
- Looking at each of the fields:
-
reserved - Dummy field used to position magic correctly at the end of the page.
-
magic - Contains the magic string identifying a swap header - this is in place to ensure that a partition that is not a swap area will never be used by mistake, and to determine which version of the swap area format is in use - if the string is "SWAP-SPACE", it's version 1 of the swap file format. If it's "SWAPSPACE2", version 2 will be used.
-
bootbits - Reserved area containing information about the partition such as the disk label, retrievable via userspace tools.
-
version - Version of the swap area layout.
-
last_page - Last usable page in the area.
-
nr_badpages - Known number of bad pages that exist in the swap area.
-
padding - Disk sectors are usually 512 bytes in size. version, last_page and nr_badpages take up 12 bytes, so this field takes up the remaining 500 bytes to sector-align badpages.
-
badpages - The remainder of the page is used to store the indices of up to MAX_SWAP_BADPAGES bad page slots. These are filled in by the mkswap userland tool if the -c switch is specified to check the area.
MAX_SWAP_BADPAGES is a compile-time constant that varies if the struct changes, but is 637 entries in its current form, determined by:
MAX_SWAP_BADPAGES = (PAGE_SIZE - <bootblock size = 1024> - <padding size = 512> -
<magic string size = 10>)/sizeof(long)
-
When a page is swapped out, linux uses the corresponding PTE to store enough information to locate the page on disk again. Rather than storing the physical address of the page, the PTE contains this information and has the appropriate flags set to indicate that it is not an address.
-
Clearly, a PTE is not large enough to store precisely where on disk the page is located, but it's more than enough to store an index into the swap_info array and an offset within the swap_map.
-
Each PTE, regardless of architecture, is large enough to store a swp_entry_t:
/*
* A swap entry has to fit into a "unsigned long", as
* the entry is hidden in the "index" field of the
* swapper address space.
*
* We have to move it here, since not every user of fs.h is including
* mm.h, but mm.h is including fs.h via sched .h :-/
*/
typedef struct {
unsigned long val;
} swp_entry_t;
-
pte_to_swp_entry() and swp_entry_to_pte() are used to translate between PTEs and swap entries.
-
As always we'll focus on i386, however all architectures have to be able to determine whether a PTE is present or swapped out. In the swp_entry_t two bits are always kept free - on i386 bit 0 is reserved for the _PAGE_PRESENT flag and bit 7 for _PAGE_PROTNONE (both discussed in 3.2.)
-
Bits 1 to 6 are used for the 'type', which is the index within the swap_info array and is returned via SWP_TYPE().
-
Bits 8 through 31 (i.e. all remaining bits) are used to store the offset within the swap_map from the swp_entry_t. This means there are 24 bits available, limiting each swap area to 64GiB (2^24 * 4096 bytes.)
-
The offset is extracted via SWP_OFFSET(). To encode a type and offset into a swp_entry_t, SWP_ENTRY() is available which does all needed bit-shifting.
-
Looking at the relationship between the various macros diagrammatically:
------- ---------- --------
| PTE | | Offset | | Type |
------- ---------- --------
| | |
| | |
v v v
---------------------- ---------------
| pte_to_swp_entry() | | SWP_ENTRY() | Bits reserved for
---------------------- --------------- _PAGE_PROTNONE _PAGE_PRESENT
| | | |
/---------------------/ | |
| BITS_PER_LONG 8|7 1|0
| ---------------------------------v--------------v-
\------>| Offset | | Type | |
--------------------------------------------------
swp_entry_t | |
| |
v v
---------------- --------------
| SWP_OFFSET() | | SWP_TYPE() |
---------------- --------------
| |
| |
v v
---------- --------
| Offset | | Type |
---------- --------
-
The six bits for type should allow up to 64 (2^6) swap areas to exist in a 32-bit architecture, however MAX_SWAPFILES is set at 32. This is due to the consumption of the
vmalloc
address space (see chapter 7 for more on vmalloc.) -
If a swap area is the maximum possible size, 32MiB is required for the swap_map (2^24 * sizeof(short)) - remembering that each page slot uses one short for the reference count. This means that if MAX_SWAPFILES = 32 maximally-sized swaps exist, 1GiB of vmalloc address space is required, which is simply impossible given the user/kernel linear address space split.
-
You'd think this would mean that supporting 64 swap areas is not worth the additional complexity, but this is not the case - if a system has multiple disks, it's worth having this complexity in order to distribute swap across all disks to allow for maximal parallelism and hence performance.
-
As discussed previously, all page-sized slots are tracked by the struct swap_info_struct->swap_map unsigned short array, each entry of which acts as a reference count (the number of 'users' of the slot, with the swap cache acting as the first 'user' - a shared page can have many users of course.)
-
If the entry is SWAP_MAP_MAX the page is permanently reserved for that slot. This is unlikely but not impossible - it's designed to ensure the reference count does not overflow.
-
If the entry is SWAP_MAP_BAD, the slot is unusable.
-
Finding and allocating a swap entry is divided into two major tasks - the first is performed by get_swap_page():
-
Starting with swap_list->next, search the swap areas for a suitable slot.
-
Once a slot has been found, record what the next swap area to be used will be and return the allocated entry.
- The second major task is the searching itself, which is performed by scan_swap_map(). In principle it's very simple because it linearly scans the array for a free slot, however the implementation is a little more involved than that:
static inline int scan_swap_map(struct swap_info_struct *si)
{
unsigned long offset;
/*
* We try to cluster swap pages by allocating them
* sequentially in swap. Once we've allocated
* SWAPFILE_CLUSTER pages this way, however, we resort to
* first-free allocation, starting a new cluster. This
* prevents us from scattering swap pages all over the entire
* swap partition, so that we reduce overall disk seek times
* between swap pages. -- sct */
if (si->cluster_nr) {
while (si->cluster_next <= si->highest_bit) {
offset = si->cluster_next++;
if (si->swap_map[offset])
continue;
si->cluster_nr--;
goto got_page;
}
}
si->cluster_nr = SWAPFILE_CLUSTER;
/* try to find an empty (even not aligned) cluster. */
offset = si->lowest_bit;
check_next_cluster:
if (offset+SWAPFILE_CLUSTER-1 <= si->highest_bit)
{
int nr;
for (nr = offset; nr < offset+SWAPFILE_CLUSTER; nr++)
if (si->swap_map[nr])
{
offset = nr+1;
goto check_next_cluster;
}
/* We found a completly empty cluster, so start
* using it.
*/
goto got_page;
}
/* No luck, so now go finegrined as usual. -Andrea */
for (offset = si->lowest_bit; offset <= si->highest_bit ; offset++) {
if (si->swap_map[offset])
continue;
si->lowest_bit = offset+1;
got_page:
if (offset == si->lowest_bit)
si->lowest_bit++;
if (offset == si->highest_bit)
si->highest_bit--;
if (si->lowest_bit > si->highest_bit) {
si->lowest_bit = si->max;
si->highest_bit = 0;
}
si->swap_map[offset] = 1;
nr_swap_pages--;
si->cluster_next = offset+1;
return offset;
}
si->lowest_bit = si->max;
si->highest_bit = 0;
return 0;
}
-
Linux tries to organise pages into 'clusters' on disk of size SWAPFILE_CLUSTER. It allocates SWAPFILE_CLUSTER pages sequentially in swap, keeps count of the number of sequentially allocated pages in swap_info_struct->cluster_nr, and records the current offset in swap_info_struct->cluster_next.
-
After a sequential block has been allocated, it searches for a block of free entries of size SWAPFILE_CLUSTER. If a large enough block can be found, it will be used as another cluster-sized sequence.
-
If no free cluster large enough can be found in the swap area, a simple first-free search is performed starting from swap_info_struct->lowest_bit.
-
The aim is to have pages swapped out at the same time close together, on the premise that pages swapped out together are likely to be related. This makes sense because the page replacement algorithm will use swap space most when linearly scanning the process address space.
-
Without scanning for large free blocks and using them, it is likely the scanning would degenerate to first-free searches and never improve, even though exiting processes are likely to free up large blocks of slots.
-
Pages that are shared between many processes cannot be easily swapped out because there's no quick way to map a struct page to every PTE that references it.
-
If special care wasn't taken this fact could lead to the rare condition where a page that is present for one PTE and swapped out for another gets updated without being synced to disk, thereby losing the update.
-
To address the problem, shared pages that have a reserved slot in backing storage are considered to be part of the 'swap cache'.
-
The swap cache is simply a specialisation of the page cache, the main difference being that pages in the swap cache always use swapper_space as their struct address_space in page->mapping.
-
Another difference is that pages are added to the swap cache with add_to_swap_cache() instead of add_to_page_cache().
-
Taking a look at the swap cache API:
-
get_swap_page() - Allocates a slot in a swap_map by searching active swap areas. Covered in more detail in 11.3.
-
add_to_swap_cache() - Adds a page to the swap cache - first checking to see whether it already exists by calling swap_duplicate(), and if not, adding it to the swap cache via add_to_page_cache_unique().
-
lookup_swap_cache() - Searches the swap cache and returns the struct page corresponding to the specified swap entry. It works by searching the normal page cache based on swapper_space and the swap_map offset.
-
swap_duplicate() - Verifies a swap entry is valid, and if so, increments its swap map count.
-
swap_free() - The complement to swap_duplicate(). Decrements the relevant counter in the swap_map. Once this count reaches zero, the slot is effectively free.
-
Anonymous pages are not part of the swap cache until an attempt is made to swap them out.
-
Taking a look at swapper_space:
struct address_space swapper_space = {
LIST_HEAD_INIT(swapper_space.clean_pages),
LIST_HEAD_INIT(swapper_space.dirty_pages),
LIST_HEAD_INIT(swapper_space.locked_pages),
0, /* nrpages */
&swap_aops,
};
-
A page is defined as being part of the swap cache when its struct page->mapping field has been set to swapper_space. This is determined by PageSwapCache().
-
Linux uses the exact same code for keeping swap and memory in sync as it uses for keeping file-backed pages and memory in sync - both share the page cache code, the differences being in the functions used.
-
The address space for backing storage, swapper_space, uses swap_aops in its struct address_space->a_ops.
-
The page->index field is then used to store the swp_entry_t structure instead of a file offset (its usual purpose.)
-
The struct address_space_operations instance swap_aops is defined as follows, using swap_writepage() and block_sync_page():
static struct address_space_operations swap_aops = {
writepage: swap_writepage,
sync_page: block_sync_page,
};
-
When a page is being added to the swap cache, a slot is allocated with get_swap_page(), added to the page cache with add_to_swap_cache() and then marked dirty.
-
When the page is next laundered, it will actually be written to backing storage on disk as the normal page cache would operate. Diagrammatically:
Anonymous
struct page
-----------------
/ /
\ ... \
/ /
|---------------|
| mapping=NULL |
|---------------| ---------------------
/ / | try_to_swap_out() |
\ ... \------>| attempts to swap |
/ / | pages out from |
|---------------| | process space |
| count=5 | ---------------------
|---------------| |
/ / |
\ ... \ v
/ / ---------------------
----------------- | get_swap_page() |
| allocates slot in |-----------------\
| backing storage | |
--------------------- | -----
| | | |
| | |---|
v \-> | 1 |
----------------------------------- |---|
| add_to_swap_cache() | / /
| adds the page to the page cache | \ \
| using swapper_space as a | / /
| struct address_space. A dirty | -----
| page will now sync to swap | Swap map
-----------------------------------
| /---------\
| | |
v | v
Anonymous | ---------
struct page | | | |
----------------- | ----|----
/ / | v
\ ... \ | ---------
/ / | | | |
|---------------| | ----|----
| mapping = | | v
| swapper_space | | ---------
|---------------| | | |
/ / | ---------
\ ... \---------------/ Page Cache
/ / (inactive_list)
|---------------|
| count=4 |
|---------------|
/ /
\ ... \
/ /
-----------------
-
Subsequent swapping out of the page from shared PTEs results in a call to swap_duplicate(), which simply increments the reference count for the slot in the swap_map.
-
If the PTE is marked dirty by the hardware as the result of a write, the bit is cleared and its struct page is marked dirty via set_page_dirty() so the on-disk copy will be synced before the page is dropped. This ensures that until all references to a page have been dropped, a check will be made to ensure the data on disk matches the data in the page frame.
-
When the reference count to the page reaches 0, the page is eligible to be dropped from the page cache and the swap map count will equal the count of the number of PTEs the on-disk slot belongs to so the slot won't be freed prematurely. It is laundered then finally dropped with the same LRU ageing and logic described in chapter 10.
-
If, on the other hand, a page fault occurs for a page that is swapped out, do_swap_page() checks to see if the page exists in the swap cache by calling lookup_swap_cache() - if it exists, the PTE is updated to point to the page frame, the page reference count is incremented and the swap slot is decremented with swap_free().
- The principal function used for reading in pages is read_swap_cache_async() which is mainly called during page faulting:
/*
* Locate a page of swap in physical memory, reserving swap cache space
* and reading the disk if it is not already cached.
* A failure return means that either the page allocation failed or that
* the swap entry is no longer in use.
*/
struct page * read_swap_cache_async(swp_entry_t entry)
{
struct page *found_page, *new_page = NULL;
int err;
do {
/*
* First check the swap cache. Since this is normally
* called after lookup_swap_cache() failed, re-calling
* that would confuse statistics: use find_get_page()
* directly.
*/
found_page = find_get_page(&swapper_space, entry.val);
if (found_page)
break;
/*
* Get a new page to read into from swap.
*/
if (!new_page) {
new_page = alloc_page(GFP_HIGHUSER);
if (!new_page)
break; /* Out of memory */
}
/*
* Associate the page with swap entry in the swap cache.
* May fail (-ENOENT) if swap entry has been freed since
* our caller observed it. May fail (-EEXIST) if there
* is already a page associated with this entry in the
* swap cache: added by a racing read_swap_cache_async,
* or by try_to_swap_out (or shmem_writepage) re-using
* the just freed swap entry for an existing page.
*/
err = add_to_swap_cache(new_page, entry);
if (!err) {
/*
* Initiate read into locked page and return.
*/
rw_swap_page(READ, new_page);
return new_page;
}
} while (err != -ENOENT);
if (new_page)
page_cache_release(new_page);
return found_page;
}
- This does the following:
-
The function starts by searching the swap cache via find_get_page(). It doesn't use the usual lookup_swap_cache() search function because that updates statistics on the number of searches performed, and we are likely to search multiple times, so find_get_page() makes more sense here.
-
If the page isn't already in the swap cache, allocate one using alloc_page() and add it to the swap cache using add_to_swap_cache(). If, however, a page is found, skip to step 5.
-
If the page cannot be added to the swap cache, the cache is searched again in case another process put the data there in the meantime.
-
The data is read from backing storage via rw_swap_page() (discussed in 11.7), and returned to the user.
-
If the page was found in the cache and didn't need allocating, page_cache_release() is called to decrement the reference count accumulated via find_get_page().
-
When any page is being written to disk, the struct address_space->a_ops field is used to determine the appropriate 'write-out' function.
-
In the case of backing storage, the
address_space
is swapper_space and the swap operations are contained in swap_aops which uses swap_writepage() for its write-out function. -
swap_writepage()
behaves differently depending on whether the writing process is the last user of the swap cache page or not - it determines this via remove_exclusive_swap_page() which simply checks the page count with pagecache_lock held - if no other process is mapping the page it is removed from the swap cache and freed. -
If
remove_exclusive_swap_page()
removed the page from the swap cache and freed it, swap_writepage() unlocks the page because it is no longer in use. -
If the page still exists in the swap cache, rw_swap_page() is called which writes the data to the backing storage.
-
The top-level function for reading/writing to the swap area is rw_swap_page() which ensures that all operations are performed through the swap cache to prevent lost updates.
-
rw_swap_page() invokes rw_swap_page_base(), which in turn does the heavy lifting:
/*
* Reads or writes a swap page.
* wait=1: start I/O and wait for completion. wait=0: start asynchronous I/O.
*
* Important prevention of race condition: the caller *must* atomically
* create a unique swap cache entry for this swap page before calling
* rw_swap_page, and must lock that page. By ensuring that there is a
* single page of memory reserved for the swap entry, the normal VM page
* lock on that page also doubles as a lock on swap entries. Having only
* one lock to deal with per swap entry (rather than locking swap and memory
* independently) also makes it easier to make certain swapping operations
* atomic, which is particularly important when we are trying to ensure
* that shared pages stay shared while being swapped.
*/
static int rw_swap_page_base(int rw, swp_entry_t entry, struct page *page)
{
unsigned long offset;
int zones[PAGE_SIZE/512];
int zones_used;
kdev_t dev = 0;
int block_size;
struct inode *swapf = 0;
if (rw == READ) {
ClearPageUptodate(page);
kstat.pswpin++;
} else
kstat.pswpout++;
get_swaphandle_info(entry, &offset, &dev, &swapf);
if (dev) {
zones[0] = offset;
zones_used = 1;
block_size = PAGE_SIZE;
} else if (swapf) {
int i, j;
unsigned int block = offset
<< (PAGE_SHIFT - swapf->i_sb->s_blocksize_bits);
block_size = swapf->i_sb->s_blocksize;
for (i=0, j=0; j< PAGE_SIZE ; i++, j += block_size)
if (!(zones[i] = bmap(swapf,block++))) {
printk("rw_swap_page: bad swap file\n");
return 0;
}
zones_used = i;
dev = swapf->i_dev;
} else {
return 0;
}
/* block_size == PAGE_SIZE/zones_used */
brw_page(rw, page, dev, zones, block_size);
return 1;
}
- Looking at how the function operates:
-
It checks whether the operation is a read - if so, it clears the struct page->uptodate flag via ClearPageUptodate() because the page is clearly not up to date if I/O is required to fill it with data (the flag is set again if the page is successfully read from disk.)
-
The device for the swap partition, or the inode for the swap file, is acquired via get_swaphandle_info() - this information is needed by the block layer which will be performing the actual I/O.
-
If the swap area is a file, bmap() is used to fill a local array with a list of all the blocks in the filesystem that contain the page data. If the backing storage is a partition, only one page-sized block requires I/O, and since no filesystem is involved bmap() is not required.
-
Either a swap partition or file can be used because rw_swap_page_base() uses brw_page() to perform the actual disk I/O, which can handle both cases generically.
-
All I/O that is performed is asynchronous so the function returns quickly - after the I/O is complete the block layer will unlock the page and any waiting process will wake up.
-
Activating a swap area is conceptually quite simple - open the file, load the header information from disk, populate a struct swap_info_struct, and add that to the swap list.
-
sys_swapon() is the function that activates the swap via a syscall - long sys_swapon(const char * specialfile, int swap_flags). It takes the path to the special file (quite possibly a device) and some swap flags.
-
Since this is 2.4.22 :) the 'Big Kernel Lock' (BKL) is held during the process, preventing any application from entering kernel space while the operation is being performed.
-
The function operates as follows:
-
Find a free swap_info_struct in the swap_info array and initialise it with default values.
-
Call user_path_walk() (and subsequently __user_walk()) to traverse the directory tree for the specified specialfile and populate a struct nameidata with the available data on the file, such as the struct dentry data and the struct vfsmount data on where the file is stored.
-
Populate the swap_info_struct fields relating to the dimensions of the swap area and how to find it. If the swap area is a partition, the block size will be aligned to PAGE_SIZE before the size is calculated. If it is a file, the information is taken directly from the struct inode.
-
Ensure the area is not already activated. If it isn't, allocate a page from memory and read the first page-sized slot from the swap area, which contains information about the number of good slots and how to populate the swap map with bad entries (see 11.1 for more details.)
-
Allocate memory with vmalloc() for swap_info_struct->swap_map and initialise each entry with 0 for good slots and SWAP_MAP_BAD otherwise (ideally the header will indicate v2, as v1 was limited to swap areas of just under 128MiB for architectures with a 4KiB page size like x86.)
-
After ensuring the header is valid, fill in the remaining swap_info_struct fields like the maximum number of pages and the number of available good pages, then update the global statistics nr_swap_pages and total_swap_pages.
-
The swap area is now fully active and initialised, so insert it into the swap list in the correct position based on the priority of the newly activated area.
-
Release the BKL.
-
In comparison to activating a swap area, deactivation is incredibly expensive. The main reason for this is that the swap can't simply be removed - every page that was swapped out has to now be swapped back in again.
-
Just as there's no quick way to map a struct page to every PTE that references it, there is no quick way to map a swap entry to a PTE either - it requires all process page tables be traversed to find PTEs that reference the swap area that is being deactivated and swap them all in.
-
Of course all this means that swap deactivation will fail if sufficient physical memory is not available.
-
The function concerned with deactivating an area is sys_swapoff(). The function is mostly focused on updating the struct swap_info_struct, leaving the heavy lifting of paging in each paged-out page to try_to_unuse(), which is extremely expensive.
-
For each slot used in the swap_info_struct's swap_map field, the page tables have to be traversed searching for it, and in the worst case all page tables belonging to all struct mm_structs have to be traversed.
-
Broadly speaking
sys_swapoff()
performs the following:
-
Call user_path_walk() to retrieve information about the special file to be deactivated, then take the BKL.
-
Remove the swap_info_struct from the swap list and update the global statistics on the number of swap pages available (nr_swap_pages) and the total number of swap entries (total_swap_pages.) Once complete, the BKL can be released.
-
Call try_to_unuse() which will page in all pages from the swap area to be deactivated. The function loops through the swap map using find_next_to_unuse() to locate the next used swap slot (see below for more details on what
try_to_unuse()
does.) -
If there was not enough available memory to page in all the entries, the swap area is reinserted back into the running system because it cannot simply be dropped. If, on the other hand, the process succeeded, the swap_info_struct is placed into an uninitialised state and the swap_map memory is freed with vfree().
- try_to_unuse() does the following:
-
Call read_swap_cache_async() to allocate a page for the slot saved on disk. Ideally, it exists in the swap cache already, but the page allocator will be called if it isn't.
-
Wait for the page to be fully paged in and lock it. Once locked, call unuse_process() for every process that has a PTE referencing the page. This function traverses the page tables searching for the relevant PTE, then updates it to point to the correct struct page. If the page is a shared memory page with no remaining references, shmem_unuse() is called instead.
-
Free all slots that were permanently mapped. It's believed that slots will never become permanently reserved, so a risk is taken here.
-
Delete the page from the swap cache to prevent try_to_swap_out() from referencing a page in the event it still somehow has a reference in swap from a swap map.