-
The Out-of-Memory (OOM) manager has one simple task: check whether there is enough memory to satisfy a request and, if not, verify that the system is truly out of memory and, if so, select a process to kill.
-
The OOM killer is a controversial part of the VM. There has been much discussion about removing it, yet it remains.
-
For certain operations, such as expanding the heap with brk() or remapping an address space with mremap(), the system will check if there is enough available memory to satisfy a request. Note that this is separate from the out_of_memory() path that is covered in the next section; rather, it is an attempt to avoid the system reaching a state of OOM if at all possible.
-
When checking available memory, the number of requested pages is passed as a parameter to vm_enough_memory(). Unless the sysadmin has specified that the system should overcommit memory, the amount of available memory will be checked against the request.
-
To determine how many pages are potentially available, Linux sums up the following:
-
Total page cache - Easily reclaimed.
-
Total free pages - They are already available.
-
Total free swap pages - Because userspace processes may be paged out.
-
Total pages managed by swapper_space - This double-counts free swap pages, but is somewhat mitigated by the fact that slots are sometimes reserved but not used.
-
Total pages used by the struct dentry cache - Easily reclaimed.
-
Total pages used by the struct inode cache - Easily reclaimed.
-
If the total number of pages added here is sufficient for the request, vm_enough_memory() returns true. Otherwise, it returns false and -ENOMEM is returned to userspace.
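The accounting above can be sketched in user space as follows. This is a minimal illustration, not the kernel's implementation: struct mem_snapshot, its field names, and enough_memory_sketch() are hypothetical, and the strict comparison at the end is an assumption about what "sufficient" means.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the counters listed above, all in pages.
 * The field names are illustrative, not the kernel's own variables. */
struct mem_snapshot {
    unsigned long page_cache;    /* total page cache */
    unsigned long free_pages;    /* total free pages */
    unsigned long free_swap;     /* total free swap pages */
    unsigned long swapper_space; /* pages managed by swapper_space */
    unsigned long dentry_cache;  /* pages used by the dentry cache */
    unsigned long inode_cache;   /* pages used by the inode cache */
};

/* Returns true if the request looks satisfiable, false if the caller
 * should fail the request with -ENOMEM. */
bool enough_memory_sketch(const struct mem_snapshot *s,
                          unsigned long requested_pages,
                          bool overcommit)
{
    unsigned long available;

    /* With overcommit enabled, no accounting is done at all. */
    if (overcommit)
        return true;

    /* Sum every pool that is free or assumed to be easily reclaimed. */
    available = s->page_cache + s->free_pages + s->free_swap +
                s->swapper_space + s->dentry_cache + s->inode_cache;

    /* Assumption: "sufficient" means strictly more than requested. */
    return available > requested_pages;
}
```

With a snapshot totalling 195 pages, a request for 100 pages would pass and a request for 500 pages would fail unless overcommit is enabled.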
-
When the machine is low on memory, old page frames will be reclaimed (see chapter 10) but, despite reclaiming pages, the system may find that it was unable to free enough to satisfy a request even when scanning at the highest priority.
-
If the system does fail to free page frames, out_of_memory() is called to see if the system is actually out of memory and, if it is, to kill a process:
/**
 * out_of_memory - is the system out of memory?
 */
void out_of_memory(void)
{
    static unsigned long first, last, count, lastkill;
    unsigned long now, since;

    /*
     * Enough swap space left? Not OOM.
     */
    if (nr_swap_pages > 0)
        return;

    now = jiffies;
    since = now - last;
    last = now;

    /*
     * If it's been a long time since last failure,
     * we're not oom.
     */
    last = now;
    if (since > 5*HZ)
        goto reset;

    /*
     * If we haven't tried for at least one second,
     * we're not really oom.
     */
    since = now - first;
    if (since < HZ)
        return;

    /*
     * If we have gotten only a few failures,
     * we're not really oom.
     */
    if (++count < 10)
        return;

    /*
     * If we just killed a process, wait a while
     * to give that task a chance to exit. This
     * avoids killing multiple processes needlessly.
     */
    since = now - lastkill;
    if (since < HZ*5)
        return;

    /*
     * Ok, really out of memory. Kill something.
     */
    lastkill = now;
    oom_kill();

reset:
    first = now;
    count = 0;
}
-
The series of checks exists because the system may simply be waiting for I/O to complete, for pages to be swapped out to backing storage, or for some similar condition. Killing a process should be avoided whenever possible, so these transient cases are ruled out first.
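To experiment with these timing heuristics outside the kernel, the same checks can be restated with the clock passed in explicitly. This is only a sketch: struct oom_state, oom_should_kill(), and the SKETCH_HZ value of 100 are assumptions made for illustration, and the function reports that a kill would happen instead of calling oom_kill().

```c
#include <stdbool.h>

#define SKETCH_HZ 100UL /* assumed tick rate; the real HZ depends on the build */

/* Stands in for the static variables kept by out_of_memory(). */
struct oom_state {
    unsigned long first, last, count, lastkill;
};

/* Returns true when the heuristics would decide to kill a process.
 * 'now' plays the role of jiffies. */
bool oom_should_kill(struct oom_state *st, unsigned long now,
                     unsigned long free_swap_pages)
{
    unsigned long since;
    bool kill = false;

    /* Swap space left: not out of memory. */
    if (free_swap_pages > 0)
        return false;

    since = now - st->last;
    st->last = now;

    /* Only consider killing if the previous failure was recent. */
    if (since <= 5 * SKETCH_HZ) {
        /* Failures must have persisted for at least one second. */
        if (now - st->first < SKETCH_HZ)
            return false;
        /* A handful of failures is not enough. */
        if (++st->count < 10)
            return false;
        /* Give a recently killed task time to exit. */
        if (now - st->lastkill < 5 * SKETCH_HZ)
            return false;
        st->lastkill = now;
        kill = true;
    }

    /* Both the kill path and the "long time since last failure" path
     * restart the measurement window. */
    st->first = now;
    st->count = 0;
    return kill;
}
```

If failures are reported every 20 ticks with no free swap, the first call only resets the window, failures during the following second are ignored, and the tenth failure counted after that second triggers the kill decision.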
-
The function select_bad_process() determines the process to kill by stepping through each running task and calculating how suitable it is for killing with the function badness(), which computes:
badness_for_task = total_vm_for_task/(cpu_time_in_seconds^0.5 * cpu_time_in_minutes^0.25)
-
The square roots are approximated by int_sqrt().
-
This formula has been chosen to prefer a process that is using a large amount of memory but has not been running for long. Long-lived processes are unlikely to be the cause of a memory shortage.
-
If the process is a root process or has CAP_SYS_ADMIN capabilities, the points are divided by 4 since it is assumed that privileged processes are well-behaved.
-
Similarly, if the process has CAP_SYS_RAWIO capabilities (access to raw devices), the points are further divided by 4 because it is not a good idea to kill a process that has direct access to hardware.
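The calculation described above can be sketched as follows. It is a minimal user-space illustration under stated assumptions: int_sqrt_sketch() is a naive stand-in for the kernel's int_sqrt(), badness_sketch() and its parameter names are hypothetical, and skipping a division when the integer root is zero is an assumption made here to avoid dividing by zero.

```c
#include <stdbool.h>

/* Naive integer square root (floor); a stand-in for int_sqrt(). */
unsigned long int_sqrt_sketch(unsigned long x)
{
    unsigned long r = 0;

    /* (r + 1) <= x / (r + 1) is (r + 1)^2 <= x without overflow. */
    while ((r + 1) <= x / (r + 1))
        r++;
    return r;
}

/* Hypothetical restatement of the badness formula:
 * total_vm / (sqrt(cpu seconds) * fourth_root(minutes)),
 * then divided by 4 for each privilege that applies. */
unsigned long badness_sketch(unsigned long total_vm_pages,
                             unsigned long cpu_seconds,
                             unsigned long cpu_minutes,
                             bool privileged, /* root or CAP_SYS_ADMIN */
                             bool raw_io)     /* CAP_SYS_RAWIO */
{
    unsigned long points = total_vm_pages;
    unsigned long s;

    /* Square root of the time in seconds. */
    s = int_sqrt_sketch(cpu_seconds);
    if (s)
        points /= s;

    /* Fourth root of the time in minutes, as two integer square roots. */
    s = int_sqrt_sketch(int_sqrt_sketch(cpu_minutes));
    if (s)
        points /= s;

    /* Privileged processes are assumed to be well-behaved. */
    if (privileged)
        points /= 4;

    /* Processes with raw hardware access are poor candidates. */
    if (raw_io)
        points /= 4;

    return points;
}
```

For example, a task with 10000 pages, 100 seconds and 16 minutes scores 10000 / 10 / 2 = 500 points, or 125 if it is privileged.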
-
After a task is selected, the list is walked again and each process that shares the same struct mm_struct as the selected process (i.e. threads) is sent a signal.
-
If the process has CAP_SYS_RAWIO capabilities, a SIGTERM signal is sent to give the process a chance of exiting cleanly. Otherwise, a SIGKILL is sent.
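The second walk over the task list and the choice of signal can be sketched as below. This is an illustration only: struct task_sketch, its mm_id field (standing in for a shared struct mm_struct pointer), and oom_kill_sketch() are hypothetical, the signal is recorded and not delivered, and applying the CAP_SYS_RAWIO check to each task individually is an assumption.

```c
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a task. */
struct task_sketch {
    int pid;
    int mm_id;           /* tasks sharing an address space share this id */
    bool cap_sys_rawio;  /* has direct access to hardware */
    int pending_signal;  /* 0 means no signal recorded */
};

/* Records a signal for every task sharing the victim's address space
 * and returns how many tasks were signalled. */
size_t oom_kill_sketch(struct task_sketch *tasks, size_t n, size_t victim)
{
    size_t i, signalled = 0;
    int mm = tasks[victim].mm_id;

    for (i = 0; i < n; i++) {
        if (tasks[i].mm_id != mm)
            continue;
        /* Raw I/O tasks get a chance to exit cleanly. */
        tasks[i].pending_signal =
            tasks[i].cap_sys_rawio ? SIGTERM : SIGKILL;
        signalled++;
    }
    return signalled;
}
```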