Fix earlyoom killing processes too early when ZFS is in use #191

Closed

nh2
Contributor

@nh2 nh2 commented Apr 25, 2020


The ZFS ARC cache is memory-reclaimable, like the Linux buffer cache. However, in contrast to the buffer cache, it currently does not count to MemAvailable (see openzfs/zfs#10255), leading earlyoom to believe we are out of memory when we still have a lot of memory available (in practice, many GBs).

Thus, until now, earlyoom tended to kill processes on ZFS systems even though there was no memory pressure.

This commit fixes it by adding the size field of /proc/spl/kstat/zfs/arcstats to MemAvailable.

The effect can be checked easily on ZFS systems:

Before this commit, dropping the ARC via (command from https://serverfault.com/a/857386/128321)

echo 3 | sudo tee /proc/sys/vm/drop_caches

would result in an increase of free memory in earlyoom's output; with this fix, it stays equal.
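
As a rough illustration of the adjustment described above (a sketch, not the PR's actual code): the ARC `size` is added to the kernel's `MemAvailable`. `/proc/meminfo` reports values in KiB while arcstats reports bytes, so a unit conversion is needed. The function name is hypothetical.

```c
/* Hypothetical helper, not taken from earlyoom.
 * Returns MemAvailable (KiB) with the ZFS ARC size (bytes, from the
 * arcstats `size` field) counted as available memory. */
long long mem_available_with_arc_kib(long long mem_available_kib,
                                     long long arc_size_bytes)
{
    if (arc_size_bytes <= 0) {
        /* No ZFS module loaded, or the lookup failed: change nothing. */
        return mem_available_kib;
    }
    /* arcstats is in bytes, /proc/meminfo is in KiB. */
    return mem_available_kib + arc_size_bytes / 1024;
}
```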

@nh2
Contributor Author

nh2 commented Apr 25, 2020

The Codacy report shows 2 issues:

The scope of the variable 'arcstats_buf' can be reduced.

Your choice, should I move it into the if block, or do you prefer it to remain at the top to indicate the memory use of the function?

Does not handle strings that are not \0-terminated; if given one it may perform an over-read (it could cause a crash if unprotected) (CWE-126).

int matches = sscanf(hit + strlen(search_term), " %*u %lld", &zfs_arcstats_bytes);

Not sure if that is addressable, see e.g. https://stackoverflow.com/questions/18368712/since-we-have-snprintf-why-we-dont-have-a-snscanf/18368925#18368925

It would be nice/useful if you could use %*s as you can with snprintf(), but you can't — in sscanf(), the * means 'do not assign scanned value', not the length.
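
A minimal sketch of the two points raised here, with hypothetical helper names: `%*u` in `sscanf()` reads a field without storing it, and because `sscanf()` has no length argument, the usual mitigation is to make sure the buffer is NUL-terminated before it is scanned.

```c
#include <stdio.h>
#include <string.h>

/* Parses the data column that follows a field name in an arcstats-style
 * line, e.g. "    4    1721339808". Returns -1 on parse failure.
 * `after_name` must be NUL-terminated: sscanf() takes no length limit. */
long long parse_value_after_name(const char* after_name)
{
    long long value = -1;
    /* " " skips whitespace, "%*u" reads the type column without storing it,
     * "%lld" stores the data column. Only stored conversions count towards
     * the return value, so success is 1, not 2. */
    int matches = sscanf(after_name, " %*u %lld", &value);
    return (matches == 1) ? value : -1;
}

/* Copies at most cap-1 bytes and always terminates the destination, so a
 * raw, unterminated chunk can then be handed to sscanf(). */
size_t copy_terminated(char* dst, size_t cap, const char* src, size_t src_len)
{
    if (cap == 0) {
        return 0;
    }
    size_t n = src_len < cap - 1 ? src_len : cap - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Reading with `fread(buf, 1, sizeof(buf) - 1, fd)` and writing a terminator at `buf[len]` achieves the same guarantee without the extra copy.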

@hakavlad
Contributor

Minimum ARC size limit. When the ARC is asked to shrink, it will stop shrinking at c_min as tuned by zfs_arc_min.

https://github.com/openzfs/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_arc_min

Maybe we should subtract c_min from size.

@nh2
Contributor Author

nh2 commented Apr 26, 2020

> Maybe we should subtract c_min from size.

That sounds like a good idea.

@rfjakob
Owner

rfjakob commented Apr 26, 2020

I have merged everything but 64feba5, for which I have a few requests. I think the easiest for you now is to check out the latest master, cherry-pick 64feba5 on top, and then amend.

@@ -56,17 +57,25 @@ static long long available_guesstimate(const char* buf)
return MemFree + Cached + Buffers - Shmem;
}

-/* Parse /proc/meminfo.
+/* Parse /proc/meminfo and other related files:
+ * * ZFS's /proc/spl/kstat/zfs/arcstats, if it exists
Owner

Make this a separate function (parse_arcstats?) please

Contributor Author

@nh2 nh2 Apr 28, 2020

Done.

meminfo.c Outdated
size_t len = fread(buf, 1, sizeof(buf) - 1, fd);
if (len == 0) {
fatal(102, "could not read /proc/meminfo: 0 bytes returned\n");
// Loop to handle short reads.
Owner

Remove retry loop, glibc handles this internally

Contributor Author

Adjusted as per #189 (comment)

meminfo.c Outdated
* This function either returns valid data or kills the process
* with a fatal error.
*/
meminfo_t parse_meminfo()
{
static FILE* fd;
static FILE* arcstats_fd;
Owner

Not sure if we should keep this fd always-open. zfs is a kernel module that can be unloaded, right?

Contributor Author

That's a good point, I've changed it and added a comment.

meminfo.c Outdated
// hits 4 373259415
// ...
// size 4 1721339808
// We scan for the "size" line and parse the second number.
Owner

Maybe it is possible to use

get_entry("size                            4", arcstats_buf)

here?

Contributor Author

I am not sure whether the number of spaces here is guaranteed to be constant. It seems to be "suspiciously enough" to fit the length of all parameters. So far I have not found a specification that discussed this.

Owner

Hm ok, I agree, too risky.

Contributor Author

I've added a comment about that.

@nh2
Contributor Author

nh2 commented Apr 28, 2020

I've addressed the feedback, and implemented subtraction of c_min and arc_meta_min. Please take another look.


char* hit = strstr(buf, search_term);
if (hit == NULL) {
warn("parse_meminfo: arcstats does not contain size field\n");
Owner

warn("get_arcstats_entry: arcstats does not contain field %s", search_term);

// ` ` skips spaces, `%*u` ignores the `type` field.
int matches = sscanf(hit + strlen(search_term), " %*u %lld", &result);
if (matches < 1) {
warn("parse_meminfo: unexpected /proc/spl/kstat/zfs/arcstats contents in size line\n");
Owner

warn("get_arcstats_entry: ..... %s", search_term);
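
A sketch of how such a lookup could be written so that it does not depend on the column width discussed earlier and names the field in its warnings. This is a variation for illustration, not the PR's code: it takes the bare field name and builds the search term itself, and it uses `fprintf(stderr, ...)` as a stand-in for earlyoom's `warn()`.

```c
#include <stdio.h>
#include <string.h>

/* Sketch of a field lookup in an arcstats buffer.
 * `field` is a bare name such as "size"; `buf` must be NUL-terminated.
 * Returns the data value, or -1 after printing a warning. */
long long get_arcstats_entry(const char* field, const char* buf)
{
    /* "\n<name> " matches the name only at the start of a line and only as
     * a whole word, so "size" does not match "compressed_size", and the
     * number of padding spaces does not matter. */
    char search_term[64];
    int n = snprintf(search_term, sizeof(search_term), "\n%s ", field);
    if (n < 0 || (size_t)n >= sizeof(search_term)) {
        fprintf(stderr, "get_arcstats_entry: field name too long: %s\n", field);
        return -1;
    }
    const char* hit = strstr(buf, search_term);
    if (hit == NULL) {
        fprintf(stderr, "get_arcstats_entry: arcstats does not contain field %s\n", field);
        return -1;
    }
    long long result = -1;
    /* ` ` skips spaces, `%*u` ignores the `type` column. */
    int matches = sscanf(hit + strlen(search_term), " %*u %lld", &result);
    if (matches < 1) {
        fprintf(stderr, "get_arcstats_entry: could not parse value of field %s\n", field);
        return -1;
    }
    return result;
}
```

Because the search term starts with a newline, a field on the very first line of the buffer would not be found; in arcstats the first two lines are headers, so this does not affect the data fields.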

if (fd == NULL && errno != ENOENT) {
fatal(106, "could not open /proc/spl/kstat/zfs/arcstats: %s\n", strerror(errno));
}
if (fd != NULL) {
Owner

Invert the if condition to get rid of one level of indentation:

if (fd == NULL) {
return 0;
}

107: Could not read /proc/spl/kstat/zfs/arcstats

108: Could not parse /proc/spl/kstat/zfs/arcstats contents

Owner

A fatal error is too much here. Keep in mind that this depends on the behavior of a debug file of an out-of-tree kernel module.

FILE* fd = NULL;
fd = fopen("/proc/spl/kstat/zfs/arcstats", "r");
if (fd == NULL && errno != ENOENT) {
fatal(106, "could not open /proc/spl/kstat/zfs/arcstats: %s\n", strerror(errno));
Owner

Warning is ok, but no fatal error


size_t len = fread(arcstats_buf, 1, sizeof(arcstats_buf) - 1, fd);
if (ferror(fd)) {
fatal(107, "could not read /proc/spl/kstat/zfs/arcstats: %s\n", strerror(errno));
Owner

Warning is ok, but no fatal error

fatal(107, "could not read /proc/spl/kstat/zfs/arcstats: %s\n", strerror(errno));
}
if (len == 0) {
fatal(108, "could not read /proc/spl/kstat/zfs/arcstats: 0 bytes returned\n");
Owner

Warning is ok, but no fatal error
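
Taken together, the review requests on this function (open the file on every call, return early when it is missing, warn instead of aborting) could look roughly like the following. This is a sketch under assumptions: the function name is hypothetical, and `fprintf(stderr, ...)` stands in for earlyoom's `warn()`.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Reads up to cap-1 bytes of `path` into `buf` and NUL-terminates it.
 * Returns the number of bytes read; 0 means "no usable data".
 * The file is opened on every call, since the kernel module that
 * provides it may be loaded or unloaded at any time. */
size_t read_stats_file(const char* path, char* buf, size_t cap)
{
    if (cap == 0) {
        return 0;
    }
    buf[0] = '\0';
    FILE* fd = fopen(path, "r");
    if (fd == NULL) {
        /* A missing file is the normal case on systems without ZFS. */
        if (errno != ENOENT) {
            fprintf(stderr, "warning: could not open %s: %s\n", path, strerror(errno));
        }
        return 0;
    }
    size_t len = fread(buf, 1, cap - 1, fd);
    if (ferror(fd)) {
        fprintf(stderr, "warning: could not read %s\n", path);
        len = 0;
    }
    fclose(fd);
    buf[len] = '\0';
    return len;
}
```

A caller would treat a return value of 0 as "ARC size unknown" and fall back to the unmodified `MemAvailable`.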

// * `arc_meta_min`: https://github.com/openzfs/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_arc_meta_min
long long arcstats_c_min_bytes = get_arcstats_entry("\nc_min ", arcstats_buf);
if (arcstats_c_min_bytes > 0) {
zfs_available_bytes -= arcstats_c_min_bytes;
Owner

This value can become negative! You have to check

zfs_available_bytes > arcstats_c_min_bytes

first. See https://gist.github.com/rfjakob/0a674048b9efd4b985a70107d64f1110 for example when it would get negative

}
long long arcstats_arc_meta_min_bytes = get_arcstats_entry("\narc_meta_min ", arcstats_buf);
if (arcstats_arc_meta_min_bytes > 0) {
zfs_available_bytes -= arcstats_arc_meta_min_bytes;
Owner

Make sure that the value does not become negative
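
One way to apply the requested guard to both subtractions is a small helper that never goes below zero. The helper name is hypothetical; this is a sketch, not the code that was reviewed.

```c
/* Subtracts `reserve` from `available` without going below zero.
 * A non-positive reserve (for example a failed lookup that returned -1)
 * leaves the value unchanged. */
long long subtract_floor_zero(long long available, long long reserve)
{
    if (reserve <= 0) {
        return available;
    }
    if (available > reserve) {
        return available - reserve;
    }
    return 0;
}
```

Both reserves can then be applied in sequence, e.g. `subtract_floor_zero(subtract_floor_zero(size, c_min), arc_meta_min)`.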

@hakavlad
Contributor

hakavlad commented May 1, 2020

> I've addressed the feedback, and implemented subtraction of c_min and arc_meta_min

Maybe add arc_meta_used.

I saw size < c_min in some cases.

My suggestion:

zfs_c_reclaimable = size - c_min

if zfs_c_reclaimable < 0:
    zfs_c_reclaimable = 0

arc_meta_reclaimable = arc_meta_used - arc_meta_min

if arc_meta_reclaimable < 0:
    arc_meta_reclaimable = 0

zfs_reclaimable = zfs_c_reclaimable + arc_meta_reclaimable
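
The suggestion above translates to C roughly as follows. The function names are hypothetical, and all arguments are assumed to be byte counts read from arcstats.

```c
/* Returns v, or 0 if v is negative. */
long long clamp_non_negative(long long v)
{
    return v < 0 ? 0 : v;
}

/* Reclaimable ARC memory as proposed in the comment above:
 * max(size - c_min, 0) + max(arc_meta_used - arc_meta_min, 0). */
long long zfs_reclaimable_bytes(long long size, long long c_min,
                                long long arc_meta_used, long long arc_meta_min)
{
    long long c_reclaimable = clamp_non_negative(size - c_min);
    long long meta_reclaimable = clamp_non_negative(arc_meta_used - arc_meta_min);
    return c_reclaimable + meta_reclaimable;
}
```

Clamping each term separately means that a case such as size < c_min contributes zero instead of cancelling out the other term.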

@rfjakob
Owner

rfjakob commented May 1, 2020

Could you post the /proc/spl/kstat/zfs/arcstats of an active system? Maybe arc_meta_used is so small compared to size that it does not matter?

The ZFS ARC cache is memory-reclaimable, like the Linux buffer cache.
However, in contrast to the buffer cache, it currently does not count
to `MemAvailable` (see openzfs/zfs#10255),
leading earlyoom to believe we are out of memory when we still have
a lot of memory available (in practice, many GBs).

Thus, until now, earlyoom tended to kill processes on ZFS systems
even though there was no memory pressure.

This commit fixes it by adding the `size` field of
`/proc/spl/kstat/zfs/arcstats` to `MemAvailable`.

The effect can be checked easily on ZFS systems:

Before this commit, dropping the ARC via (command from [1])

    echo 3 | sudo tee /proc/sys/vm/drop_caches

would result in an increase of free memory in earlyoom's output;
with this fix, it stays equal.

[1]: https://serverfault.com/a/857386/128321
@hakavlad
Contributor

hakavlad commented May 2, 2020

> Maybe arc_meta_used is so small

On the contrary, the use of arc_meta_used allows you to get a result close to the expected.

12 1 0x01 98 26656 1920163911 654183491814
name                            type data
hits                            4    1115656
misses                          4    87670
demand_data_hits                4    123358
demand_data_misses              4    60050
demand_metadata_hits            4    925838
demand_metadata_misses          4    15271
prefetch_data_hits              4    342
prefetch_data_misses            4    2081
prefetch_metadata_hits          4    66118
prefetch_metadata_misses        4    10268
mru_hits                        4    221708
mru_ghost_hits                  4    0
mfu_hits                        4    827620
mfu_ghost_hits                  4    0
deleted                         4    25
mutex_miss                      4    0
access_skip                     4    0
evict_skip                      4    49
evict_not_enough                4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    250880
evict_l2_ineligible             4    8192
evict_l2_skip                   4    0
hash_elements                   4    87727
hash_elements_max               4    87727
hash_collisions                 4    7059
hash_chains                     4    6542
hash_chain_max                  4    4
p                               4    1031851008
c                               4    2063702016
c_min                           4    128981376
c_max                           4    2063702016
size                            4    1550409072
compressed_size                 4    961233408
uncompressed_size               4    1935520256
overhead_size                   4    287200256
hdr_size                        4    30435432
data_size                       4    919345152
metadata_size                   4    329088512
dbuf_size                       4    46606248
dnode_size                      4    117981792
bonus_size                      4    106951936
anon_size                       4    65536
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    805175296
mru_evictable_data              4    653491200
mru_evictable_metadata          4    24316416
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    443192832
mfu_evictable_data              4    187506688
mfu_evictable_metadata          4    9936384
mfu_ghost_size                  4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
memory_all_bytes                4    4127404032
memory_free_bytes               4    1406578688
memory_available_bytes          3    1342091264
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    631063920
arc_meta_limit                  4    1547776512
arc_dnode_limit                 4    154777651
arc_meta_max                    4    632715624
arc_meta_min                    4    16777216
async_upgrade_sync              4    676
demand_hit_predictive_prefetch  4    3496
demand_hit_prescient_prefetch   4    0
arc_need_free                   4    0
arc_sys_free                    4    64490688
arc_raw_size                    4    0

In this case (values in MiB):

MemTotal: 3936
MemAvailable: 1272
size: 1479
c_min: 123
arc_meta_used: 602
arc_meta_min: 16
zfs_reclaimable: 1943 (supposed)
NewMemAvailable: 3214

Monitor: https://github.com/hakavlad/nohang-extra/blob/master/zfs/m04
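
For reference, the arithmetic can be re-run on the byte values from the dump above. This is only a sketch with hypothetical function names. It rounds down when converting to MiB, so it gives about 1941 MiB of reclaimable memory, within a couple of MiB of the figure quoted above.

```c
#define MIB (1024LL * 1024LL)

/* Converts bytes to MiB, rounding down. */
long long bytes_to_mib(long long bytes)
{
    return bytes / MIB;
}

/* Applies the proposed formula to the values from the posted dump. */
long long sample_reclaimable_bytes(void)
{
    const long long size = 1550409072LL;          /* size */
    const long long c_min = 128981376LL;          /* c_min */
    const long long arc_meta_used = 631063920LL;  /* arc_meta_used */
    const long long arc_meta_min = 16777216LL;    /* arc_meta_min */

    long long c_part = size > c_min ? size - c_min : 0;
    long long meta_part =
        arc_meta_used > arc_meta_min ? arc_meta_used - arc_meta_min : 0;
    return c_part + meta_part;
}
```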

@rfjakob
Owner

rfjakob commented May 2, 2020

I made a spreadsheet of how things look before and after echo 3 > /proc/sys/vm/drop_caches. In free I see 601 MB more available memory. But nothing I see in arcstats explains these 601 MB.

https://docs.google.com/spreadsheets/d/1ZVtcdkoZqsOAAEK20wd_h9mOvrNOYN-AwmsM7Ma6iMk/edit?usp=sharing

Raw source data: https://gist.github.com/rfjakob/272691aa61798c134410a69a323cc5d7

@hakavlad
Contributor

hakavlad commented May 4, 2020

This bug should be mentioned in the earlyoom documentation.

Owner

@rfjakob rfjakob left a comment

No fatal errors please. Keep in mind that this depends on the behavior of a debug file of an out-of-tree kernel module.

@rfjakob
Owner

rfjakob commented Aug 16, 2020

Closing after 3 months of inactivity after review, please reopen when ready

@rfjakob rfjakob closed this Aug 16, 2020
@nh2
Contributor Author

nh2 commented Aug 18, 2020

I don't currently have time to finish this, would appreciate if somebody else could take it over.

@markusressel

Kinda bummed this PR is stuck after so much work has already been put into it 😢
I really like earlyoom, using it on all my machines, but without it respecting the ARC it is sometimes killing applications like crazy even though only half of the ram is used (12GB total in my case). If I was more experienced with C I would jump on it immediately.

The one thing I can add is that I use this bash command:

echo $(awk '$1 == "size" {print $NF}' /proc/spl/kstat/zfs/arcstats) $(awk '$1 == "c" {print $NF}' /proc/spl/kstat/zfs/arcstats) | awk '{print int(($1 / $2) * 100) "%"}'

in my polybar setup for months now to indicate the usage of the ARC, and it has been working flawlessly since the beginning.

The command I use to display the total ram usage (excluding ARC) is this:

echo $(free -b | awk 'NR==2 {print $2 " " $7}') $(awk '$1 == "size" {print $NF}' /proc/spl/kstat/zfs/arcstats) | awk '{print int(((($1 - $2) - $3) / $1) * 100) "%"}'

If these commands should fail, I would notice immediately. Of course that really depends on the setup a user might have, but I would argue it's better than having my IDE killed at 55% RAM usage or freezing the entire system because of extremely low memory...
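
The two shell one-liners above compute simple ratios. A C sketch of the same calculations follows, with hypothetical function names. Unlike the awk versions, it guards against division by zero and clamps a negative "used" value to zero.

```c
/* ARC fill level: arcstats `size` relative to the ARC target size `c`,
 * truncated to a whole percent like awk's int(). */
int arc_usage_percent(long long size_bytes, long long c_bytes)
{
    if (c_bytes <= 0) {
        return 0;
    }
    return (int)((double)size_bytes / (double)c_bytes * 100.0);
}

/* RAM in use with the ARC excluded:
 * (total - available - arc_size) / total, as a whole percent. */
int ram_usage_excluding_arc_percent(long long total_bytes,
                                    long long available_bytes,
                                    long long arc_size_bytes)
{
    if (total_bytes <= 0) {
        return 0;
    }
    long long used = total_bytes - available_bytes - arc_size_bytes;
    if (used < 0) {
        used = 0; /* the awk one-liner would print a negative percentage */
    }
    return (int)((double)used / (double)total_bytes * 100.0);
}
```

With the `size` and `c` values from the arcstats dump posted earlier in this thread, `arc_usage_percent(1550409072, 2063702016)` evaluates to 75.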

I am really hoping someone with more skill than me is willing to join this ❤️

@hakavlad
Contributor

@markusressel you could use nohang [1] as a temporary solution. It tries to respect [2] the ARC.

[1] https://github.com/hakavlad/nohang
[2] hakavlad/nohang@16f7db1
