Systemd: Restart on OOM #3611

Dreamsorcerer · 2024-11-05T12:48:56Z

Description

After an OOM kill, the process should be restarted by systemd. Prior to this change, that did not happen.

The comment in the file says it doesn't use on-failure in case of config errors, which I assume is caused by an unclean exit code. on-abnormal is the same as on-failure except for unclean exit code.
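As a rough illustration of how these modes differ, the restart table from the systemd.service documentation can be written out as a small Python lookup. The names `RESTART_TABLE` and `restarts_on` are made up for this sketch. It only mirrors the documented table and says nothing about how systemd classifies an OOM kill internally.

```python
# Exit causes that trigger a restart for each Restart= value, transcribed from
# the table in the systemd.service man page. All names are illustrative only.
CLEAN = "clean exit code or signal"
UNCLEAN_EXIT = "unclean exit code"
UNCLEAN_SIGNAL = "unclean signal"
TIMEOUT = "timeout"
WATCHDOG = "watchdog"

RESTART_TABLE = {
    "no": set(),
    "always": {CLEAN, UNCLEAN_EXIT, UNCLEAN_SIGNAL, TIMEOUT, WATCHDOG},
    "on-success": {CLEAN},
    "on-failure": {UNCLEAN_EXIT, UNCLEAN_SIGNAL, TIMEOUT, WATCHDOG},
    "on-abnormal": {UNCLEAN_SIGNAL, TIMEOUT, WATCHDOG},
    "on-abort": {UNCLEAN_SIGNAL},
    "on-watchdog": {WATCHDOG},
}


def restarts_on(mode, cause):
    """Return True if the documented table lists `cause` for Restart=`mode`."""
    return cause in RESTART_TABLE[mode]


# What on-failure covers that on-abnormal does not.
print(RESTART_TABLE["on-failure"] - RESTART_TABLE["on-abnormal"])
# -> {'unclean exit code'}

# What on-abnormal covers that on-abort does not.
print(sorted(RESTART_TABLE["on-abnormal"] - RESTART_TABLE["on-abort"]))
# -> ['timeout', 'watchdog']
```

In this model, a config error that ends in an unclean exit code triggers a restart under on-failure but not under on-abnormal, which matches the reasoning above.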

Release Notes

Changed systemd Restart mode to ensure that the server is restarted in abnormal situations such as OOM.

CLAassistant · 2024-11-05T12:49:04Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Dreamsorcerer · 2024-11-05T15:10:57Z

OK, I have no familiarity with the tests here. Feel free to take it over, create a new PR, or give me some guidance to complete it.

grooverdan · 2024-11-06T00:06:35Z

Thanks for the contribution, @Dreamsorcerer. For testing reference, which systemd version are you using?

There are no real systemd mtr tests at the moment, but we do have basic ones elsewhere (we'll take care of those).

Docs for reference: https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#Restart=

An OOM kill is a SIGKILL, which should count as an unclean signal, so I was assuming it would restart in the current state, but I assume you've tested and the behaviour is different? So on-abnormal adds Timeout and Watchdog (unused). Is the OOM happening during startup, hence it's hitting the Timeout criteria?

At one point we were relying on systemd to prevent dual processes running (but now there are mechanisms in MariaDB (MDEV-31568) and systemd v242 (systemd/systemd#11457)).

Can you show the systemd logs around the unit? It might be possible that my systemd change above has caused systemd not to restart, because the OOM-killed mariadbd process is in a defunct/zombie state. If this is the case, changing this setting won't help. I'll need to construct a local test case to test this properly.

As an aside, MDEV-34753, now that it's fixed in yesterday's release, should avoid some OOM conditions if there is only transient memory pressure.

Dreamsorcerer · 2024-11-06T12:24:00Z

> So on-abnormal adds Timeout and Watchdog (unused).

I'm using Debian stable, with whatever systemd version it packages. From the logs, I think it was the watchdog that OOM-killed the process, and that's why on-abnormal seems to work in restarting the service.

From journalctl:

Oct 18 10:24:10 sam-server systemd[1]: mariadb.service: A process of this unit >
░░ Subject: A process of mariadb.service unit has been killed by the OOM killer.
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A process of unit @UNIT has been killed by the Linux kernel out-of-memory (O>
░░ killer logic. This usually indicates that the system is low on memory and th>
░░ memory needed to be freed. A process associated with mariadb.service has bee>
░░ as the best process to terminate and has been forcibly terminated by the
░░ kernel.
░░ 
░░ Note that the memory pressure might or might not have been caused by mariadb>
Oct 18 10:24:10 sam-server systemd[1]: mariadb.service: Main process exited, co>
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ An ExecStart= process belonging to unit mariadb.service has exited.
░░ 
░░ The process' exit code is 'killed' and its exit status is 9.
Oct 18 10:24:11 sam-server systemd[1]: mariadb.service: Failed with result 'oom>
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ The unit mariadb.service has entered the 'failed' state with result 'oom-kil>
Oct 18 10:24:11 sam-server systemd[1]: mariadb.service: Consumed 15min 52.379s >
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ The unit mariadb.service completed and consumed the indicated resources.
Oct 18 10:24:16 sam-server systemd[1]: mariadb.service: Scheduled restart job, >
░░ Subject: Automatic restarting of a unit has been scheduled
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ Automatic restarting of the unit mariadb.service has been scheduled, as the >
░░ the configured Restart= setting for the unit.

Dreamsorcerer · 2024-11-06T12:26:13Z

Automatic restarting (i.e. the last log) does not happen with on-abort. From the logs, it's clear that systemd knows this has been oom-killed, rather than just knowing the exit code.
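One way to check what systemd itself recorded is `systemctl show mariadb.service -p Result -p Restart -p NRestarts`, which prints KEY=VALUE lines. Below is a minimal sketch of reading that output from Python. `parse_show` and `unit_props` are hypothetical helpers, and the sample text is illustrative rather than captured from the server in this thread.

```python
import subprocess


def parse_show(text):
    """Turn `systemctl show` KEY=VALUE output into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines that have no '='
            props[key] = value
    return props


def unit_props(unit, names=("Result", "Restart", "NRestarts")):
    """Query selected properties of a unit; needs systemctl on the host."""
    args = ["systemctl", "show", unit]
    for name in names:
        args += ["-p", name]
    out = subprocess.run(args, capture_output=True, text=True, check=True).stdout
    return parse_show(out)


# Illustrative sample only (assumed values, not real output from this thread).
SAMPLE = "Restart=on-abort\nResult=oom-kill\nNRestarts=0\n"
props = parse_show(SAMPLE)
print(props["Result"])  # -> oom-kill
```

On a live host, `unit_props("mariadb.service")` would return the same kind of dict, so the `Result` value can be compared before and after changing the `Restart=` setting.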

Systemd: Restart on OOM

b029087

Update [email protected]

24deab7
