Systemd: Restart on OOM #3611

Open: wants to merge 2 commits into main

Dreamsorcerer

Description

After an OOM kill, the process should be restarted by systemd. Prior to this change, that did not happen.

The comment in the file says it avoids on-failure because of config errors, which I assume surface as an unclean exit code. on-abnormal behaves the same as on-failure except that it does not restart on an unclean exit code.

Release Notes

Changed systemd Restart mode to ensure that the server is restarted in abnormal situations such as OOM.

@Dreamsorcerer
Author

OK, I have no familiarity with the tests here. Feel free to take it over, create a new PR, or give me some guidance to complete it.

@grooverdan
Member

Thanks for the contribution, @Dreamsorcerer. For testing reference, which systemd version are you using?

There are no real systemd mtr tests at the moment, but we do have basic ones elsewhere (we'll take care of those).

Docs for reference: https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#Restart=

An OOM kill is a SIGKILL, which should count as an unclean signal, so I was assuming it would restart with the current setting, but I assume you've tested and the behaviour is different? on-abnormal adds Timeout and Watchdog (unused). Is the OOM happening during startup, so that it's hitting the Timeout criterion?
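
For reference, the Restart= values under discussion differ as follows according to the table in the systemd.service documentation linked above. This is only a sketch written as a commented unit fragment; it does not settle how systemd classifies an OOM kill, which is the open question here.

```ini
[Service]
# Exit causes that trigger a restart, per the systemd.service man page:
#
#   Restart=on-abort     unclean signal only
#   Restart=on-abnormal  unclean signal, timeout, watchdog
#   Restart=on-failure   unclean exit code, unclean signal, timeout, watchdog
#
# on-abnormal therefore still skips restarts after a non-zero exit code,
# which is how a configuration error is assumed to surface in this thread.
Restart=on-abnormal
```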

At one point we were relying on systemd to prevent dual processes running, but now there are mechanisms in MariaDB (MDEV-31568) and in systemd v242 (systemd/systemd#11457).

Can you show the systemd logs around the unit? It might be possible that my systemd change above has caused systemd not to restart, because the OOM-killed mariadbd process is in a defunct/zombie state. If this is the case, changing this setting won't help. I'll need to construct a local test case to test this properly.

As an aside, MDEV-34753, now that it's fixed in yesterday's release, should avoid some OOM conditions if there is only transient memory pressure.

@Dreamsorcerer
Author

> So on-abnormal adds Timeout and Watchdog (unused).

I'm using Debian stable, with whatever package versions ship with it. I think from the logs, it was the watchdog that OOM killed the process, and that's why on-abnormal seems to work in restarting the service.

From journalctl:

Oct 18 10:24:10 sam-server systemd[1]: mariadb.service: A process of this unit >
░░ Subject: A process of mariadb.service unit has been killed by the OOM killer.
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A process of unit @UNIT has been killed by the Linux kernel out-of-memory (O>
░░ killer logic. This usually indicates that the system is low on memory and th>
░░ memory needed to be freed. A process associated with mariadb.service has bee>
░░ as the best process to terminate and has been forcibly terminated by the
░░ kernel.
░░ 
░░ Note that the memory pressure might or might not have been caused by mariadb>
Oct 18 10:24:10 sam-server systemd[1]: mariadb.service: Main process exited, co>
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ An ExecStart= process belonging to unit mariadb.service has exited.
░░ 
░░ The process' exit code is 'killed' and its exit status is 9.
Oct 18 10:24:11 sam-server systemd[1]: mariadb.service: Failed with result 'oom>
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ The unit mariadb.service has entered the 'failed' state with result 'oom-kil>
Oct 18 10:24:11 sam-server systemd[1]: mariadb.service: Consumed 15min 52.379s >
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ The unit mariadb.service completed and consumed the indicated resources.
Oct 18 10:24:16 sam-server systemd[1]: mariadb.service: Scheduled restart job, >
░░ Subject: Automatic restarting of a unit has been scheduled
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ Automatic restarting of the unit mariadb.service has been scheduled, as the >
░░ the configured Restart= setting for the unit.

@Dreamsorcerer
Author

Automatic restarting (i.e. the last log) does not happen with on-abort. From the logs, it's clear that systemd knows this has been oom-killed, rather than just knowing the exit code.
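
To try the setting on an installed system without editing the packaged unit file, a drop-in override is one option. A minimal sketch, assuming the unit is named mariadb.service as in the logs above; the file name override.conf is just the conventional default that `systemctl edit mariadb.service` creates.

```ini
# /etc/systemd/system/mariadb.service.d/override.conf
[Service]
# Assumption: the packaged unit uses a narrower Restart= value;
# this drop-in overrides it for local testing only.
Restart=on-abnormal
```

If the file is written by hand rather than through `systemctl edit`, `systemctl daemon-reload` is needed before the change takes effect, and `systemctl show mariadb.service -p Restart` should then report the effective value.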


3 participants