Naemon stops executing checks and doesn't respawn Core Worker processes #418
Comments
Besides restarting the workers, it would be pretty interesting to know why the workers fail. Is it reproducible? If so, attaching strace to one of the workers might reveal something.
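As a rough illustration of the strace suggestion, here is a small Python helper that only builds the command line for attaching to a running worker. The helper name `strace_command` and the output path are hypothetical. The `strace` options used (`-f`, `-tt`, `-p`, `-o`) are standard ones.

```python
def strace_command(pid, out_path):
    """Build an strace invocation that attaches to a running process.

    -f   also trace child processes created by fork/clone
    -tt  prefix each line with a wall-clock timestamp (microseconds)
    -p   attach to the process with the given PID
    -o   write the trace to a file instead of stderr
    """
    return ["strace", "-f", "-tt", "-p", str(int(pid)), "-o", out_path]
```

The resulting list can be passed to `subprocess.run` by someone with permission to trace the process. The PID has to be looked up on the affected system first.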
Unfortunately, it is not reproducible. I agree that it would be interesting to know why the workers fail, but there was no worker process left to attach to. I will tell my team to attach strace to one or both of the leftover processes if this issue appears again. It looked like this:
Just to clarify: the root cause is not reproducible, but if you kill all the worker processes, you will see that Naemon doesn't respawn them, as described here.
One of our users had the same issue a while ago. This happened with Naemon 1.2.3, and this was the check plugin that managed to kill the worker process itself: Unfortunately, I had no access to the system for further debugging.
I have been hit by the same bug. I haven't yet found how to reproduce it, as it is currently a production system. It would be nice, though, to restart the workers automatically on failure. Currently I'm just checking the logs for the error and restarting the instance when necessary.
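A workaround of this kind could also look at the process list instead of the logs. The following is only a sketch: it assumes that worker processes can be recognized by a `--worker` argument on a `naemon` command line, which should be verified against `ps` output on the affected system. The function names are hypothetical, and the sketch reads `/proc`, so it is Linux-only.

```python
import glob


def read_cmdlines():
    """Return the command lines of all running processes (Linux /proc)."""
    cmdlines = []
    for path in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            with open(path, "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited while we were scanning
        if raw:
            # Arguments are NUL-separated in /proc/<pid>/cmdline.
            text = raw.replace(b"\0", b" ").decode(errors="replace")
            cmdlines.append(text.strip())
    return cmdlines


def count_workers(cmdlines, marker="--worker"):
    """Count command lines that look like Naemon workers.

    ASSUMPTION: workers carry `marker` as a separate argument.
    """
    return sum(1 for c in cmdlines if "naemon" in c and marker in c.split())


def needs_restart(cmdlines, minimum=1):
    """True if fewer than `minimum` worker processes were found."""
    return count_workers(cmdlines) < minimum
```

A cron job or monitoring check could call `needs_restart(read_cmdlines())` and restart the instance when it returns `True`. The restart itself is deliberately left out here, since it depends on the installation.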
On a system running `Naemon Core 1.3.0` we ran into the issue that Naemon stops executing checks. There were no more worker processes. I have not seen anything suspicious in the system journal or dmesg: no `SIGSEGV` and no `oom_killer` in action.
Log snippet of the Naemon log (host and service names anonymized):
Independent of the root cause of the broken Core Worker processes, I think Naemon should respawn the Core Worker processes if there are none left or fewer than desired.
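To illustrate the behaviour I have in mind, here is a minimal, generic sketch of a pool that tops itself up to a desired number of child processes. It is not Naemon's implementation, which lives in C in `workers.c`. The class name `WorkerPool` and the sleeping placeholder child are made up for the example.

```python
import subprocess
import sys


class WorkerPool:
    """Keep `desired` child processes running (illustrative only)."""

    def __init__(self, desired, argv=None):
        self.desired = desired
        # Placeholder worker: a child process that just sleeps.
        self.argv = argv or [sys.executable, "-c", "import time; time.sleep(60)"]
        self.workers = []

    def alive(self):
        """Return the workers that have not exited yet."""
        return [w for w in self.workers if w.poll() is None]

    def ensure(self):
        """Drop dead workers and spawn replacements.

        Returns the number of newly spawned workers.
        """
        self.workers = self.alive()
        spawned = 0
        while len(self.workers) < self.desired:
            self.workers.append(subprocess.Popen(self.argv))
            spawned += 1
        return spawned

    def shutdown(self):
        """Kill and reap all workers."""
        for w in self.workers:
            if w.poll() is None:
                w.kill()
            w.wait()
        self.workers = []
```

Calling `ensure()` periodically, or whenever a child exit is noticed, brings the pool back to the desired size even if every worker was killed.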
This also happens with a manual installation of the current version of the master branch, `Naemon Core 1.4.1.g2916d626.20230223`.
Found this to reproduce the issue.
After looking into the source code, I expected to hit the following if condition, which doesn't happen:
naemon-core/src/naemon/workers.c, lines 431 to 436 in 2916d62
I will provide a fix for the respawning via a pull request.