Fixed: Naemon stops executing checks and doesn't respawn Core Worker #419 #421
Conversation
Fixed the indentation stuff...
…r process dies. This will clean up any `<defunct>` processes. I'm not sure why the WPROC_FORCE flag exists at all. Maybe this could cause problems when external workers, not spawned by Naemon itself, connect to the Query Handler? The original commit ec4dc03 from 10 years ago did not contain more information about this. Signed-off-by: nook24 <[email protected]>
Looks good in general. I only have a small change request for `get_desired_workers`.
I tried, but couldn't get it working. Whenever I kill a worker process, I indeed get a message that a new worker got spawned, but a couple of seconds later the main Naemon process aborts.
I assume it is because we removed the worker structure, but it is still referenced in some jobs.
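A minimal sketch of the suspected use-after-free, using hypothetical simplified structs (the real worker and job types in workers.c differ):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-ins for the real structs in
 * workers.c; the actual types and field names differ. */
struct worker_process { int pid; };
struct job { struct worker_process *wp; };

int main(void)
{
    struct worker_process *wp = calloc(1, sizeof(*wp));
    struct job job = { .wp = wp };

    /* The worker dies and the respawn code frees its structure... */
    free(wp);

    /* ...but a still-running job keeps a pointer to it. Any later
     * access (e.g. in the check-result callback) is a use-after-free
     * and can crash exactly the way described above. */
    printf("%d\n", job.wp->pid); /* undefined behavior */
    return 0;
}
```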
…and gets respawned Signed-off-by: nook24 <[email protected]>
I have tried to wrap my head around this, but I'm not sure how valid my results are :)
(naemon-core/src/naemon/workers.c, lines 162 to 170 at d6c4c6e)
Without the code at naemon-core/src/naemon/checks_host.c, lines 630 to 633 (at d6c4c6e), the segmentation fault occurs at the end of the `handle_worker_host_check` callback, when the `cr` should get freed.
The part I'm unsure about is this code: naemon-core/src/naemon/workers.c, lines 176 to 187 (at d6c4c6e).
With the current change, Naemon was not crashing anymore. I created this sketchy script for testing:

```php
<?php
// Kill every running "naemon --worker" process. Sketchy test helper:
// it relies on the column spacing of ps output to find the PID.
exec('ps -eaf | grep "naemon --worker" | grep -v "grep"', $output);
foreach ($output as $line) {
    $line = explode(" ", $line);
    $pid = $line[6];
    if (is_numeric($pid)) {
        echo "kill $pid\n";
        exec('kill ' . $pid);
    }
}
```
Previously we used a fixed-size 8k buffer when parsing command line arguments. This sounds like a lot, but there are command lines bigger than 8k, and they were simply cut off without any warning. Instead, we now use a dynamically sized buffer matching the size of the raw command.
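A minimal sketch of the idea, with a hypothetical helper name (the actual parsing code in naemon differs):

```c
#include <stdlib.h>
#include <string.h>

/* Before: a fixed 8k buffer silently truncated longer command lines:
 *     char buf[8192];
 * After: size the buffer from the raw command itself, so nothing is
 * cut off regardless of length. */
char *copy_raw_command(const char *raw_command)
{
    size_t len = strlen(raw_command) + 1; /* include the terminating NUL */
    char *buf = malloc(len);
    if (buf == NULL)
        return NULL; /* caller must handle allocation failure */
    memcpy(buf, raw_command, len);
    return buf; /* caller is responsible for free() */
}
```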
…ation Signed-off-by: nook24 <[email protected]>
Set LD_LIBRARY_PATH when running inside of VS Code to the correct loc…
Print the current and expected API version number along with the error. This gives a hint about whether the NEB module is too new or too old.
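A sketch of what such an error message could look like; the constant and variable names here are illustrative, not naemon's actual identifiers:

```c
#include <stdio.h>

#define CORE_API_VERSION 5 /* illustrative; not the real constant name */

static void report_api_mismatch(int module_api_version)
{
    /* Printing both numbers hints whether the NEB module is too old
     * (module < core) or too new (module > core). */
    fprintf(stderr,
            "Error: NEB module API version %d does not match core API version %d (module is too %s)\n",
            module_api_version, CORE_API_VERSION,
            module_api_version < CORE_API_VERSION ? "old" : "new");
}
```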
This (should) fix this build error on OBS:
```
[ 139s] /.build_patchrpmcheck_scr: line 55: systemd-tmpfiles: command not found
[ 139s] postinstall script of naemon-core-1.4.1-lp154.18.1.x86_64.rpm failed
```
Is there anything we can do to get this merged?
Is it stable now? Have you run any tests lately?
It's a good standard to do so, and in fact we already do this in several places, e.g. for the status.dat. This ensures the file is complete and fully written before it is used. The issue here is that Naemon starts without any errors if the precached file is empty for any reason, except it then has zero hosts/services and removes all existing states/downtimes/comments. Signed-off-by: Sven Nierlein <[email protected]>
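For reference, a minimal sketch of the write-to-temp-then-rename pattern described here (function and parameter names are illustrative):

```c
#include <stdio.h>

/* Write new contents to a temporary file first, then atomically rename
 * it over the real file. Readers either see the complete old file or
 * the complete new one, never a half-written file. */
int write_file_atomically(const char *path, const char *tmp_path,
                          const char *contents)
{
    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    int write_failed = (fputs(contents, fp) == EOF);
    if (fclose(fp) == EOF || write_failed) {
        remove(tmp_path);
        return -1;
    }

    /* rename(2) is atomic on POSIX when both paths live on the same
     * filesystem. */
    if (rename(tmp_path, path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}
```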
left over from copy/pasted code.
I will set up a test system and report.
Good morning, a quick update on this.

My test system: On November 2, I deployed the patched version of Naemon. To make sure that the re-spawning of dead worker processes works as expected, a cronjob kills random Naemon worker processes every 5 minutes. I have only killed the worker processes.

Results: The only thing I noticed is that the "Active service check latency" metric breaks. [Screenshot: Naemon PR 421]
I guess this would need a fix for the check latency before it can be merged? Besides that, this is a great feature imo.
What exactly breaks the latency here?
Not sure why it breaks, I was merely referring to the comments/screenshots by @nook24 above.
After reading all the event code over and over again, it could be that the latency issue is related to this change: 978bbf3. It is not part of Naemon 1.4.1, which is the version I used for my baseline measurements. The current master branch / this branch is no longer reporting 0 values for latency; I was able to get values from 1 ms up to 55 ms. The code describes the latency value as follows:
If I understand this correctly, the latency value represents how long the event was stored in the queue before a free worker picked it up for processing. Since my test system is pretty small and does not have much to do, it should not take too long for a free worker to handle the event. However, I'm not sure what this tries to achieve: naemon-core/src/naemon/events.c, line 375 (at 83b25ec).
Latency will always be in a range of 0-1000 ms, because scheduled checks have 1-second resolution but are randomized at millisecond resolution into the run queue.
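A tiny simulation of that explanation, assuming whole-second scheduling plus a random millisecond offset into the run queue (an illustration, not naemon's actual scheduling code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Checks are scheduled at 1-second resolution but queued at a random
 * millisecond within that second, so even an idle system measures a
 * "latency" somewhere between 0 and 1000 ms. */
int main(void)
{
    srand((unsigned)time(NULL));
    for (int i = 0; i < 5; i++) {
        long scheduled_ms = 0;          /* whole-second boundary      */
        long queued_ms = rand() % 1000; /* randomized queue insertion */
        printf("simulated latency: %ld ms\n", queued_ms - scheduled_ms);
    }
    return 0;
}
```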
Just one thing, could you remove the "TODO" thing :-) Otherwise I am fine...
Signed-off-by: nook24 <[email protected]>
Oops, I have removed this.
I'm not sure if it is "broken" or not. As long as you are fine with the reported value for latency, I'm fine with it as well, I guess.
Add Respawn of dead core worker naemon#421
Add Respawn of dead core worker #421
While reviewing #419, I noticed that Naemon is not taking care of its dead worker processes. Whenever a worker process dies, it will become a `<defunct>` zombie process. The reason for this was that Naemon did not call `waitpid()` when a worker process dies. I did not find any explanation why this was implemented this way. Could `waitpid(wp->pid, &i, 0)` hang (wait forever) if the worker process was not forked from the currently running Naemon? We could probably replace this with `waitpid(wp->pid, &i, WNOHANG)`, but I'm not 100% sure about this. Best I could find about this was: NagiosEnterprises/nagioscore#635
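For illustration, a minimal sketch of the non-blocking variant discussed above; the wrapper function and logging are assumptions, not the actual patch:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>

/* Reap a dead worker without blocking. With WNOHANG, waitpid() returns
 * 0 immediately if the child has not exited yet, and fails with ECHILD
 * (instead of blocking) if the pid was never forked by this process. */
static void reap_worker(pid_t pid)
{
    int status = 0;
    pid_t reaped = waitpid(pid, &status, WNOHANG);

    if (reaped == pid) {
        /* child reaped; no <defunct> entry remains */
        fprintf(stderr, "worker %d exited (status %d)\n",
                (int)pid, WEXITSTATUS(status));
    } else if (reaped == 0) {
        /* child has not exited yet; try again later */
    } else {
        /* reaped < 0: e.g. ECHILD if pid is not our child */
        perror("waitpid");
    }
}
```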