
Fixed: Naemon stops executing checks and doesn't respawn Core Worker #419 #421

Merged
merged 19 commits into master from pr/419
Jan 3, 2024

Conversation

@nook24 (Member) commented Mar 2, 2023

While reviewing #419, I noticed that Naemon is not taking care of its dead worker processes. Whenever a worker process dies, it becomes a <defunct> zombie process.
The reason for this was that Naemon did not call waitpid() when a worker process dies. I did not find any explanation why it was implemented this way.
Could waitpid(wp->pid, &i, 0) hang (wait forever) if the worker process was not forked from the currently running Naemon? We could probably replace this with waitpid(wp->pid, &i, WNOHANG), but I'm not 100% sure about this.
The best I could find about this was: NagiosEnterprises/nagioscore#635
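
For illustration, a minimal sketch of the non-blocking variant (not Naemon's actual code; the function and its surroundings are simplified assumptions):

```c
/* Minimal sketch, not Naemon's actual code: reap a dead worker without
 * risking a hang. A blocking waitpid(pid, &status, 0) could wait forever
 * if the pid was never forked by this process; WNOHANG returns immediately. */
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>

static int reap_worker(pid_t worker_pid)
{
	int status = 0;
	pid_t reaped = waitpid(worker_pid, &status, WNOHANG);

	if (reaped == worker_pid)
		return 0;  /* child reaped, no <defunct> entry remains */
	if (reaped == 0)
		return 1;  /* child has not exited (yet) */
	return (errno == ECHILD) ? 2 : -1;  /* not our child / other error */
}
```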

ccztux and others added 5 commits February 27, 2023 16:57
Fixed the indentation stuff...
Comment changed.
…r process dies.

This will clean up any <defunct> processes. I'm not sure why the WPROC_FORCE flag exists
at all. Maybe this could cause problems when external workers not spawned by Naemon itself connect to the Query Handler?
The original commit ec4dc03 from 10 years ago does not contain more information about this.

Signed-off-by: nook24 <[email protected]>

@sni (Contributor) left a comment

Looks good in general. I only have a small change request for the get_desired_workers function.

Review comment on src/naemon/workers.c (resolved)
@nook24 nook24 requested a review from sni March 15, 2023 15:28
@sni (Contributor) commented Mar 17, 2023

I tried it, but couldn't get it working. Whenever I kill a worker process I do indeed get a message that a new worker got spawned, but a couple of seconds later the main naemon process aborts.

```
Program received signal SIGSEGV, Segmentation fault.
0x00007f475305bdfd in __strlen_avx2 () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glib2-2.68.4-5.el9.x86_64 libicu-67.1-9.el9.x86_64 libstdc++-11.3.1-2.1.el9.x86_64 pcre-8.44-3.el9.3.x86_64
(gdb) bt
#0  0x00007f475305bdfd in __strlen_avx2 () from /lib64/libc.so.6
#1  0x00007f4753456b68 in dopr (buffer=0x0, maxlen=0, format=0x7f4753459478 "Processing check result file: '%s'\n", args_in=0x7ffe7f668e10) at lib/snprintf.c:712
#2  0x00007f4753457d14 in smb_vsnprintf (str=0x0, count=0, fmt=0x7f4753459418 "Error: Unknown object check type for checkresult: %d; (host_name: %s; service_description: %s)\n", args=0x7ffe7f668e10) at lib/snprintf.c:1197
#3  0x00007f4753457e21 in vasprintf (ptr=0x7ffe7f668e50, format=0x7f4753459418 "Error: Unknown object check type for checkresult: %d; (host_name: %s; service_description: %s)\n", ap=0x7ffe7f668e58) at lib/snprintf.c:1232
#4  0x00007f47533feae9 in nm_log (data_type=1, fmt=0x7f4753459418 "Error: Unknown object check type for checkresult: %d; (host_name: %s; service_description: %s)\n") at src/naemon/logging.c:165
#5  0x00007f47533dcf04 in process_check_result (cr=0x14cbbd0) at src/naemon/checks.c:345
#6  0x00007f47533e21fb in handle_worker_service_check (wpres=0x7ffe7f668ff0, arg=0x14cbbd0, flags=0) at src/naemon/checks_service.c:404
#7  0x00007f475342317c in run_job_callback (job=0x1230520, wpres=0x7ffe7f668ff0, val=0) at src/naemon/workers.c:167
#8  0x00007f4753424046 in handle_worker_result (sd=7, events=1, arg=0x11a92c0) at src/naemon/workers.c:536
#9  0x00007f47534516d9 in iobroker_poll (iobs=0x11a93c0, timeout=785) at lib/iobroker.c:376
#10 0x00007f47533fcd21 in event_poll_full (iobs=0x11a93c0, timeout_ms=1500) at src/naemon/events.c:331
#11 0x00007f47533fce87 in event_poll () at src/naemon/events.c:384
#12 0x0000000000403916 in event_execution_loop () at src/naemon/naemon.c:119
#13 0x0000000000404ca2 in main (argc=2, argv=0x7ffe7f6694f8) at src/naemon/naemon.c:708
```

I assume it is because we removed the worker structure, but it is still referenced in some jobs.

@nook24 (Member, Author) commented Mar 23, 2023

I have tried to wrap my head around this, but I'm not sure how valid my results are :)
I think the reason for the crash is that wpres is empty (0x0) when the run_job_callback function gets executed:

```c
static void run_job_callback(struct wproc_job *job, struct wproc_result *wpres, int val)
{
	if (!job || !job->callback)
		return;
	if (!wpres) {
		nm_log(NSLOG_RUNTIME_ERROR, "---!!!--- wpres is null or so TODO REMOVE THIS");
		return;
	}
```

Without the return, the callback function static void handle_worker_host_check or static void handle_worker_service_check would get called. At this point the check_result struct contains pure garbage, which leads to the segmentation fault.

```c
static void handle_worker_host_check(wproc_result *wpres, void *arg, int flags)
{
	check_result *cr = (check_result *)arg;
	struct host *hst;
```

The segmentation fault occurs at the end of the handle_worker_host_check callback, when the cr should get freed.

The part I'm unsure about is that destroy_job also calls run_job_callback and passes NULL as wpres. I'm not sure what this is supposed to clean up?

```c
static void destroy_job(gpointer job_)
{
	struct wproc_job *job = job_;
	if (!job)
		return;
	/* call with NULL result to make callback clean things up */
	run_job_callback(job, NULL, 0);
	nm_free(job->command);
	free(job);
}
```

With the current change, Naemon was not crashing anymore. I created this sketchy script for testing

```php
<?php

exec('ps -eaf | grep "naemon --worker" | grep -v "grep"', $output);
foreach ($output as $line) {
	$line = explode(" ", $line);
	$pid = $line[6];
	if (is_numeric($pid)) {
		echo "kill $pid\n";
		exec('kill ' . $pid);
	}
}
```

watch --interval=5 "php kill.php" allows testing with different kill intervals. Naemon ran in my test for roughly 1 hour while the script killed the worker processes every 5 seconds. No crashes or memory leaks so far, but this requires further testing.

sni and others added 6 commits July 6, 2023 15:36
Previously we used a fixed size 8k buffer when parsing command line arguments.
This sounds like a lot, but there are command lines bigger than 8k and they were
simply cut off without any warning.

Instead, we now use a dynamically sized buffer matching the size of the raw command.
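
A rough sketch of that idea (illustrative only, not the actual patch): allocate the working buffer from the length of the raw command instead of using a fixed array.

```c
/* Illustrative only, not the actual patch: size the buffer from the raw
 * command line so nothing is silently truncated at 8k. */
#include <stdlib.h>
#include <string.h>

static char *copy_raw_command(const char *raw_cmd)
{
	size_t len = strlen(raw_cmd);
	char *buf = malloc(len + 1);   /* exactly as large as needed */

	if (!buf)
		return NULL;
	memcpy(buf, raw_cmd, len + 1); /* includes the terminating NUL */
	return buf;                    /* caller frees */
}
```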
Set LD_LIBRARY_PATH when running inside of VS Code to the correct loc…
Print the current and expected API version number along with the error.
This gives a hint about whether the NEB module is too new or too old.
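
A hypothetical sketch of such a message (identifiers here are illustrative, not necessarily Naemon's exact ones):

```c
/* Illustrative only: an error message that prints both version numbers so
 * users can tell whether a NEB module is too new or too old for this core. */
#include <stdio.h>

static void report_api_mismatch(const char *module, int module_api, int core_api)
{
	fprintf(stderr,
	        "Error: NEB module '%s' uses API version %d, but this core expects %d (module is too %s)\n",
	        module, module_api, core_api,
	        module_api > core_api ? "new" : "old");
}
```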
This (should) fix this build error on OBS:
```
[  139s] /.build_patchrpmcheck_scr: line 55: systemd-tmpfiles: command not found
[  139s] postinstall script of naemon-core-1.4.1-lp154.18.1.x86_64.rpm failed
```
@nook24 (Member, Author) commented Sep 18, 2023

Is there anything we can do to get this merged?

@sni (Contributor) commented Oct 12, 2023

Is it stable now? Did you run any tests lately?

It's a good standard to do so and, in fact, we already do this in several
places, e.g. for the status.dat. This ensures the file is ready and completely
written before it is used.

The issue here is that naemon starts without any issues if the precached file
is empty for any reason. Except that it then has zero hosts/services and removes all
existing states/downtimes/comments.

Signed-off-by: Sven Nierlein <[email protected]>
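
A minimal sketch of the write-to-temp-then-rename pattern this commit refers to (paths and error handling are illustrative, not the actual patch):

```c
/* Minimal sketch of write-to-temp-then-rename; illustrative only. */
#include <stdio.h>

static int write_file_atomically(const char *path, const char *contents)
{
	char tmp_path[4096];
	FILE *fp;

	snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", path);

	fp = fopen(tmp_path, "w");
	if (!fp)
		return -1;
	if (fputs(contents, fp) == EOF || fclose(fp) != 0) {
		remove(tmp_path);
		return -1;
	}

	/* rename() is atomic on the same filesystem, so a reader never sees a
	 * half-written or empty file under the final name */
	return rename(tmp_path, path) == 0 ? 0 : -1;
}
```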
left over from copy/pasted code.
@nook24 (Member, Author) commented Oct 25, 2023

Is it stable now? Did you run any tests lately?

I will set up a test system and report back.

@nook24 (Member, Author) commented Nov 6, 2023

Good morning, a quick update on this.

My test system
I set up a system with the latest Naemon stable version (Ubuntu 22.04) and loaded the Statusengine Event Broker.
This system was running for 7 days to collect baseline metrics.

On November 2, I deployed the patched version of Naemon. To make sure that the re-spawning of dead worker processes works as expected, a cronjob kills random naemon worker processes every 5 minutes. I only killed the worker processes (--worker).

Results
I did not notice any difference. The patched Naemon respawned new worker processes as expected. All checks got executed and I did not see any memory leak.

The only thing I noticed is that the "Active service check latency" metric breaks.

Naemon Stable
[Screenshot: just_plain_naemon]

Naemon PR 421
[Screenshot: naemon_killing_pr]

Naemon killing cronjob
[Screenshot: naemon_kill_tmux]

Naemonstats
[Screenshot]

@nook24 (Member, Author) commented Nov 30, 2023

3 weeks later, Naemon is still running fine and re-spawns the randomly killed workers every 5 minutes.
[Screenshot]

@jacobbaungard (Contributor) commented Nov 30, 2023

I guess this would need a fix for the check latency before it can be merged? Besides that, this is a great feature imo.

@sni (Contributor) commented Nov 30, 2023

What exactly breaks the latency here?

@jacobbaungard (Contributor)

What exactly breaks the latency here?

Not sure why it breaks, I was merely referring to the comments/screenshots by @nook24 above.

@nook24 (Member, Author) commented Dec 1, 2023

After reading all the event code over and over again, it could be that the latency issue is related to this change: 978bbf3

It is not part of Naemon 1.4.1, which is the version I used for my baseline measurements.
@sni mentioned in his commit message that the old calculation for latency is wrong. My test system has 21 services (13 active / 8 passive), and an active check latency of 603 ms (as Naemon 1.4.1 reports) sounds way too high if I think about it.

The current master branch / this branch is not reporting 0 values for latency; I was also able to get values from 1 ms up to 55 ms.
[Screenshot]

The code describes the latency value as follows:

/* latency is how long the event lagged behind the event queue */

If I understand this correctly, the latency value represents how long the event was stored in the queue before a free worker picked it up for processing. Since my test system is pretty small and does not have much to do, it should not take too long for a free worker to handle the event.

However, I'm not sure what this tries to achieve:

evprop.attributes.timed.latency = -time_diff / 1000.0;

@sni (Contributor) commented Dec 1, 2023

Latency will always be in a range of 0-1000 ms, because the scheduled checks have 1-second resolution but are randomized at millisecond resolution into the run queue.
If that is what is "broken", then it was like that before and fixing it would be purely cosmetic.
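
A purely numeric illustration of that range (made-up times, not Naemon code):

```c
/* Illustrative only: a check scheduled at a whole second and picked up a few
 * hundred milliseconds later, due to millisecond randomization in the queue. */
#include <stdio.h>

int main(void)
{
	double scheduled = 100.000; /* check scheduled at a whole second */
	double picked_up = 100.437; /* randomized pick-up time in the run queue */

	printf("latency: %.0f ms\n", (picked_up - scheduled) * 1000.0); /* 437 ms */
	return 0;
}
```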

@sni (Contributor) commented Dec 1, 2023

Just one thing, could you remove the "TODO" thing :-) Otherwise I am fine...

@nook24 (Member, Author) commented Dec 1, 2023

Just one thing, could you remove the "TODO" thing :-) Otherwise I am fine...

Oops, I have removed it.

If that is what is "broken", then it was like that before and fixing it would be purely cosmetic.

I'm not sure if it is "broken" or not. As long as you are fine with the reported value for latency, I'm fine with it as well, I guess.

nook24 merged commit 50b573a into master on Jan 3, 2024
19 checks passed
nook24 added a commit to sni/naemon-core that referenced this pull request Jan 3, 2024
Add Respawn of dead core worker naemon#421
sni pushed a commit that referenced this pull request Jan 4, 2024
Add Respawn of dead core worker #421
@sni sni deleted the pr/419 branch January 4, 2024 14:31