Naemon stops executing checks and doesnt respawn Core Worker processes #418

ccztux · 2023-02-27T15:53:55Z

On a system running Naemon Core 1.3.0, we ran into an issue where Naemon stops executing checks. There were no worker processes left. I have not seen anything suspicious in the system journal or dmesg: no SIGSEGV and no oom_killer activity.

Log snippet of the Naemon log (host and servicenames anonymized):

[1677024519] Warning:  Check of host 'myhost' did not exit properly!
[1677024519] HOST ALERT: myhost;DOWN;SOFT;2;(Host check did not exit properly)
[1677024520] wproc: Socket to worker Core Worker 4261 broken, removing
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN - check_nwc_health timed out after 50 seconds
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN - check_nwc_health timed out after 50 seconds
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] HOST ALERT: myhost;DOWN;SOFT;3;CRITICAL - 10.0.0.63: rta nan, lost 100%
[1677024521] Warning:  Check of service 'myservice' on host 'myhost' did not exit properly!
[1677024521] SERVICE ALERT: myhost;myservice;(Service check did not exit properly)
[1677024521] Warning:  Check of host 'myhost' did not exit properly!
[1677024521] HOST ALERT: myhost;DOWN;SOFT;2;(Host check did not exit properly)
[1677024521] Warning:  Check of service 'myservice' on host 'myhost' did not exit properly!
[1677024521] SERVICE ALERT: myhost;myservice;(Service check did not exit properly)
[1677024521] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024521] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4258 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4258 broken, removing
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4260 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4260 broken, removing
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4259 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4259 broken, removing
[1677024526] Warning:  Check of host 'myhost' did not exit properly!
[1677024526] HOST ALERT: myhost;DOWN;SOFT;3;(Host check did not exit properly)
[1677024526] wproc: nm_bufferqueue_read() from Core Worker 4257 returned -1: Connection reset by peer
[1677024526] wproc: Socket to worker Core Worker 4257 broken, removing
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)

Independent of the root cause of the broken Core Worker processes, I think Naemon should respawn Core Worker processes if there are none left or fewer than desired.

This also happens with a manual installation of the current master branch, Naemon Core 1.4.1.g2916d626.20230223.

Found this to reproduce the issue.

After looking into the source code, I expected to hit the following if condition, but that does not happen:

naemon-core/src/naemon/workers.c

Lines 431 to 436 in 2916d62

    
    if (workers.len <= 0) {
        /* there aren't global workers left, we can't run any more checks
         * we should try respawning a few of the standard ones
         */
        nm_log(NSLOG_RUNTIME_ERROR, "wproc: All our workers are dead, we can't do anything!");
    }

I will provide a fix for the respawning thing via a pull request.
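
To illustrate the idea of topping the pool back up whenever there are fewer workers than desired, here is a minimal sketch. It is not the code from the pull request and not Naemon's real API: `struct worker_pool`, `spawn_fn`, `respawn_missing_workers()` and `count_spawn()` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Naemon's worker list; not the real struct. */
struct worker_pool {
    size_t len;     /* workers currently alive */
    size_t desired; /* configured number of workers */
};

/* Hypothetical spawn callback: returns 0 on success, non-zero on failure. */
typedef int (*spawn_fn)(void *ctx);

/*
 * Start workers until the pool is back at its desired size.
 * Returns the number of workers started. Stops at the first spawn
 * failure so the caller can retry on a later iteration.
 */
size_t respawn_missing_workers(struct worker_pool *pool, spawn_fn spawn, void *ctx)
{
    size_t started = 0;

    assert(pool != NULL && spawn != NULL);

    while (pool->len < pool->desired) {
        if (spawn(ctx) != 0)
            break;
        pool->len++;
        started++;
    }
    return started;
}

/* Example stub: counts how often it was called instead of forking. */
int count_spawn(void *ctx)
{
    ++*(int *)ctx;
    return 0;
}
```

In a real daemon the callback would fork a worker and register its socket. The point of the sketch is only that the check compares against the desired count rather than logging an error when the list is empty.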


sni · 2023-02-27T15:57:05Z

Besides restarting the workers, it would be pretty interesting to know why the workers fail. Is it reproducible? If so, attaching strace to one of the workers might reveal something.


ccztux · 2023-02-27T16:02:39Z

Unfortunately it is not reproducible.

I agree with that. Unfortunately there was no worker process left. I will tell my team that we should attach strace to one or both of the leftover processes if this issue appears again.

It looked like this:

17:01:41 ✓ LAB-CL01 root@cl01 ~/git-repos/naemon-core # systemctl status naemon
● naemon.service - Naemon Monitoring Daemon
   Loaded: loaded (/usr/lib/systemd/system/naemon.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2023-02-27 17:01:41 CET; 2s ago
     Docs: http://naemon.org/documentation
  Process: 4711 ExecStart=/usr/bin/naemon --daemon /etc/naemon/naemon.cfg (code=exited, status=0/SUCCESS)
  Process: 4665 ExecStartPre=/bin/su naemon --login --shell=/bin/sh --command=/usr/bin/naemon --verify-config /etc/naemon/naemon.cfg (code=exited, status=0/SUCCESS)
  Process: 4663 ExecStartPre=/usr/bin/chown -R naemon:naemon /var/run/naemon/ (code=exited, status=0/SUCCESS)
  Process: 4661 ExecStartPre=/usr/bin/mkdir -p /var/run/naemon (code=exited, status=0/SUCCESS)
 Main PID: 4713 (naemon)
   CGroup: /system.slice/naemon.service
           ├─4713 /usr/bin/naemon --daemon /etc/naemon/naemon.cfg
           └─4719 /usr/bin/naemon --daemon /etc/naemon/naemon.cfg

Feb 27 17:01:41 cl01 systemd[1]: Stopped Naemon Monitoring Daemon.
Feb 27 17:01:41 cl01 systemd[1]: Starting Naemon Monitoring Daemon...
Feb 27 17:01:41 cl01 su[4665]: (to naemon) root on none
Feb 27 17:01:41 cl01 systemd[1]: Started Naemon Monitoring Daemon.

ccztux · 2023-02-27T17:17:49Z

Just to clarify: the root cause is not reproducible, but if you kill all the worker processes you will see that Naemon doesn't respawn them, as described here.

nook24 · 2023-02-27T18:23:13Z

One of our users had the same issue a while ago. This happened with Naemon 1.2.3, and this was the check plugin that managed to kill the worker process itself:
it-novum/openITCOCKPIT#1159 (comment)

Unfortunately I had no access to the system for further debugging.

fermino · 2024-04-10T12:12:21Z

I have been hit by the same bug. I haven't yet found a way to reproduce it, as it is currently a production system. It would be nice, though, to restart the workers automatically on failure. Currently I'm just checking the logs for the error and restarting the instance when necessary.
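
As a rough illustration of that kind of log check, the sketch below flags lines that match the two symptoms quoted in the log snippet at the top of this issue. The function name `is_worker_failure_line()` and the choice of patterns are assumptions made for this example. How to read the log and restart the instance is left out.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if a Naemon log line contains one of the symptoms
 * quoted in this issue. The pattern list is an assumption based on
 * the log snippet above, not an exhaustive list of failure messages.
 */
bool is_worker_failure_line(const char *line)
{
    static const char *const patterns[] = {
        "wproc: Socket to worker", /* "... broken, removing" */
        "to worker (ret=-2)",      /* "Unable to send check for ..." */
    };

    if (line == NULL)
        return false;

    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        if (strstr(line, patterns[i]) != NULL)
            return true;
    }
    return false;
}
```

A watchdog could feed each new log line to this function and trigger a restart once a match appears. Matching on plain substrings keeps it independent of the timestamp prefix.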

nook24 · 2024-04-10T17:01:49Z

Which Naemon version are you using, @fermino?
Naemon 1.4.2 should restart dead core workers: #421

ccztux added a commit to ccztux/naemon-core that referenced this issue Feb 27, 2023

Fixed: Naemon stops executing checks and doesnt respawn Core Worker processes (naemon#418)

347c547

ccztux mentioned this issue Feb 27, 2023

Fixed: Naemon stops executing checks and doesnt respawn Core Worker #419

Closed
