What's more, those alarms have a tendency to not get cleared and instead get stuck in our monitoring systems. I assume that happens because the node that originated the alarm gets rebooted while the alarm is active, and when it starts up again it has forgotten about the active alarm.
On Thu, Dec 14, 2017 at 10:40:41PM -0800, Tore Anderson wrote:
> When the RING is undergoing maintenance (kernel upgrades I guess), all
> the RING nodes reboot within a short time window.

Do you have a proposed fix? I thought we had spread out the kernel
updates a bit, but perhaps our randomization isn't good enough? We have:
```
Unattended-Upgrade::Automatic-Reboot "true";
APT::Periodic::RandomSleep "21600";
```
> This tends to cause ring-sqa to detect outages and raise spurious
> alarms. http://sqa.ring.nlnog.net/event/2498 is this morning's
> example.
>
> What's more, those alarms have a tendency to not get cleared and
> instead get stuck in our monitoring systems. I assume that happens
> because the node that originated the alarm gets rebooted while the
> alarm is active, and when it starts up again it has forgotten about
> the active alarm.

The clearing of alarms never made sense. It doesn't mean the problem
cleared or was resolved. You should not count on this meaning anything.
Haven't thought much about how to fix it, but maybe it would be possible to make ring-sqa gracefully leave the mesh somehow when it's being shut down? Otherwise I guess there's not much else to do than to spread out the reboots more than they currently are.
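To get a rough feel for how much spreading would be needed, here is a back-of-the-envelope sketch in Python. It assumes each node reboots once at a uniformly random moment inside the window and that nodes are independent. That may well not match what unattended-upgrades actually does, and the reports in this thread suggest the real reboots cluster much more tightly. The node count and per-reboot downtime are made-up illustrative values; only the 21600-second window comes from the `RandomSleep` setting quoted in this thread.

```python
from math import comb


def expected_nodes_down(node_count, downtime_s, window_s):
    """Average number of nodes down at any instant, assuming every node
    reboots once at a uniformly random time within the window."""
    return node_count * downtime_s / window_s


def prob_at_least_k_down(node_count, downtime_s, window_s, k):
    """Probability that at least k nodes are down at a given instant.

    Treats each node as independently down with probability
    downtime_s / window_s (a binomial model that ignores edge effects
    at the start and end of the window).
    """
    p = downtime_s / window_s
    below_k = sum(
        comb(node_count, i) * p**i * (1 - p) ** (node_count - i)
        for i in range(k)
    )
    return 1 - below_k


if __name__ == "__main__":
    nodes = 400        # hypothetical node count, not taken from the thread
    downtime = 180     # hypothetical seconds a node is unreachable per reboot
    window = 21600     # seconds, the RandomSleep value quoted in the thread

    print("average nodes down:", expected_nodes_down(nodes, downtime, window))
    for k in (5, 10, 20):
        print(f"P(at least {k} down):",
              prob_at_least_k_down(nodes, downtime, window, k))
```

If the numbers from a model like this look harmless while alarms still fire in practice, that would hint that the reboots are not actually spread over the whole window.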
If alarm clearing doesn't make sense to begin with then maybe that functionality should be removed...?