RING maintenance causes spurious and stuck alarms #19

toreanderson · 2017-12-15T06:40:41Z

When the RING is undergoing maintenance (kernel upgrades I guess), all the RING nodes reboot within a short time window.

This tends to cause ring-sqa to detect outages and raise spurious alarms. http://sqa.ring.nlnog.net/event/2498 is this morning's example.

What's more, those alarms have a tendency to not get cleared and instead get stuck in our monitoring systems. I assume that happens because the node that originated the alarm gets rebooted while the alarm is active, and when it starts up again it has forgotten about the active alarm.

job · 2017-12-15T13:05:07Z

On Thu, Dec 14, 2017 at 10:40:41PM -0800, Tore Anderson wrote:

> When the RING is undergoing maintenance (kernel upgrades I guess), all the RING nodes reboot within a short time window.

Do you have a proposed fix? I thought we had spread out the kernel updates stuff a bit, but perhaps our random isn't good enough? We have:

```
Unattended-Upgrade::Automatic-Reboot "true";
APT::Periodic::RandomSleep "21600";
```
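To get a feel for whether a 21600-second (6-hour) random sleep spreads reboots enough, here is a rough simulation sketch. It assumes each node's reboot time is uniformly distributed over the sleep range, which ignores download and install time, and the node count and 5-minute window used below are hypothetical values, not measured from the RING.

```python
import random


def expected_per_window(num_nodes, sleep_range_s=21600, window_s=300):
    """Average number of nodes rebooting in one window, assuming uniform spread."""
    return num_nodes * window_s / sleep_range_s


def max_reboots_in_window(num_nodes, sleep_range_s=21600, window_s=300, seed=None):
    """Simulate one maintenance run and return the largest number of
    reboots that fall inside any span of window_s seconds."""
    rng = random.Random(seed)
    # Assumption: reboot time == random sleep offset, uniform over the range.
    times = sorted(rng.uniform(0, sleep_range_s) for _ in range(num_nodes))
    best = 0
    left = 0
    # Sliding window over the sorted reboot times.
    for right, t in enumerate(times):
        while t - times[left] > window_s:
            left += 1
        best = max(best, right - left + 1)
    return best


if __name__ == "__main__":
    nodes = 400  # hypothetical node count
    print("average per 5-minute window:", expected_per_window(nodes))
    print("worst 5-minute window in one run:", max_reboots_in_window(nodes))
```

With 400 nodes the average alone works out to 400 × 300 / 21600 ≈ 5.6 reboots per 5-minute window, and the worst window in any single run will be higher, so some clustering is expected even when the randomization works as intended.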

> This tends to cause ring-sqa to detect outages and raise spurious alarms. http://sqa.ring.nlnog.net/event/2498 is this morning's example.
>
> What's more, those alarms have a tendency to not get cleared and instead get stuck in our monitoring systems. I assume that happens because the node that originated the alarm gets rebooted while the alarm is active, and when it starts up again it has forgotten about the active alarm.

The clearing of alarms never made sense. It doesn't mean the problem cleared or was resolved. You should not count on this meaning anything.

toreanderson · 2017-12-15T14:15:22Z

Haven't thought much about how to fix it, but maybe it would be possible to make ring-sqa gracefully leave the mesh somehow when it's being shut down? Otherwise I guess there's not much else to do other than spread out the reboots more than they currently are.
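As a rough illustration of the graceful shutdown idea, here is a minimal Python sketch. It is not ring-sqa's actual code: `AlarmState`, `install_shutdown_handler`, the state file path, and the `notify` callback are all hypothetical names. The sketch persists active alarm IDs to disk so they survive a reboot, and clears them when the process receives SIGTERM.

```python
import json
import signal
import sys
from pathlib import Path


class AlarmState:
    """Tracks active alarm IDs and persists them so they survive a restart."""

    def __init__(self, path):
        self.path = Path(path)
        self.active = set()
        # Reload alarms that were active before the last shutdown, if any.
        if self.path.exists():
            self.active = set(json.loads(self.path.read_text()))

    def _save(self):
        self.path.write_text(json.dumps(sorted(self.active)))

    def raise_alarm(self, alarm_id):
        self.active.add(alarm_id)
        self._save()

    def clear_alarm(self, alarm_id, notify):
        # notify is a hypothetical callback that tells the monitoring
        # system the alarm is no longer asserted by this node.
        if alarm_id in self.active:
            notify(alarm_id)
            self.active.discard(alarm_id)
            self._save()

    def clear_all(self, notify):
        # sorted() returns a copy, so clearing while iterating is safe.
        for alarm_id in sorted(self.active):
            self.clear_alarm(alarm_id, notify)


def install_shutdown_handler(state, notify):
    """Clear every active alarm when the process is asked to terminate."""

    def handler(signum, frame):
        state.clear_all(notify)
        sys.exit(0)

    signal.signal(signal.SIGTERM, handler)
```

Persisting the state also covers the case where the node goes down without a clean shutdown: on startup the reloaded `active` set tells the process which alarms it still owns and may need to re-evaluate or clear.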

If alarm clearing doesn't make sense to begin with then maybe that functionality should be removed...?
