old critical alerts in icinga do not go away after upgrade of openshift #26
Thanks for the feedback. We have considered different options for handling stale alerts in Icinga, but it's hard to implement a solution that is correct for arbitrary resend intervals in Alertmanager: since Signalilo does not keep any local state, it cannot really distinguish between a critical alert with a high repeat interval and a stale alert. One possibility would be to make the Icinga checks active, with a recheck interval derived from the repeat interval of the alert in Alertmanager. However, the value of the resend interval would have to be provided to Signalilo as an extra configuration value, as it is not available in the received alerts. (Side-note: potentially the …)

In the meantime, what you can do to clean up stale alerts is to click "check now" in Icinga, which sets the alert status to OK (as …).
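For reference, the resend interval is the repeat_interval of the matching route in the Alertmanager configuration; it is not part of the webhook payload, which is why Signalilo would need it as a separate configuration value. A minimal excerpt (the receiver name and URL are placeholders for illustration):

```yaml
# alertmanager.yml (excerpt) -- receiver name and URL are placeholders
route:
  receiver: signalilo
  # Alertmanager resends still-firing alerts to the receiver at this
  # interval; the value is not included in the webhook payload itself.
  repeat_interval: 4h

receivers:
  - name: signalilo
    webhook_configs:
      - url: http://signalilo.example.com/webhook
        send_resolved: true
```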
I got some time to review your comment regarding this.
At the time Signalilo performs garbage collection, we do not know which alerts are firing, since we do not keep any local state about alerts in Signalilo. Therefore we cannot just look at the firing alerts and GC all alerts which are not firing anymore; we simply don't have the information to determine which alerts are still firing when GC runs. I'm leaning towards the solution of using the Alertmanager resend interval, provided to Signalilo as an additional configuration value with a reasonably high default, to create active Icinga2 services. Those services should be checked with roughly the same frequency as Alertmanager resends the alerts; note that the check interval in Icinga should be a bit higher than the resend interval to allow for some network latency. Since we already implement active checks for "heartbeat" alerts, this change should be doable.
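To illustrate, here is a minimal sketch in Go of how the check interval for such active services could be derived; the SIGNALILO_ALERTMANAGER_RESEND_INTERVAL variable and the fixed padding are assumptions for this example, not existing Signalilo options:

```go
// Sketch only: the environment variable and the padding value are
// assumptions for illustration, not existing Signalilo configuration.
package main

import (
	"fmt"
	"os"
	"time"
)

// checkIntervalFor derives an Icinga2 check interval from the
// Alertmanager resend (repeat) interval, padded a bit so the active
// check does not run before the next resend has had time to arrive.
func checkIntervalFor(resend time.Duration) time.Duration {
	return resend + 5*time.Minute
}

func main() {
	// Hypothetical configuration value with a reasonably high default.
	resend := 4 * time.Hour
	if v := os.Getenv("SIGNALILO_ALERTMANAGER_RESEND_INTERVAL"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			resend = d
		}
	}
	fmt.Printf("check_interval for active services: %s\n", checkIntervalFor(resend))
}
```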
We also have this problem sometimes, but it seems difficult to reproduce. What I have seen is that as soon as a firing alert is seen, Signalilo computes a serviceName (see line 52 in 1ebf0f3) and checks whether a service with that name already exists in Icinga.
If that is not the case, a new service is created in Icinga; otherwise the existing service is updated. Maybe there are cases where the labels are changed?
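A rough sketch of that flow, assuming the service name is derived deterministically from the alert's labels (the hashing details below are illustrative, not Signalilo's actual implementation): if any label value changes between OpenShift versions, the firing alert maps to a new service name, and the old Icinga service is never updated or resolved again.

```go
// Illustrative only: not Signalilo's actual naming scheme.
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// serviceName derives a stable name from the alert's labels. If any
// label value changes (for example after a cluster upgrade), the
// resulting name changes and Icinga sees a "new" service, while the
// old one keeps its last state.
func serviceName(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, ",")))
	return fmt.Sprintf("%s_%x", labels["alertname"], sum[:4])
}

func main() {
	fmt.Println(serviceName(map[string]string{
		"alertname": "KubeNodeNotReady",
		"cluster":   "cluster-a",
	}))
}
```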
First of all, great product: signalilo.
I recently set this up for our OpenShift clusters.
We had the following scenario:
For our OpenShift Cluster A, we had a bunch of critical alerts that showed up in Icinga.
Those alerts were not resolved (as in, from the OpenShift side).
We did an upgrade of our OpenShift cluster, and after that re-added the webhook config in Alertmanager.
So from Alertmanager's perspective, it is now brand new, and the old alerts in Icinga were never resolved (they never got the resolved notification from Alertmanager via Signalilo).
Now in Icinga, we have this OpenShift cluster set up as a Host "Test Host", and although new alerts are coming in and being resolved, the old alerts from the previous version of OpenShift are still there.
I understand that there is a SIGNALILO_ICINGA_KEEP_FOR setting, but that only applies to OK and/or resolved alerts.
I think there should be a criterion such that if an alert is no longer firing from Alertmanager, and there are lingering critical services in Icinga which never received a resolved status, then those should be garbage collected as well.
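A sketch of the criterion I have in mind (all identifiers here are placeholders; as discussed in the comments above, Signalilo currently has no way to enumerate which alerts are still firing, so the firing set below is hypothetical):

```go
// Sketch of the proposed GC criterion; all identifiers are placeholders
// and nothing here exists in Signalilo today.
package main

import (
	"fmt"
	"time"
)

type icingaService struct {
	Name      string
	State     int       // 0 = OK, 2 = CRITICAL
	LastHeard time.Time // last time Signalilo touched this service
}

// shouldGC returns true for services that are still non-OK in Icinga,
// no longer correspond to any firing alert, and have not been updated
// for longer than a grace period.
func shouldGC(svc icingaService, firing map[string]bool, grace time.Duration) bool {
	return svc.State != 0 &&
		!firing[svc.Name] &&
		time.Since(svc.LastHeard) > grace
}

func main() {
	stale := icingaService{Name: "old_alert", State: 2, LastHeard: time.Now().Add(-48 * time.Hour)}
	firing := map[string]bool{"current_alert": true}
	fmt.Println(shouldGC(stale, firing, 24*time.Hour)) // true: candidate for GC
}
```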