This repository has been archived by the owner on Apr 22, 2020. It is now read-only.

zmon-worker connect to zmon-redis causes flaky uncatchable alert #276

Open

szuecs opened this issue Sep 29, 2017 · 4 comments
Labels
bug (user-reported issues that take more than one hour to resolve), investigate

Comments

szuecs (Contributor) commented Sep 29, 2017

During cluster updates, zmon-redis will be down for some time, and the zmon-workers do not tolerate this downtime. This triggers uncatchable exceptions, which don't provide value to us. We would like these alerts not to be triggered.

I hope I provided enough information.

alert

def alert():
    return value.get("pods", 0) > 1000

check

def check():
    try:
        return {
            "pods": len(kubernetes(namespace=None).pods()),
            "_use_scheduled_time": True,
        }
    except Exception as e:
        return {"exception": str(e), "_use_scheduled_time": True}

history

2017-09-29 13:42:44 | ALERT_ENTITY_STARTED | "kube-cluster[aws:537814120105:eu-central-1]" | {"td":131.41097402572632,"worker":"plocal.zmon-worker-2295089407-ggqbz","ts":1.506685224806833E9,"value":"Error 110 connecting to zmon-redis:6379. Connection timed out.","exc":1}

kubernetes pods

% kubectl get pods -n kube-system -l application=zmon-redis
NAME                          READY     STATUS    RESTARTS   AGE
zmon-redis-1546107048-9lzgg   1/1       Running   0          11m

% kubectl get pods -n kube-system -l application=zmon-worker
NAME                           READY     STATUS    RESTARTS   AGE
zmon-worker-2295089407-ggqbz   2/2       Running   0          46m
zmon-worker-2295089407-qp991   2/2       Running   0          7m
beverage commented

@szuecs Is this still an issue for you?

szuecs (Contributor, Author) commented Jul 26, 2018

@beverage did you fix it?
If not, I'm sure there is still a problem. @mohabusama might be the right person to answer this question.

mohabusama (Contributor) commented

This is still an issue.

szuecs (Contributor, Author) commented Nov 22, 2018

@mohabusama can't you just catch the exception and store it in some value, for example "exception", and pass it to the alert function?
Then nobody needs to wrap these check functions, and it can easily be handled by the alert.
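For illustration, with that suggestion in place an alert could look roughly like the sketch below. It assumes the worker would store the connection error under an "exception" key in the check value, which is the behaviour being requested here, not something the worker does today.

def alert():
    # Sketch of the proposed flow: the worker stores connection errors in the
    # value instead of raising, so the alert can simply skip those runs.
    if value.get("exception"):
        return False
    return value.get("pods", 0) > 1000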

pitr added the bug label on May 16, 2019

4 participants