Add monquorum alert #52

Merged
merged 4 commits into from
May 17, 2019

@devopsjonas devopsjonas commented Apr 24, 2019

Adds CephMonHighNumberOfLeaderChanges
Fixes CephMonQuorumAtRisk by adding a label selector, for the case where multiple clusters are monitored by the same Prometheus instance.

Example output:

"groups":
- "name": "quorum-alert.rules"
  "rules":
  - "alert": "CephMonQuorumAtRisk"
    "annotations":
      "description": "Storage cluster quorum is low. Contact Support."
      "message": "Storage quorum at risk"
      "severity_level": "error"
      "storage_type": "ceph"
    "expr": |
      count(ceph_mon_quorum_status{job="rook-ceph-mgr"} == 1) <= ((count(ceph_mon_metadata{job="rook-ceph-mgr"}) % 2) + 1)
    "for": "15m"
    "labels":
      "severity": "critical"
  - "alert": "CephMonHighNumberOfLeaderChanges"
    "annotations":
      "description": "Ceph Monitor \"{{ $labels.job }}\": instance {{ $labels.instance }} has seen {{ $value }} leader changes recently."
      "message": "Storage Cluster has seen many leader changes recently."
      "severity_level": "warning"
      "storage_type": "ceph"
    "expr": |
      rate(ceph_mon_num_elections{job="rook-ceph-mgr"}[15m]) > 3
    "for": "30m"
    "labels":
      "severity": "warning"

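To make the CephMonQuorumAtRisk threshold concrete, here is a minimal Python sketch that evaluates the same arithmetic as the rule's expression. The function name and the monitor counts are illustrative assumptions, not part of the PR.

```python
def quorum_at_risk(mons_in_quorum: int, mons_total: int) -> bool:
    """Mirror the PromQL comparison: count(in quorum) <= (count(all) % 2) + 1."""
    threshold = (mons_total % 2) + 1
    return mons_in_quorum <= threshold


# Hypothetical monitor counts, chosen only to exercise the arithmetic.
# Zero is skipped: PromQL's count() yields no sample for an empty vector,
# so the comparison would produce no result in that case.
for total in (3, 4, 5):
    holds_for = [n for n in range(1, total + 1) if quorum_at_risk(n, total)]
    print(f"{total} mons: condition holds when {holds_for} are in quorum")
```

With an odd number of monitors the threshold is 2, and with an even number it is 1.
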
-count(ceph_mon_quorum_status == 1) <= ((count(ceph_mon_metadata) %s 2) + 1)
-||| % '%',
+count(ceph_mon_quorum_status{%s} == 1) <= ((count(ceph_mon_metadata{%s}) %s 2) + 1)
+||| % [$._config.cephExporterSelector, $._config.cephExporterSelector, '%'],
Collaborator

Is this change made on the assumption that there will be multiple exporter jobs running at the same time?

Contributor Author

Yes, I think it makes sense: if you have 2 clusters monitored by the same Prometheus, the current version wouldn't work.
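
The two-cluster case can be sketched in Python. This is only an illustration under stated assumptions: the job names and sample values are hypothetical, and each monitor is assumed to contribute exactly one ceph_mon_quorum_status series and one ceph_mon_metadata series.

```python
# Hypothetical samples: (job label, 1 if the monitor is in quorum else 0).
SAMPLES = [
    ("cluster-a", 1), ("cluster-a", 1), ("cluster-a", 1),
    ("cluster-b", 1), ("cluster-b", 0), ("cluster-b", 0),
]


def at_risk(samples, job=None):
    """Evaluate count(in quorum) <= (count(all) % 2) + 1, optionally for one job."""
    selected = [s for s in samples if job is None or s[0] == job]
    in_quorum = sum(1 for _, status in selected if status == 1)
    return in_quorum <= (len(selected) % 2) + 1


print("no selector:", at_risk(SAMPLES))               # 4 <= 1
print("cluster-a:  ", at_risk(SAMPLES, "cluster-a"))  # 3 <= 2
print("cluster-b:  ", at_risk(SAMPLES, "cluster-b"))  # 1 <= 2
```

Without a selector the counts from both clusters are summed, so the condition does not hold even though the hypothetical cluster-b has only one monitor left in quorum. Restricted to cluster-b, it does.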

@devopsjonas
Contributor Author

@anmolsachan fixed. PTAL 🙂

@@ -7,8 +7,8 @@
{
alert: 'CephMonQuorumAtRisk',
Contributor

There is a similar alert already present at https://github.com/ceph/ceph-mixins/blob/master/alerts/monquorum.libsonnet
How is this one different from that?

Contributor Author

@devopsjonas devopsjonas May 7, 2019

@shtripat it's the same alert; I didn't change it, except for adding an exporter selector {%s}
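
Jsonnet's string formatting follows Python's `%` rules, so the effect of the added {%s} placeholders can be sketched in Python. The template is copied from the diff in this PR. The selector value is an assumption taken from the example output, and `render_expr` is a hypothetical helper name.

```python
# Template copied from the diff: the selector fills the first two %s slots,
# and the third %s receives a literal '%' for the PromQL modulo operator.
EXPR_TEMPLATE = (
    "count(ceph_mon_quorum_status{%s} == 1) <= "
    "((count(ceph_mon_metadata{%s}) %s 2) + 1)"
)


def render_expr(selector: str) -> str:
    """Substitute the exporter selector into both metric selectors."""
    return EXPR_TEMPLATE % (selector, selector, "%")


# Assumed selector value, matching the job label in the example output.
print(render_expr('job="rook-ceph-mgr"'))
```

Passing '%' as an argument keeps the modulo sign from being read as a format directive.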

Contributor

Ack

@umangachapagain
Collaborator

looks good to me.
Can you also run make all from project root and add the rule file generated?

@anmolsachan
Collaborator

@devopsjonas PR looks good to me now. Please attach the generated manifest. You can refer to merged PRs for reference.

@devopsjonas
Contributor Author

Generated manifest:

"groups":
- "name": "quorum-alert.rules"
  "rules":
  - "alert": "CephMonQuorumAtRisk"
    "annotations":
      "description": "Storage cluster quorum is low. Contact Support."
      "message": "Storage quorum at risk"
      "severity_level": "error"
      "storage_type": "ceph"
    "expr": |
      count(ceph_mon_quorum_status{job="rook-ceph-mgr"} == 1) <= ((count(ceph_mon_metadata{job="rook-ceph-mgr"}) % 2) + 1)
    "for": "15m"
    "labels":
      "severity": "critical"
  - "alert": "CephMonHighNumberOfLeaderChanges"
    "annotations":
      "description": "Ceph Monitor \"{{ $labels.job }}\": instance {{ $labels.instance }} has seen {{ $value }} leader changes recently."
      "message": "Storage Cluster has seen many leader changes recently."
      "severity_level": "warning"
      "storage_type": "ceph"
    "expr": |
      rate(ceph_mon_num_elections{job="rook-ceph-mgr"}[15m]) > 3
    "for": "30m"
    "labels":
      "severity": "warning"

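As a worked note on the CephMonHighNumberOfLeaderChanges expression: PromQL's rate() returns a per-second average over the range, so the `> 3` comparison is against elections per second. The Python sketch below approximates that with two counter samples. It is a simplification that ignores counter resets and the extrapolation Prometheus applies, and the counter readings are hypothetical.

```python
WINDOW_SECONDS = 15 * 60  # the [15m] range used in the rule


def simple_rate(first: float, last: float, window: float = WINDOW_SECONDS) -> float:
    """Per-second increase between two counter samples (no reset handling)."""
    return (last - first) / window


# Hypothetical counter readings at the start and end of the window.
print(simple_rate(100, 145))  # 45 elections over 15 minutes
print(3 * WINDOW_SECONDS)     # counter increase over 15 minutes that gives a rate of 3
```

Under this simplification, a rate of 3 corresponds to an increase of 2700 elections within the 15 minute window.
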
@devopsjonas
Contributor Author

@umangachapagain I've run:

make all
jsonnet -S lib/alerts.jsonnet > prometheus_alert_rules.yaml
find . -name 'vendor' -prune -o -name '*.libsonnet' -print -o -name '*.jsonnet' -print | \
        while read f; do \
                jsonnet fmt -n 2 --max-blank-lines 2 --string-style s --comment-style s "$f" | diff -u "$f" -; \
        done
promtool check rules prometheus_alert_rules.yaml
Checking prometheus_alert_rules.yaml
  SUCCESS: 11 rules found
git status
On branch add-monquorum-alert
Your branch is up to date with 'origin/add-monquorum-alert'.

nothing to commit, working tree clean

I believe .gitignore is making sure that we don't commit generated alerts.

@shtripat
Contributor

@devopsjonas You can run build.sh from the extras dir. That will generate a rule file in the manifests dir, which you can check in.

@devopsjonas
Contributor Author

Done, PTAL :)

Contributor

@shtripat shtripat left a comment

LGTM

@shtripat
Contributor

@devopsjonas a unit test mechanism was recently added to the project. I would expect you to add unit tests for the alerting rules as part of these PRs. Otherwise, both PRs look good to be merged.
Feel free to send a separate PR for the unit tests.

@devopsjonas
Contributor Author

@shtripat sure will do 👍

@shtripat
Contributor

Merging this now. Add another PR for unit tests.

@shtripat shtripat merged commit 47fb182 into ceph:master May 17, 2019
@devopsjonas devopsjonas deleted the add-monquorum-alert branch May 21, 2019 04:12