This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

Refactor severity to use monitor thresholds #17

Open
wants to merge 2 commits into
base: master
from

Conversation

@lngarrett lngarrett commented May 19, 2017

On our team, we wanted a way to centralize the management of both our monitors and the conditional notifications in the monitors' messages. This PR ties a team's severity notifications to the alert threshold. Instead of assigning a monitor a fixed severity, the message contains conditional blocks that list the appropriate notification channels for each alert threshold. I also put all notifications into an is_recovery block so that alerts auto-resolve as expected.

Nothing new is required in the config, and all fields are optional.

Example message:

{#is_warning}
 @slack-cloud-operations
 @slack-product-support
{/is_warning}
{#is_alert}
 @slack-cloud-operations
 @pagerduty-CloudOperations
 @slack-product-support
 @pagerduty-ProductSupport
{/is_alert}
{#is_recovery}
 @slack-cloud-operations
 @pagerduty-CloudOperations
 @slack-product-support
 @pagerduty-ProductSupport
{/is_recovery}
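
To illustrate how a message like the one above could be assembled, here is a minimal Python sketch. It is not the code in this PR: the function names `render_block` and `render_notifications` and the shape of the input mapping are hypothetical, and the rule that the recovery block holds every channel from the alert and warning lists is an assumption inferred from the example.

```python
def render_block(name, channels):
    """Wrap a list of channel handles in a {#name} ... {/name} conditional block."""
    lines = ["{#%s}" % name]
    lines += [" " + channel for channel in channels]
    lines.append("{/%s}" % name)
    return "\n".join(lines)


def render_notifications(channels_by_threshold):
    """Build the notification part of a monitor message.

    channels_by_threshold is a hypothetical mapping such as
    {"warning": [...], "alert": [...]}; both keys are optional.
    """
    blocks = []
    for threshold in ("warning", "alert"):
        channels = channels_by_threshold.get(threshold, [])
        if channels:
            blocks.append(render_block("is_%s" % threshold, channels))

    # Assumption: recovery notifies every channel that could have been
    # notified, de-duplicated while preserving order (alert channels first).
    recovery = list(dict.fromkeys(
        channels_by_threshold.get("alert", [])
        + channels_by_threshold.get("warning", [])
    ))
    if recovery:
        blocks.append(render_block("is_recovery", recovery))
    return "\n".join(blocks)


example = {
    "warning": ["@slack-cloud-operations", "@slack-product-support"],
    "alert": [
        "@slack-cloud-operations",
        "@pagerduty-CloudOperations",
        "@slack-product-support",
        "@pagerduty-ProductSupport",
    ],
}
message = render_notifications(example)
```

With this input, `message` contains the same three blocks as the example message. An empty mapping yields an empty string, which matches the statement that all fields are optional.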

Author

lngarrett commented May 19, 2017

After some experimentation, I think monitors would still benefit from being marked as critical or info. While my changes add functionality to centralize threshold logic, some monitors are simply not important enough to warrant paging an on-call engineer at any threshold. So, I think the full solution should involve both the original severity tagging and my functionality. I'm going to gauge interest in these changes before adding that, however.

I'm envisioning the teams config would look like this:

teams:
  eng:
    notifications:
      critical:
        alert:
        - '@hipchat-Engineering'
        - '@victorops-eng'
        warning:
        - '@hipchat-Engineering'
      info:
        alert:
        - '@hipchat-Engineering'

The idea here is that, for a monitor tagged critical, a warning would first notify the team chat channel so that engineers would see the issue during business hours. Then, if the monitor reaches the alert threshold, the on-call engineer would be paged. For monitors tagged info, however, we have decided to ignore warnings and only send a nonintrusive chat message when the monitor alerts.
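
As a rough sketch of how that lookup might work, the Python below resolves the channels for one monitor from the proposed config. This is only an illustration: the `channels_for` helper is hypothetical, the config is written as a plain dict rather than loaded from YAML, and treating a missing team, severity, or threshold as "notify nobody" is an assumption.

```python
# The proposed teams config from above, written as a plain dict.
TEAMS = {
    "eng": {
        "notifications": {
            "critical": {
                "alert": ["@hipchat-Engineering", "@victorops-eng"],
                "warning": ["@hipchat-Engineering"],
            },
            "info": {
                "alert": ["@hipchat-Engineering"],
            },
        }
    }
}


def channels_for(teams, team, severity):
    """Return the channels to notify at each threshold for one monitor.

    Assumption: any missing team, severity, or threshold resolves to an
    empty list, so nothing is required in the config.
    """
    notifications = teams.get(team, {}).get("notifications", {})
    by_threshold = notifications.get(severity, {})
    return {
        threshold: list(by_threshold.get(threshold, []))
        for threshold in ("warning", "alert")
    }


critical = channels_for(TEAMS, "eng", "critical")
info = channels_for(TEAMS, "eng", "info")
```

Here `critical["alert"]` includes the paging channel while `critical["warning"]` only has the chat channel, and `info["warning"]` is empty, so a warning on an info monitor notifies no one.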

@astropuffin

I'm also looking to get this for my team. Is there anything missing in order to merge this? It doesn't seem entirely backward compatible, but works MUCH better for our workflow.


2 participants