[SLO] Exclude stale slos from healthy count on overview #201027

@justinkambic justinkambic commented Nov 20, 2024

Summary

Resolves #198911.

The result is achieved by nesting a new filter agg inside the existing HEALTHY agg to remove any stale SLOs from the ultimate result.

This required modifying the code that parses the ES response so that it reads the new not_stale key. The original healthy total is still available in the doc_count of the parent healthy agg, but it is no longer referenced.

The filter for the not_stale agg I have added is the logical inverse of the filter we're using to determine stale SLOs:

{
  "range": {
    "summaryUpdatedAt": {
      "gte": "now-48h"
    }
  }
}
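
For reviewers who prefer to see the nesting in code, here is a minimal TypeScript sketch of how such an agg could be assembled. The helper name buildHealthyAgg and its staleThresholdHours parameter are hypothetical and are not taken from the Kibana source; only the status and summaryUpdatedAt fields, the not_stale key, and the 48h window come from this PR.

```typescript
// Hypothetical helper: the type and function names are illustrative
// assumptions, not the actual Kibana implementation.
interface RangeFilter {
  range: { summaryUpdatedAt: { gte?: string; lt?: string } };
}

interface HealthyAgg {
  filter: { term: { status: string } };
  aggs: { not_stale: { filter: RangeFilter } };
}

function buildHealthyAgg(staleThresholdHours: number = 48): HealthyAgg {
  return {
    // Outer filter: every HEALTHY SLO, stale or not.
    filter: { term: { status: 'HEALTHY' } },
    aggs: {
      // Inner filter: only HEALTHY SLOs whose summary was updated recently.
      not_stale: {
        filter: {
          range: { summaryUpdatedAt: { gte: `now-${staleThresholdHours}h` } },
        },
      },
    },
  };
}

const healthyAgg = buildHealthyAgg();
console.log(JSON.stringify(healthyAgg, null, 2));
```

The outer doc_count still reports all HEALTHY SLOs, so nothing is lost by nesting; the overview simply reads the inner count instead.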

Reviewer note: I also corrected the spelling of a UI component name (from OverViewItem to OverviewItem); this should be a completely transparent change.

Example

Before

This is my local running on main:

image

After

This is my local running on this PR branch:

image

Proof query works

You can replicate these results by including a similar agg on a query against SLO data. I added a terms agg to the stale agg to determine how many SLOs I need to remove. The number of HEALTHY SLOs showing up in stale should match the difference between the total doc_count from healthy and the doc_count in the not_stale sub-aggregation.

Query

You can run this example aggregation:

{
  "aggs": {
    "stale": {
      "filter": {
        "range": {
          "summaryUpdatedAt": {
            "lt": "now-48h"
          }
        }
      },
      "aggs": {
        "by_status": {
          "terms": {
            "field": "status"
          }
        }
      }
    },
    "healthy": {
      "filter": {
        "term": {
          "status": "HEALTHY"
        }
      },
      "aggs": {
        "not_stale": {
          "filter": {
            "range": {
              "summaryUpdatedAt": {
                "gte": "now-48h"
              }
            }
          }
        }
      }
    }
  }
}

Relevant output

Here's a subset of my example query output. You can see that stale.by_status.buckets[1] (the HEALTHY bucket) contains 2 docs, which equals the difference between healthy.doc_count (9) and healthy.not_stale.doc_count (7).

{
  "stale": {
    "doc_count": 7,
    "by_status": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "VIOLATED",
          "doc_count": 5
        },
        {
          "key": "HEALTHY",
          "doc_count": 2
        }
      ]
    }
  },
  "healthy": {
    "doc_count": 9,
    "not_stale": {
      "doc_count": 7
    }
  }
}
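
The same check can be expressed as a small TypeScript sketch. The response shape mirrors the example output above; the type name OverviewAggs and the functions staleHealthyCount and healthyCountsAgree are hypothetical names introduced only for illustration.

```typescript
// Shape copied from the example output; names are illustrative assumptions.
interface Bucket {
  key: string;
  doc_count: number;
}

interface OverviewAggs {
  stale: { doc_count: number; by_status: { buckets: Bucket[] } };
  healthy: { doc_count: number; not_stale: { doc_count: number } };
}

function staleHealthyCount(aggs: OverviewAggs): number {
  // Look the bucket up by key instead of by index: a terms agg orders
  // buckets by doc_count by default, so the position can change.
  const bucket = aggs.stale.by_status.buckets.find((b) => b.key === 'HEALTHY');
  return bucket?.doc_count ?? 0;
}

function healthyCountsAgree(aggs: OverviewAggs): boolean {
  // HEALTHY SLOs removed by the not_stale filter should equal the
  // HEALTHY SLOs counted under the stale agg.
  const removed = aggs.healthy.doc_count - aggs.healthy.not_stale.doc_count;
  return removed === staleHealthyCount(aggs);
}

const exampleAggs: OverviewAggs = {
  stale: {
    doc_count: 7,
    by_status: {
      buckets: [
        { key: 'VIOLATED', doc_count: 5 },
        { key: 'HEALTHY', doc_count: 2 },
      ],
    },
  },
  healthy: { doc_count: 9, not_stale: { doc_count: 7 } },
};

console.log(healthyCountsAgree(exampleAggs)); // true, since 9 - 7 === 2
```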

@justinkambic justinkambic added release_note:enhancement v9.0.0 backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) Team:obs-ux-management Observability Management User Experience Team v8.17.0 labels Nov 20, 2024
@justinkambic justinkambic self-assigned this Nov 20, 2024
@justinkambic justinkambic requested a review from a team as a code owner November 20, 2024 19:08
@elasticmachine

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@botelastic botelastic bot added the ci:project-deploy-observability Create an Observability project label Nov 20, 2024

🤖 GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@kdelemme kdelemme self-assigned this Nov 20, 2024
@@ -133,7 +144,7 @@ export class GetSLOsOverview {
     return {
       violated: aggs?.violated.doc_count ?? 0,
       degrading: aggs?.degrading.doc_count ?? 0,
-      healthy: aggs?.healthy.doc_count ?? 0,
+      healthy: aggs?.healthy?.not_stale?.doc_count ?? 0,

@shahzad31 shahzad31 Nov 21, 2024

Now that I think about it, the same stale count should be subtracted from the degrading and violated SLOs as well.

Contributor Author

We talked about this offline and agreed we can apply this filtering further up rather than adding sub-aggregations for all of the non-stale filters. I'll ping when this is done.
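
One possible reading of "further up" is a single parent not_stale filter agg that wraps the per-status filters, so each status does not need its own sub-aggregation. The TypeScript sketch below illustrates that shape only; it is an assumption and not necessarily what was merged, and the function name buildNotStaleStatusAggs and the lowercase agg keys are hypothetical.

```typescript
// Hypothetical sketch: one parent filter applied once, with plain
// per-status filters nested beneath it. Not the actual Kibana code.
type Status = 'VIOLATED' | 'DEGRADING' | 'HEALTHY';

interface StatusAgg {
  filter: { term: { status: Status } };
}

function buildNotStaleStatusAggs(statuses: Status[]) {
  const perStatus: Record<string, StatusAgg> = {};
  for (const status of statuses) {
    // Key each agg by the lowercase status name, e.g. "healthy".
    perStatus[status.toLowerCase()] = { filter: { term: { status } } };
  }
  return {
    not_stale: {
      // The staleness window is stated once, at the parent level.
      filter: { range: { summaryUpdatedAt: { gte: 'now-48h' } } },
      aggs: perStatus,
    },
  };
}

const overviewAggs = buildNotStaleStatusAggs(['VIOLATED', 'DEGRADING', 'HEALTHY']);
console.log(Object.keys(overviewAggs.not_stale.aggs));
```

With this shape every status count is read from under not_stale, so the violated and degrading totals exclude stale SLOs in the same way as the healthy total.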

@justinkambic justinkambic force-pushed the 198911/exclude-stale-slos-from-healthy-count-on-overview branch from 1bfa7e7 to 1186417 Compare November 22, 2024 18:25
@@ -9,7 +9,7 @@ import { EuiFlexItem, EuiStat, EuiToolTip } from '@elastic/eui';
 import React from 'react';
 import { useUrlSearchState } from '../../hooks/use_url_search_state';
 
-export function OverViewItem({
+export function OverviewItem({

👍🏻


@kdelemme kdelemme left a comment

I think we need to keep using the settings. Also, can we double-check the usage of worst? I have a feeling it is unused and could be removed 🔪
Otherwise, looks good to me.

@justinkambic justinkambic force-pushed the 198911/exclude-stale-slos-from-healthy-count-on-overview branch from 060ebcc to 566e2ac Compare November 25, 2024 17:43
@justinkambic justinkambic force-pushed the 198911/exclude-stale-slos-from-healthy-count-on-overview branch from 566e2ac to fcd65bd Compare November 26, 2024 14:07

@kdelemme kdelemme left a comment

Just one thing to clean up, but otherwise 👍🏻

@justinkambic justinkambic enabled auto-merge (squash) November 26, 2024 14:29
@elasticmachine

elasticmachine commented Nov 26, 2024

💛 Build succeeded, but was flaky

  • Buildkite Build
  • Commit: 92033ef
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-201027-92033efbd663

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #9 / management Index patterns on aliases discover verify hits should be able to discover and verify no of hits for alias2

Metrics [docs]

✅ unchanged

History

cc @kdelemme @justinkambic

@justinkambic justinkambic merged commit a92103b into elastic:main Nov 26, 2024
26 checks passed
@kibanamachine

Starting backport for target branches: 8.17, 8.x

https://github.com/elastic/kibana/actions/runs/12034819890

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 26, 2024
(cherry picked from commit a92103b)
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 26, 2024
(cherry picked from commit a92103b)
@kibanamachine

💚 All backports created successfully

Status Branch Result
8.17
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Nov 26, 2024
…) (#201830)

# Backport

This will backport the following commits from `main` to `8.17`:
- [[SLO] Exclude stale slos from healthy count on overview
(#201027)](#201027)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Justin Kambic <[email protected]>
kibanamachine added a commit that referenced this pull request Nov 26, 2024
… (#201831)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[SLO] Exclude stale slos from healthy count on overview
(#201027)](#201027)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Justin Kambic <[email protected]>
paulinashakirova pushed a commit to paulinashakirova/kibana that referenced this pull request Nov 26, 2024
## Summary

Resolves elastic#198911.

The result is achieved by nesting a new filter agg inside the existing
`HEALTHY` agg to remove any stale SLOs from the ultimate result.

This required a modification of the parsing code on the ES response to
include a new `not_stale` key. The original `success` total is still
available in the `doc_count` of the outer `HEALTHY` agg, but is no longer
referenced.
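
As a rough illustration of the parsing change described above, here is a minimal sketch that reads the healthy count from the nested `not_stale` bucket rather than the outer `doc_count`. The `HealthyAggSlice` type and the `getHealthyCount` helper are hypothetical names used only for this example. They are not taken from the Kibana code base.

```typescript
// Hypothetical shape of the relevant slice of the ES aggregation response.
interface HealthyAggSlice {
  doc_count: number; // every HEALTHY SLO, stale or not
  not_stale?: { doc_count: number }; // HEALTHY SLOs updated within the last 48h
}

// Illustrative helper (an assumption, not the PR's actual code): prefer the
// nested `not_stale` count and fall back to 0 when the bucket is absent.
function getHealthyCount(healthy: HealthyAggSlice | undefined): number {
  return healthy?.not_stale?.doc_count ?? 0;
}

const exampleHealthy: HealthyAggSlice = { doc_count: 9, not_stale: { doc_count: 7 } };
console.log(getHealthyCount(exampleHealthy)); // 7
```

The outer `doc_count` stays on the response, so nothing is lost. It simply stops feeding the number shown on the overview.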

The filter for the `not_stale` agg I have added is the logical inverse
of the filter we're using to determine stale SLOs:

```json
{
  "range": {
    "summaryUpdatedAt": {
      "gte": "now-48h"
    }
  }
}
```
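
To make the "logical inverse" relationship concrete, the following sketch expresses both range filters as plain predicates over epoch-millisecond timestamps. The function names and the `STALE_THRESHOLD_MS` constant are assumptions for illustration. Only the 48-hour window and the `lt` / `gte` operators come from the filters shown in this description.

```typescript
// 48 hours in milliseconds, mirroring the `now-48h` date math in the filters.
const STALE_THRESHOLD_MS = 48 * 60 * 60 * 1000;

// Mirrors { range: { summaryUpdatedAt: { lt: "now-48h" } } }
function matchesStaleFilter(summaryUpdatedAt: number, now: number): boolean {
  return summaryUpdatedAt < now - STALE_THRESHOLD_MS;
}

// Mirrors { range: { summaryUpdatedAt: { gte: "now-48h" } } }
function matchesNotStaleFilter(summaryUpdatedAt: number, now: number): boolean {
  return summaryUpdatedAt >= now - STALE_THRESHOLD_MS;
}

// For any timestamp, exactly one of the two predicates holds.
const nowMs = Date.UTC(2024, 10, 20);
const boundary = nowMs - STALE_THRESHOLD_MS;
console.log(matchesStaleFilter(boundary, nowMs), matchesNotStaleFilter(boundary, nowMs)); // false true
```

A summary updated exactly 48 hours ago falls on the `gte` side, so it counts as not stale.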

_Reviewer note: I also changed the spelling of a UI component, which
should be a completely transparent change._

## Example

### Before

This is my local running on `main`:

<img width="1116" alt="image"
src="https://github.com/user-attachments/assets/80f86426-c7f1-4847-830f-a311c865a225">


### After

This is my local running on this PR branch:

<img width="1120" alt="image"
src="https://github.com/user-attachments/assets/2c4c4f26-2407-41ca-bf01-9ca730bbfab2">


### Proof the query works

You can replicate these results by including a similar agg on a query
against SLO data. I added a terms agg to the `stale` agg to determine
how many SLOs I need to remove. The number of `HEALTHY` SLOs showing up
in `stale` should match the difference between the total `doc_count`
from `healthy` and the `doc_count` in the `not_stale` sub-aggregation.

#### Query

You can run these example aggs:

```json
{
  "aggs": {
    "stale": {
      "filter": {
        "range": {
          "summaryUpdatedAt": {
            "lt": "now-48h"
          }
        }
      },
      "aggs": {
        "by_status": {
          "terms": {
            "field": "status"
          }
        }
      }
    },
    "healthy": {
      "filter": {
        "term": {
          "status": "HEALTHY"
        }
      },
      "aggs": {
        "not_stale": {
          "filter": {
            "range": {
              "summaryUpdatedAt": {
                "gte": "now-48h"
              }
            }
          }
        }
      }
    }
  }
}
```

#### Relevant output

Here's a subset of my example query output. You can see that
`stale.by_status.buckets[1]` contains a total of 2 docs, which is the
difference between `healthy.doc_count` (9) and
`healthy.not_stale.doc_count` (7).

```json
{
  "stale": {
    "doc_count": 7,
    "by_status": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "VIOLATED",
          "doc_count": 5
        },
        {
          "key": "HEALTHY",
          "doc_count": 2
        }
      ]
    }
  },
  "healthy": {
    "doc_count": 9,
    "not_stale": {
      "doc_count": 7
    }
  }
}
```
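
The consistency check described above can also be scripted. This is a minimal sketch that assumes the response has the shape shown in the output. The `ProofAggs` type and the `staleHealthyMatchesDifference` function are hypothetical names introduced for this example.

```typescript
interface StatusBucket {
  key: string;
  doc_count: number;
}

// Assumed shape, copied from the example output in this description.
interface ProofAggs {
  stale: { doc_count: number; by_status: { buckets: StatusBucket[] } };
  healthy: { doc_count: number; not_stale: { doc_count: number } };
}

// True when the number of stale HEALTHY SLOs equals the number of docs the
// nested `not_stale` filter removed from the healthy total.
function staleHealthyMatchesDifference(aggs: ProofAggs): boolean {
  const staleHealthy =
    aggs.stale.by_status.buckets.find((b) => b.key === "HEALTHY")?.doc_count ?? 0;
  const removed = aggs.healthy.doc_count - aggs.healthy.not_stale.doc_count;
  return staleHealthy === removed;
}

const proofExample: ProofAggs = {
  stale: {
    doc_count: 7,
    by_status: {
      buckets: [
        { key: "VIOLATED", doc_count: 5 },
        { key: "HEALTHY", doc_count: 2 },
      ],
    },
  },
  healthy: { doc_count: 9, not_stale: { doc_count: 7 } },
};

console.log(staleHealthyMatchesDifference(proofExample)); // true
```

Looking the bucket up by `key` instead of by index keeps the check valid if the terms agg returns the statuses in a different order.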
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Dec 12, 2024
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) ci:project-deploy-observability Create an Observability project release_note:enhancement Team:obs-ux-management Observability Management User Experience Team v8.17.0 v8.18.0 v9.0.0
Development

Successfully merging this pull request may close these issues.

[SLOs] Subtract out stale SLO count from healthy slos in overview !!
5 participants