[SLO] Exclude stale slos from healthy count on overview #201027

justinkambic · 2024-11-20T19:08:05Z

Summary

Resolves #198911.

The result is achieved by nesting a new filter agg inside the existing HEALTHY agg to remove any stale SLOs from the ultimate result.

This required a modification of the parsing code on the ES response to include a new not_stale key. The original success total is preserved in the doc_count of that agg, but is no longer referenced.

The filter for the not_stale agg I have added is the logical inverse of the filter we're using to determine stale SLOs:

{
  "range": {
    "summaryUpdatedAt": {
      "gte": "now-48h"
    }
  }
}
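The relationship between the two filters can be sketched in TypeScript. This is a minimal illustration, not the code from this PR: the helper names staleFilter and notStaleFilter and the thresholdHours parameter are assumptions, and 48 is simply the value used in the example above.

```typescript
// Hypothetical helpers for illustration only; names are not from the PR.
type RangeFilter = {
  range: { summaryUpdatedAt: { lt?: string; gte?: string } };
};

// Matches SLO summaries last updated before the threshold (stale).
function staleFilter(thresholdHours: number): RangeFilter {
  return { range: { summaryUpdatedAt: { lt: `now-${thresholdHours}h` } } };
}

// Matches SLO summaries updated at or after the threshold (not stale).
function notStaleFilter(thresholdHours: number): RangeFilter {
  return { range: { summaryUpdatedAt: { gte: `now-${thresholdHours}h` } } };
}

console.log(JSON.stringify(notStaleFilter(48)));
```

Note that lt and gte are only exact complements for documents that actually have a summaryUpdatedAt value; a document missing the field matches neither range.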

Reviewer note: I also renamed a UI component (OverViewItem is now OverviewItem). This should be a completely transparent change.

Example

Before

This is my local running on main:

After

This is my local running on this PR branch:

Proof query works

You can replicate these results by including a similar agg on a query against SLO data. I added a terms agg to the stale agg to determine how many SLOs I need to remove. The number of HEALTHY SLOs showing up in stale should match the difference between the total doc_count from healthy and the doc_count in the not_stale sub-aggregation.

Query

You can run these example aggs:

{
  "aggs": {
    "stale": {
      "filter": {
        "range": {
          "summaryUpdatedAt": {
            "lt": "now-48h"
          }
        }
      },
      "aggs": {
        "by_status": {
          "terms": {
            "field": "status"
          }
        }
      }
    },
    "healthy": {
      "filter": {
        "term": {
          "status": "HEALTHY"
        }
      },
      "aggs": {
        "not_stale": {
          "filter": {
            "range": {
              "summaryUpdatedAt": {
                "gte": "now-48h"
              }
            }
          }
        }
      }
    }
  }
}

Relevant output

Here's a subset of my example query output. You can see that stale.by_status.buckets[1] contains a total of 2 docs, which is the difference between healthy.doc_count and healthy.not_stale.doc_count.

{
  "stale": {
    "doc_count": 7,
    "by_status": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "VIOLATED",
          "doc_count": 5
        },
        {
          "key": "HEALTHY",
          "doc_count": 2
        }
      ]
    }
  },
  "healthy": {
    "doc_count": 9,
    "not_stale": {
      "doc_count": 7
    }
  }
}
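The same check can be written as a small TypeScript sketch over the response above. The OverviewAggs type and the staleHealthyMatches function are hypothetical names introduced for illustration; the bucket is looked up by key rather than by index because terms buckets are ordered by doc_count and the position of HEALTHY may differ between runs.

```typescript
// Hypothetical shape of the relevant part of the response; illustration only.
interface OverviewAggs {
  stale: {
    doc_count: number;
    by_status: { buckets: Array<{ key: string; doc_count: number }> };
  };
  healthy: { doc_count: number; not_stale: { doc_count: number } };
}

// True when the number of stale HEALTHY SLOs equals the number of docs
// that the not_stale sub-aggregation removed from the healthy total.
function staleHealthyMatches(aggs: OverviewAggs): boolean {
  const staleHealthy =
    aggs.stale.by_status.buckets.find((b) => b.key === 'HEALTHY')?.doc_count ?? 0;
  const removed = aggs.healthy.doc_count - aggs.healthy.not_stale.doc_count;
  return staleHealthy === removed;
}

// Values copied from the example output above.
const sample: OverviewAggs = {
  stale: {
    doc_count: 7,
    by_status: {
      buckets: [
        { key: 'VIOLATED', doc_count: 5 },
        { key: 'HEALTHY', doc_count: 2 },
      ],
    },
  },
  healthy: { doc_count: 9, not_stale: { doc_count: 7 } },
};

console.log(staleHealthyMatches(sample)); // 2 === 9 - 7, prints true
```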

elasticmachine · 2024-11-20T19:08:07Z

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

github-actions · 2024-11-20T19:08:19Z

🤖 GitHub comments

Just comment with:

/oblt-deploy : Deploy a Kibana instance using the Observability test environments.
run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

shahzad31 · 2024-11-21T17:35:24Z

x-pack/plugins/observability_solution/slo/server/services/get_slos_overview.ts

@@ -133,7 +144,7 @@ export class GetSLOsOverview {
    return {
      violated: aggs?.violated.doc_count ?? 0,
      degrading: aggs?.degrading.doc_count ?? 0,
-      healthy: aggs?.healthy.doc_count ?? 0,
+      healthy: aggs?.healthy?.not_stale?.doc_count ?? 0,


Now that I think about it, the same should be subtracted from degrading and violated SLOs.

justinkambic

We talked about this offline and agreed we can apply this filtering further up rather than adding sub-aggregations for all of the non-stale filters. I'll ping when this is done.
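One way to apply the not-stale condition to every status count without a nested sub-aggregation is to fold it into each status filter. The sketch below is an assumption about what that could look like, not the code that was merged: statusAgg, overviewAggs and thresholdHours are hypothetical names, and only the violated, degrading, healthy and stale keys come from this PR.

```typescript
type Status = 'HEALTHY' | 'DEGRADING' | 'VIOLATED';

// Hypothetical helper: a filter agg that counts SLOs with the given status
// whose summary was updated within the threshold (i.e. not stale).
function statusAgg(status: Status, thresholdHours: number) {
  return {
    filter: {
      bool: {
        filter: [
          { term: { status } },
          { range: { summaryUpdatedAt: { gte: `now-${thresholdHours}h` } } },
        ],
      },
    },
  };
}

// Hypothetical builder: the stale agg keeps its own range filter, while every
// status agg excludes stale SLOs.
function overviewAggs(thresholdHours: number) {
  return {
    stale: {
      filter: { range: { summaryUpdatedAt: { lt: `now-${thresholdHours}h` } } },
    },
    violated: statusAgg('VIOLATED', thresholdHours),
    degrading: statusAgg('DEGRADING', thresholdHours),
    healthy: statusAgg('HEALTHY', thresholdHours),
  };
}

console.log(JSON.stringify(overviewAggs(48), null, 2));
```

With aggs shaped like this, the doc_count of each status agg already excludes stale SLOs, so the response parsing would not need a not_stale key.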

kdelemme · 2024-11-25T15:40:16Z

...gins/observability_solution/slo/public/pages/slos/components/slos_overview/overview_item.tsx

@@ -9,7 +9,7 @@ import { EuiFlexItem, EuiStat, EuiToolTip } from '@elastic/eui';
 import React from 'react';
 import { useUrlSearchState } from '../../hooks/use_url_search_state';

-export function OverViewItem({
+export function OverviewItem({


kdelemme

I think we need to keep using the settings. Also, can we double-check the usage of worst? I have the feeling it is not used and could be 🔪
Otherwise, looks good to me.

x-pack/plugins/observability_solution/slo/server/services/get_slos_overview.ts

kdelemme

Just one thing to clean up, but otherwise 👍🏻

elasticmachine · 2024-11-26T15:41:29Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 92033ef
Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-201027-92033efbd663

Failed CI Steps

FTR Configs #9

Test Failures

[job] [logs] FTR Configs #9 / management Index patterns on aliases discover verify hits should be able to discover and verify no of hits for alias2

Metrics [docs]

✅ unchanged

History

💔 Build #254175 failed 566e2ac
💚 Build #254089 succeeded 060ebcc
💚 Build #253710 succeeded 1186417
💚 Build #253250 succeeded 1bfa7e7
💔 Build #252909 failed 5ad2aa9

cc @kdelemme @justinkambic

kibanamachine · 2024-11-26T16:23:40Z

Starting backport for target branches: 8.17, 8.x

https://github.com/elastic/kibana/actions/runs/12034819890

kibanamachine · 2024-11-26T16:29:26Z

💚 All backports created successfully

Status	Branch	Result
✅	8.17
✅	8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

…) (#201830)

Backport

This will backport the following commits from main to 8.17:
- [SLO] Exclude stale slos from healthy count on overview (#201027)

Questions? Please refer to the Backport tool documentation: https://github.com/sqren/backport

Co-authored-by: Justin Kambic

… (#201831)

Backport

This will backport the following commits from main to 8.x:
- [SLO] Exclude stale slos from healthy count on overview (#201027)

Questions? Please refer to the Backport tool documentation: https://github.com/sqren/backport

Co-authored-by: Justin Kambic

justinkambic added release_note:enhancement v9.0.0 backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) Team:obs-ux-management Observability Management User Experience Team v8.17.0 labels Nov 20, 2024

justinkambic self-assigned this Nov 20, 2024

justinkambic requested a review from a team as a code owner November 20, 2024 19:08

botelastic bot added the ci:project-deploy-observability Create an Observability project label Nov 20, 2024

kdelemme self-assigned this Nov 20, 2024

shahzad31 reviewed Nov 21, 2024

justinkambic force-pushed the 198911/exclude-stale-slos-from-healthy-count-on-overview branch from 1bfa7e7 to 1186417 Compare November 22, 2024 18:25

justinkambic requested a review from shahzad31 November 22, 2024 18:26

kdelemme reviewed Nov 25, 2024

justinkambic force-pushed the 198911/exclude-stale-slos-from-healthy-count-on-overview branch from 060ebcc to 566e2ac Compare November 25, 2024 17:43

justinkambic added 5 commits November 26, 2024 09:05

Rename <OverViewItem /> to <OverviewItem />

b5736c3

Add filter agg to healthy SLO agg to remove stale SLOs from doc_count.

fa3671e

Apply not_stale filter to all sub-filters, except stale.

e62cf58

Remove unused filter.

a306667

Fix regression that ignores stale SLO settings.

fcd65bd

justinkambic force-pushed the 198911/exclude-stale-slos-from-healthy-count-on-overview branch from 566e2ac to fcd65bd Compare November 26, 2024 14:07

justinkambic requested a review from kdelemme November 26, 2024 14:09

kdelemme reviewed Nov 26, 2024

kdelemme approved these changes Nov 26, 2024

Remove unused type and key from API response.

92033ef

justinkambic enabled auto-merge (squash) November 26, 2024 14:29

justinkambic merged commit a92103b into elastic:main Nov 26, 2024
26 checks passed

kibanamachine mentioned this pull request Nov 26, 2024

[8.17] [SLO] Exclude stale slos from healthy count on overview (#201027) #201830

Merged

kibanamachine mentioned this pull request Nov 26, 2024

[8.x] [SLO] Exclude stale slos from healthy count on overview (#201027) #201831

Merged

kdelemme added the v8.18.0 label Nov 26, 2024
