[SLO] Account for the built-in delay for burn rate alerting #169011

simianhacker · 2023-10-16T17:29:22Z

Summary

This PR introduces a delay based on:

The interval of the date histogram which defaults to 1m
The sync delay of the transform which defaults to 1m
The frequency of the transform which defaults to 1m

On average the SLO data is about 180s behind for occurrences (60s + 60s + 60s with the defaults). If the user uses 5m time slices then the delay is around 420s (300s + 60s + 60s). This PR attempts to mitigate this delay by subtracting interval + syncDelay + frequency from now in every burn rate calculation, so that the last 5 minutes is aligned with the data. The visualization included in the PR description shows how much delay we are seeing in an optimal environment.
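The subtraction described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the `DelaySettings` shape and the `delayInSeconds` and `alignedNow` names are assumptions made for the example; only the three inputs and their `1m` defaults come from the description.

```typescript
// Hypothetical shape for illustration only; the real SLO type in Kibana differs.
interface DelaySettings {
  intervalSeconds: number; // date histogram interval (1m by default, 5m for 5m time slices)
  syncDelaySeconds: number; // transform sync delay (1m by default)
  frequencySeconds: number; // transform frequency (1m by default)
}

const DEFAULT_SETTINGS: DelaySettings = {
  intervalSeconds: 60,
  syncDelaySeconds: 60,
  frequencySeconds: 60,
};

// Best-case delay: interval + syncDelay + frequency.
function delayInSeconds(settings: DelaySettings = DEFAULT_SETTINGS): number {
  return settings.intervalSeconds + settings.syncDelaySeconds + settings.frequencySeconds;
}

// Shift "now" back by the delay so that lookback windows end where the data ends.
function alignedNow(now: Date, settings: DelaySettings = DEFAULT_SETTINGS): Date {
  return new Date(now.getTime() - delayInSeconds(settings) * 1000);
}
```

With the defaults, `delayInSeconds()` evaluates to 180, and with `intervalSeconds: 300` it evaluates to 420, matching the figures quoted above.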

Using interval + syncDelay + frequency accounts for the "best case scenario". Due to the nature of the transform system, the delay varies from a best case of 180s for occurrences to a worst case of around 240s, which happens right before the next scheduled query; the transform query runs every 60s, which accounts for the variation between the worst and best case delay. Since the rules run on a separate schedule, it's hard to know where we are in the 60s cycle, so the best we can do is account for the "best case".
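To make the best-case versus worst-case reasoning concrete, here is a small sketch. It assumes a simplified model in which the observed delay equals the best-case delay plus the time elapsed since the transform last ran; the function names and this model are illustrative assumptions, not part of the PR.

```typescript
// Assumed model: observed delay = best-case delay + seconds elapsed since the
// transform query last ran. The transform runs every `frequencySeconds`.
function observedDelaySeconds(
  bestCaseSeconds: number,
  frequencySeconds: number,
  secondsSinceLastTransformRun: number
): number {
  if (secondsSinceLastTransformRun < 0 || secondsSinceLastTransformRun >= frequencySeconds) {
    throw new RangeError("secondsSinceLastTransformRun must be in [0, frequencySeconds)");
  }
  return bestCaseSeconds + secondsSinceLastTransformRun;
}

// Lag that remains uncorrected when the rule subtracts only the best-case delay.
function residualLagSeconds(bestCaseSeconds: number, observedSeconds: number): number {
  return observedSeconds - bestCaseSeconds;
}

const justAfterRun = observedDelaySeconds(180, 60, 0); // best case
const justBeforeNextRun = observedDelaySeconds(180, 60, 59); // close to the worst case
```

Under this model the rule cannot tell which point of the 60s cycle it is in, so subtracting the best-case 180s leaves a residual lag of anywhere from 0s to just under 60s.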

This PR also fixes #168747

Note to the reviewer

The changes made to evaluate.ts and build_query.ts look more extensive than they really are. I felt like #168735 made some unnecessary refactors when a minimal change that left the rest of the code alone would have been less risky. This also caused several issues during the merge, which is why I ultimately decided to revert the changes from #168735.

apmmachine · 2023-10-16T17:29:38Z

🤖 GitHub comments


Just comment with:

/oblt-deploy : Deploy a Kibana instance using the Observability test environments.
/oblt-deploy-serverless : Deploy a serverless Kibana instance using the Observability test environments.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine · 2023-10-16T22:11:44Z

Pinging @elastic/actionable-observability (Team: Actionable Observability)

kdelemme

Everything looks good to me. I just have one suggestion, about the SLI client and the startedAt param that was added for testing purposes.
I'm going to run it locally with some timeslice SLOs now.

x-pack/plugins/observability/public/utils/slo/get_delay_in_seconds_from_slo.ts

x-pack/plugins/observability/public/components/slo/error_rate_chart/use_lens_definition.ts

x-pack/plugins/observability/server/domain/services/get_delay_in_seconds_from_slo.ts

x-pack/plugins/observability/server/lib/rules/slo_burn_rate/executor.ts

x-pack/plugins/observability/server/services/slo/sli_client.ts

x-pack/plugins/observability/server/services/slo/sli_client.test.ts

kdelemme

Tested locally and it works as expected 👍🏻
Well done with the refactoring of the burn rate executor 👍🏻

kibana-ci · 2023-10-19T17:06:16Z

💚 Build Succeeded

Buildkite Build
Commit: 5fabc2a

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`observability`	486	487	+1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`observability`	1.1MB	1.1MB	+611.0B

History

💔 Build #169524 failed 6d9a6b1
💔 Build #169480 failed 14ed128
💛 Build #169281 was flaky 015241e
💚 Build #168304 succeeded 3b5bf3c

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

kibanamachine · 2023-10-19T20:15:04Z

💔 All backports failed

Status	Branch	Result
❌	8.11	Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 169011

Questions ?

Please refer to the Backport tool documentation

simianhacker · 2023-10-25T14:54:01Z

💚 All backports created successfully

Status	Branch	Result
✅	8.11

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

…169011)

## Summary

This PR introduces a delay based on:

- The interval of the date histogram which defaults to `1m`
- The sync delay of the transform which defaults to `1m`
- The frequency of the transform which defaults to `1m`

On average the SLO data is about `180s` behind for occurrences. If the user uses `5m` time slices then the delay is around `420s`. This PR attempts to mitigate this delay by subtracting the `interval + syncDelay + frequency` from `now` on any calculation for the burn rate so that the last 5 minutes is aligned with the data. Below is a visualization that shows how much delay we are seeing in an optimal environment.

<img width="884" alt="image" src="https://github.com/elastic/kibana/assets/41702/2a1587cd-c789-403c-97e2-f48c65db2b89">

Using `interval + syncDelay + frequency` accounts for the "best case scenario". Due to the nature of the transform system, the delay varies from a best case of `180s` for occurrences to a worst case of around `240s`, which happens right before the next scheduled query; the transform query runs every `60s`, which accounts for the variation between the worst and best case delay. Since the rules run on a separate schedule, it's hard to know where we are in the `60s` cycle, so the best we can do is account for the "best case".

This PR also fixes elastic#168747

### Note to the reviewer

The changes made to `evaluate.ts` and `build_query.ts` look more extensive than they really are. I felt like elastic#168735 made some unnecessary refactors when they simply could have done a minimal change and left the rest of the code alone; it would have been less risky. This also caused several issues during the merge, which is why I ultimately decided to revert the changes from elastic#168735.

(cherry picked from commit dce8eed)

# Conflicts:
# x-pack/plugins/observability/server/lib/rules/slo_burn_rate/executor.ts
# x-pack/plugins/observability/server/lib/rules/slo_burn_rate/lib/build_query.test.ts
# x-pack/plugins/observability/server/lib/rules/slo_burn_rate/lib/build_query.ts

…169011) (#169843)

# Backport

This will backport the following commits from `main` to `8.11`:

- [[SLO] Account for the built-in delay for burn rate alerting (#169011)](#169011)

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

[SLO] Account for the built-in delay for burn rate alerting

b487618

Updating tests to reflect the changes by adding the delay

3b5bf3c

simianhacker marked this pull request as ready for review October 16, 2023 22:10

simianhacker requested a review from a team as a code owner October 16, 2023 22:10

simianhacker added 7 commits October 18, 2023 16:29

Changing from dateMath to absolute timestamps

ba39f70

Merge branch 'main' of github.com:elastic/kibana into delay-from-slo

9d3b8ec

Moving alertWithLifecycle to before getAlertUuid

1e8b956

Fixing executor test

04bdb34

Removing extra line

59f7c94

Removing unnecessary clone

015241e

Adding comment

14ed128

kdelemme reviewed Oct 19, 2023


simianhacker added 2 commits October 19, 2023 08:55

Merge branch 'main' of github.com:elastic/kibana into delay-from-slo

ce24b83

Use fake timers for sli_client test

6d9a6b1

kdelemme approved these changes Oct 19, 2023


fixing test

5fabc2a

simianhacker merged commit dce8eed into elastic:main Oct 19, 2023

simianhacker mentioned this pull request Oct 25, 2023

[8.11] [SLO] Account for the built-in delay for burn rate alerting (#169011) #169843

Merged

kibanamachine added the v8.11.0 label Oct 25, 2023

simianhacker added the Feature:SLO label Dec 6, 2023

simianhacker deleted the delay-from-slo branch April 17, 2024 15:38
