Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SLO] Account for the built-in delay for burn rate alerting #169011

Merged
merged 12 commits into from
Oct 19, 2023

Conversation

simianhacker
Copy link
Member

@simianhacker simianhacker commented Oct 16, 2023

Summary

This PR introduces a delay based on:

  • The interval of the date histogram which defaults to 1m
  • The sync delay of the transform which defaults to 1m
  • The frequency of the transform which defaults to 1m

On average the SLO data is about 180s behind for occurances. If the user uses 5m time slices then the delay is around 420s. This PR attempts to mitigate this delay by subtracting the interval + syncDelay + frequency from now on any calculation for the burn rate so that the last 5 minutes is aligned with the data. Below is a visualization that shows how much delay we are seeing in an optimal environment.

image

Using interval + syncDelay + frequency accounts for the "best case scenario". Due to the nature of the transform system, the delays varies from best case of 180s for occurrences to worst case of around 240s which happens right before the next scheduled query; the transform query runs every 60s which accounts for the variation between the worst and best case delay. Since the rules run on a seperate schedule, it's hard to know where we are in the 60s cycle so the best we can do is account for the "best case".

This PR also fixes #168747

Note to the reviewer

The changes made to evaluate.ts and build_query.ts look more extensive than they really are. I felt like #168735 made some unnecessary refactors when they simply could have done a minimal change and left the rest of the code alone; it would have been less risky. This also cause several issues during the merge which is why I ultimately decided to revert the changes from #168735.

@apmmachine
Copy link
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • /oblt-deploy-serverless : Deploy a serverless Kibana instance using the Observability test environments.
  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@simianhacker simianhacker marked this pull request as ready for review October 16, 2023 22:10
@simianhacker simianhacker requested a review from a team as a code owner October 16, 2023 22:10
@simianhacker simianhacker added release_note:fix backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) v8.12.0 Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" labels Oct 16, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/actionable-observability (Team: Actionable Observability)

Copy link
Contributor

@kdelemme kdelemme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me. I have just a suggestion for the sli client and the added startedAt param for testing purpose.
Going to run it locally with some timeslice SLO now

Copy link
Contributor

@kdelemme kdelemme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally and work as expected 👍🏻
Well done with the refactoring of the burn rate executor 👍🏻

image

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
observability 486 487 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
observability 1.1MB 1.1MB +611.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@simianhacker simianhacker merged commit dce8eed into elastic:main Oct 19, 2023
@kibanamachine
Copy link
Contributor

💔 All backports failed

Status Branch Result
8.11 Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 169011

Questions ?

Please refer to the Backport tool documentation

@simianhacker
Copy link
Member Author

💚 All backports created successfully

Status Branch Result
8.11

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

simianhacker added a commit to simianhacker/kibana that referenced this pull request Oct 25, 2023
…169011)

## Summary

This PR introduces a delay based on:

- The interval of the date histogram which defaults to `1m`
- The sync delay of the transform which defaults to `1m`
- The frequency of the transform which defaults to `1m`

On average the SLO data is about `180s` behind for occurances. If the
user uses `5m` time slices then the delay is around `420s`. This PR
attempts to mitigate this delay by subtracting the `interval + syncDelay
+ frequency` from `now` on any calculation for the burn rate so that the
last 5 minutes is aligned with the data. Below is a visualization that
shows how much delay we are seeing in an optimal environment.

<img width="884" alt="image"
src="https://github.com/elastic/kibana/assets/41702/2a1587cd-c789-403c-97e2-f48c65db2b89">

Using `interval + syncDelay + frequency` accounts for the "best case
scenario". Due to the nature of the transform system, the delays varies
from best case of `180s` for occurrences to worst case of around `240s`
which happens right before the next scheduled query; the transform query
runs every `60s` which accounts for the variation between the worst and
best case delay. Since the rules run on a seperate schedule, it's hard
to know where we are in the `60s` cycle so the best we can do is account
for the "best case".

This PR also fixes elastic#168747

### Note to the reviewer

The changes made to `evaluate.ts` and `build_query.ts` look more
extensive than they really are. I felt like elastic#168735 made some
unnecessary refactors when they simply could have done a minimal change
and left the rest of the code alone; it would have been less risky. This
also cause several issues during the merge which is why I ultimately
decided to revert the changes from elastic#168735.

(cherry picked from commit dce8eed)

# Conflicts:
#	x-pack/plugins/observability/server/lib/rules/slo_burn_rate/executor.ts
#	x-pack/plugins/observability/server/lib/rules/slo_burn_rate/lib/build_query.test.ts
#	x-pack/plugins/observability/server/lib/rules/slo_burn_rate/lib/build_query.ts
simianhacker added a commit that referenced this pull request Oct 25, 2023
…169011) (#169843)

# Backport

This will backport the following commits from `main` to `8.11`:
- [[SLO] Account for the built-in delay for burn rate alerting
(#169011)](#169011)

<!--- Backport version: 8.9.8 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Chris
Cowan","email":"[email protected]"},"sourceCommit":{"committedDate":"2023-10-19T20:09:18Z","message":"[SLO]
Account for the built-in delay for burn rate alerting (#169011)\n\n##
Summary\r\n\r\nThis PR introduces a delay based on:\r\n\r\n- The
interval of the date histogram which defaults to `1m`\r\n- The sync
delay of the transform which defaults to `1m`\r\n- The frequency of the
transform which defaults to `1m`\r\n\r\nOn average the SLO data is about
`180s` behind for occurances. If the\r\nuser uses `5m` time slices then
the delay is around `420s`. This PR\r\nattempts to mitigate this delay
by subtracting the `interval + syncDelay\r\n+ frequency` from `now` on
any calculation for the burn rate so that the\r\nlast 5 minutes is
aligned with the data. Below is a visualization that\r\nshows how much
delay we are seeing in an optimal environment.\r\n\r\n<img width=\"884\"
alt=\"image\"\r\nsrc=\"https://github.com/elastic/kibana/assets/41702/2a1587cd-c789-403c-97e2-f48c65db2b89\">\r\n\r\nUsing
`interval + syncDelay + frequency` accounts for the \"best
case\r\nscenario\". Due to the nature of the transform system, the
delays varies\r\nfrom best case of `180s` for occurrences to worst case
of around `240s`\r\nwhich happens right before the next scheduled query;
the transform query\r\nruns every `60s` which accounts for the variation
between the worst and\r\nbest case delay. Since the rules run on a
seperate schedule, it's hard\r\nto know where we are in the `60s` cycle
so the best we can do is account\r\nfor the \"best case\".\r\n\r\nThis
PR also fixes #168747\r\n\r\n### Note to the reviewer\r\n\r\nThe changes
made to `evaluate.ts` and `build_query.ts` look more\r\nextensive than
they really are. I felt like #168735 made some\r\nunnecessary refactors
when they simply could have done a minimal change\r\nand left the rest
of the code alone; it would have been less risky. This\r\nalso cause
several issues during the merge which is why I ultimately\r\ndecided to
revert the changes from
#168735.","sha":"dce8eedf561409c2209bae58ecc330d8f245fa91","branchLabelMapping":{"^v8.12.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:fix","Team:
Actionable
Observability","backport:prev-minor","v8.12.0"],"number":169011,"url":"https://github.com/elastic/kibana/pull/169011","mergeCommit":{"message":"[SLO]
Account for the built-in delay for burn rate alerting (#169011)\n\n##
Summary\r\n\r\nThis PR introduces a delay based on:\r\n\r\n- The
interval of the date histogram which defaults to `1m`\r\n- The sync
delay of the transform which defaults to `1m`\r\n- The frequency of the
transform which defaults to `1m`\r\n\r\nOn average the SLO data is about
`180s` behind for occurances. If the\r\nuser uses `5m` time slices then
the delay is around `420s`. This PR\r\nattempts to mitigate this delay
by subtracting the `interval + syncDelay\r\n+ frequency` from `now` on
any calculation for the burn rate so that the\r\nlast 5 minutes is
aligned with the data. Below is a visualization that\r\nshows how much
delay we are seeing in an optimal environment.\r\n\r\n<img width=\"884\"
alt=\"image\"\r\nsrc=\"https://github.com/elastic/kibana/assets/41702/2a1587cd-c789-403c-97e2-f48c65db2b89\">\r\n\r\nUsing
`interval + syncDelay + frequency` accounts for the \"best
case\r\nscenario\". Due to the nature of the transform system, the
delays varies\r\nfrom best case of `180s` for occurrences to worst case
of around `240s`\r\nwhich happens right before the next scheduled query;
the transform query\r\nruns every `60s` which accounts for the variation
between the worst and\r\nbest case delay. Since the rules run on a
seperate schedule, it's hard\r\nto know where we are in the `60s` cycle
so the best we can do is account\r\nfor the \"best case\".\r\n\r\nThis
PR also fixes #168747\r\n\r\n### Note to the reviewer\r\n\r\nThe changes
made to `evaluate.ts` and `build_query.ts` look more\r\nextensive than
they really are. I felt like #168735 made some\r\nunnecessary refactors
when they simply could have done a minimal change\r\nand left the rest
of the code alone; it would have been less risky. This\r\nalso cause
several issues during the merge which is why I ultimately\r\ndecided to
revert the changes from
#168735.","sha":"dce8eedf561409c2209bae58ecc330d8f245fa91"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v8.12.0","labelRegex":"^v8.12.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/169011","number":169011,"mergeCommit":{"message":"[SLO]
Account for the built-in delay for burn rate alerting (#169011)\n\n##
Summary\r\n\r\nThis PR introduces a delay based on:\r\n\r\n- The
interval of the date histogram which defaults to `1m`\r\n- The sync
delay of the transform which defaults to `1m`\r\n- The frequency of the
transform which defaults to `1m`\r\n\r\nOn average the SLO data is about
`180s` behind for occurances. If the\r\nuser uses `5m` time slices then
the delay is around `420s`. This PR\r\nattempts to mitigate this delay
by subtracting the `interval + syncDelay\r\n+ frequency` from `now` on
any calculation for the burn rate so that the\r\nlast 5 minutes is
aligned with the data. Below is a visualization that\r\nshows how much
delay we are seeing in an optimal environment.\r\n\r\n<img width=\"884\"
alt=\"image\"\r\nsrc=\"https://github.com/elastic/kibana/assets/41702/2a1587cd-c789-403c-97e2-f48c65db2b89\">\r\n\r\nUsing
`interval + syncDelay + frequency` accounts for the \"best
case\r\nscenario\". Due to the nature of the transform system, the
delays varies\r\nfrom best case of `180s` for occurrences to worst case
of around `240s`\r\nwhich happens right before the next scheduled query;
the transform query\r\nruns every `60s` which accounts for the variation
between the worst and\r\nbest case delay. Since the rules run on a
seperate schedule, it's hard\r\nto know where we are in the `60s` cycle
so the best we can do is account\r\nfor the \"best case\".\r\n\r\nThis
PR also fixes #168747\r\n\r\n### Note to the reviewer\r\n\r\nThe changes
made to `evaluate.ts` and `build_query.ts` look more\r\nextensive than
they really are. I felt like #168735 made some\r\nunnecessary refactors
when they simply could have done a minimal change\r\nand left the rest
of the code alone; it would have been less risky. This\r\nalso cause
several issues during the merge which is why I ultimately\r\ndecided to
revert the changes from
#168735.","sha":"dce8eedf561409c2209bae58ecc330d8f245fa91"}}]}]
BACKPORT-->
@simianhacker simianhacker deleted the delay-from-slo branch April 17, 2024 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) Feature:SLO release_note:fix Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" v8.11.0 v8.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SLO] Error rate chart should use the minimum interval of the SLO
6 participants