[SLO] Account for the built-in delay for burn rate alerting #169011
Pinging @elastic/actionable-observability (Team: Actionable Observability)
Everything looks good to me. I just have a suggestion for the SLI client and the `startedAt` param added for testing purposes.
Going to run it locally with some timeslice SLOs now.
x-pack/plugins/observability/public/utils/slo/get_delay_in_seconds_from_slo.ts
x-pack/plugins/observability/public/components/slo/error_rate_chart/use_lens_definition.ts
x-pack/plugins/observability/server/domain/services/get_delay_in_seconds_from_slo.ts
💚 Build Succeeded
💔 All backports failed
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI.
…169011)

## Summary

This PR introduces a delay based on:

- The interval of the date histogram, which defaults to `1m`
- The sync delay of the transform, which defaults to `1m`
- The frequency of the transform, which defaults to `1m`

On average, the SLO data is about `180s` behind for occurrences. If the user uses `5m` time slices, then the delay is around `420s`. This PR attempts to mitigate this delay by subtracting `interval + syncDelay + frequency` from `now` in every burn rate calculation, so that the last 5 minutes is aligned with the data. Below is a visualization that shows how much delay we are seeing in an optimal environment.

<img width="884" alt="image" src="https://github.com/elastic/kibana/assets/41702/2a1587cd-c789-403c-97e2-f48c65db2b89">

Using `interval + syncDelay + frequency` accounts for the "best case scenario". Due to the nature of the transform system, the delay varies from a best case of `180s` for occurrences to a worst case of around `240s`, which happens right before the next scheduled query; the transform query runs every `60s`, which accounts for the variation between the worst and best case delay. Since the rules run on a separate schedule, it's hard to know where we are in the `60s` cycle, so the best we can do is account for the "best case".

This PR also fixes elastic#168747

### Note to the reviewer

The changes made to `evaluate.ts` and `build_query.ts` look more extensive than they really are. I felt like elastic#168735 made some unnecessary refactors when they simply could have done a minimal change and left the rest of the code alone; it would have been less risky. This also caused several issues during the merge, which is why I ultimately decided to revert the changes from elastic#168735.
(cherry picked from commit dce8eed)

# Conflicts:
#	x-pack/plugins/observability/server/lib/rules/slo_burn_rate/executor.ts
#	x-pack/plugins/observability/server/lib/rules/slo_burn_rate/lib/build_query.test.ts
#	x-pack/plugins/observability/server/lib/rules/slo_burn_rate/lib/build_query.ts
…169011) (#169843)

# Backport

This will backport the following commits from `main` to `8.11`:

- [[SLO] Account for the built-in delay for burn rate alerting (#169011)](#169011)

<!--- Backport version: 8.9.8 -->

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Chris Cowan"},"sourceCommit":{"committedDate":"2023-10-19T20:09:18Z","sha":"dce8eedf561409c2209bae58ecc330d8f245fa91","branchLabelMapping":{"^v8.12.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:fix","Team: Actionable Observability","backport:prev-minor","v8.12.0"],"number":169011,"url":"https://github.com/elastic/kibana/pull/169011","mergeCommit":{"sha":"dce8eedf561409c2209bae58ecc330d8f245fa91"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v8.12.0","labelRegex":"^v8.12.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/169011","number":169011,"mergeCommit":{"sha":"dce8eedf561409c2209bae58ecc330d8f245fa91"}}]}] BACKPORT-->
## Summary

This PR introduces a delay based on:

- The interval of the date histogram, which defaults to `1m`
- The sync delay of the transform, which defaults to `1m`
- The frequency of the transform, which defaults to `1m`

On average, the SLO data is about `180s` behind for occurrences. If the user uses `5m` time slices, then the delay is around `420s`. This PR attempts to mitigate this delay by subtracting `interval + syncDelay + frequency` from `now` in every burn rate calculation, so that the last 5 minutes is aligned with the data. Below is a visualization that shows how much delay we are seeing in an optimal environment.

<img width="884" alt="image" src="https://github.com/elastic/kibana/assets/41702/2a1587cd-c789-403c-97e2-f48c65db2b89">

Using `interval + syncDelay + frequency` accounts for the "best case scenario". Due to the nature of the transform system, the delay varies from a best case of `180s` for occurrences to a worst case of around `240s`, which happens right before the next scheduled query; the transform query runs every `60s`, which accounts for the variation between the worst and best case delay. Since the rules run on a separate schedule, it's hard to know where we are in the `60s` cycle, so the best we can do is account for the "best case".

This PR also fixes #168747
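As a rough illustration of the arithmetic described in the summary, the sketch below adds up `interval + syncDelay + frequency` and shifts a lookback window back by that amount. The types and function names (`Duration`, `DelaySettings`, `getDelayInSeconds`, `getDelayedRange`) are hypothetical and simplified for this example; they are not the actual Kibana SLO types or the code in `get_delay_in_seconds_from_slo.ts`.

```typescript
// Hypothetical, simplified shapes for illustration only; NOT the Kibana SLO types.
type DurationUnit = "s" | "m" | "h";

interface Duration {
  value: number;
  unit: DurationUnit;
}

interface DelaySettings {
  // Date histogram interval (1m for occurrences, the slice size for time slices).
  interval: Duration;
  // Transform sync delay (defaults to 1m per the PR description).
  syncDelay: Duration;
  // Transform frequency (defaults to 1m per the PR description).
  frequency: Duration;
}

const SECONDS_PER_UNIT: Record<DurationUnit, number> = { s: 1, m: 60, h: 3600 };

function toSeconds(d: Duration): number {
  return d.value * SECONDS_PER_UNIT[d.unit];
}

// "Best case" delay: interval + syncDelay + frequency.
function getDelayInSeconds(settings: DelaySettings): number {
  return (
    toSeconds(settings.interval) +
    toSeconds(settings.syncDelay) +
    toSeconds(settings.frequency)
  );
}

// Shift a lookback window back by the delay so that it covers data
// the transform has most likely already written.
function getDelayedRange(
  now: Date,
  lookback: Duration,
  delayInSeconds: number
): { from: Date; to: Date } {
  const to = new Date(now.getTime() - delayInSeconds * 1000);
  const from = new Date(to.getTime() - toSeconds(lookback) * 1000);
  return { from, to };
}

const oneMinute: Duration = { value: 1, unit: "m" };
const fiveMinutes: Duration = { value: 5, unit: "m" };

const occurrences: DelaySettings = {
  interval: oneMinute,
  syncDelay: oneMinute,
  frequency: oneMinute,
};
const fiveMinuteSlices: DelaySettings = { ...occurrences, interval: fiveMinutes };

console.log(getDelayInSeconds(occurrences)); // 180
console.log(getDelayInSeconds(fiveMinuteSlices)); // 420

// A 5m lookback evaluated at 20:09:00Z with a 180s delay covers 20:01:00Z to 20:06:00Z.
const range = getDelayedRange(
  new Date("2023-10-19T20:09:00Z"),
  fiveMinutes,
  getDelayInSeconds(occurrences)
);
console.log(range.from.toISOString(), range.to.toISOString());
```

Because this only accounts for the best case, a rule that happens to run just before the next transform query could still see data that is up to roughly `60s` further behind, as the summary notes.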
### Note to the reviewer

The changes made to `evaluate.ts` and `build_query.ts` look more extensive than they really are. I felt like #168735 made some unnecessary refactors when they simply could have done a minimal change and left the rest of the code alone; it would have been less risky. This also caused several issues during the merge, which is why I ultimately decided to revert the changes from #168735.