Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Response Ops][Alerting] Use ES client to update rule SO at end of rule run instead of SO client. #193341

Merged
merged 10 commits into from
Sep 30, 2024

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Sep 18, 2024

Resolves #192397

Summary

Updates alerting task runner end of run updates to use the ES client update function for a true partial update instead of the saved objects client update function that performs a GET then an update.

To verify

Create a rule in multiple spaces and ensure they run correctly and their execution status and monitoring history are updated at the end of each run. Because we're performing a partial update on attributes that are not in the AAD, the rule should continue running without any encryption errors.

Risk Matrix

Risk Probability Severity Mitigation/Notes
Updating saved object directly using ES client will break BWC Medium High Response Ops follows an intermediate release strategy for any changes to the rule saved object where schema changes are introduced in an intermediate release before any changes to the saved object are actually made in a followup release. This ensures that any rollbacks that may be required in a release will roll back to a version that is already aware of the new schema. The team is socialized to this strategy as we are requiring users of the alerting framework to also follow this strategy. This should address any backward compatibility issues that might arise by circumventing the saved objects client update function.
Updating saved object directly using ES client will break AAD Medium High An explicit allowlist of non-AAD fields that are allowed to be partially updated has been introduced and any fields not in this allowlist will not be included in the partial update. Any updates to the rule saved object that might break AAD would show up with > 1 execution of a rule and we have a plethora of functional tests that rely on multiple executions of a rule that would flag if there were issues running due to AAD issues.

@ymao1 ymao1 self-assigned this Sep 18, 2024
@ymao1 ymao1 added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v9.0.0 backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) v8.16.0 release_note:skip Skip the PR/issue when compiling release notes labels Sep 18, 2024
@ymao1 ymao1 marked this pull request as ready for review September 18, 2024 19:36
@ymao1 ymao1 requested review from a team as code owners September 18, 2024 19:36
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1 ymao1 requested review from pmuellr and guskovaue September 18, 2024 19:37
@kc13greiner kc13greiner self-requested a review September 18, 2024 20:21
Copy link
Contributor

@kc13greiner kc13greiner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InternalUser question below

Copy link
Contributor

@TinaHeiligers TinaHeiligers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the ES client and not the SO client to do partial updates to a saved object has several risks:

  • saved objects are an abstraction to the raw ES doc, making it complex to transform into the correct shape ES expects.
  • access to the internal ES indices is restricted i.t.o the request scope
  • BWC: partial updates aren’t supported and cannot be guaranteed to work. The SO client takes care of BWC, up and down migrations and all the serialization/deserialization that happens in between
  • The PR description hints to multiple namespaces. Updating a shared object means updating it’s instance in all spaces it is shared in and that requires auth checks.
  • Core had to abandon partial updates in the SO client because of BWC.
  • Core would also have to deal with any general migration issues because of docs that are corrupt from improper partial updates.

@mikecote
Copy link
Contributor

Thanks for underlying the risks, @TinaHeiligers. Our goal is to optimize for performance in this scenario because 1) we need to reduce the number of requests the alerting framework performs to Elasticsearch at scale and 2) we want to increase the task throughput per Kibana node by removing additional processing overhead (CPU bound operations, operations that add delays, etc) in favour of running more tasks. We've seen quite a bit of overhead necessary when using the saved-objects client, which made us investigate places where we may be able to work directly with the indices like we do in the Task Manager.

We've seen great results when prototyping this and it felt the tradeoffs were becoming worthwhile. We've been able to scale down the number of ES and KB nodes when running at scale because of the reduced number of overall requests made. Otherwise at 10x scale, it becomes another 32,000 additional GET requests per minute that Elasticsearch needs to handle and for Kibana to process in this particular scenario.

We don't foresee changing the list of fields often in which we perform partial updates to, if ever that becomes the case, perhaps we can use an intermediate release to do so?

@pmuellr
Copy link
Member

pmuellr commented Sep 19, 2024

  • access to the internal ES indices is restricted i.t.o the request scope

I'm not sure what this means. Maybe we'd need to add additional headers, or specifically use "asInternal" vs user API key to access (or vice versa)?

  • The PR description hints to multiple namespaces. Updating a shared object means updating it’s instance in all spaces it is shared in and that requires auth checks.

The implication from the description is to create several rules, in serveral spaces. Alerting rules are space-specific, they cannot be shared. AFAIK, there are no requests for us to make them shared.

@TinaHeiligers
Copy link
Contributor

The implication from the description is to create several rules, in serveral spaces. Alerting rules are space-specific, they cannot be shared. AFAIK, there are no requests for us to make them shared.

Perfect! Core also has no plans to allow namespace changes in the future, so this isn't a concern here.

@TinaHeiligers
Copy link
Contributor

I'm not against using the ES client for performance gains, as long as the team is fully aware of the risks. Please add a risk matrix to the PR description! It's a way of documenting decisions and sharing knowledge with other teams that might end up referencing this PR as inspiration.
Please make sure there's good code coverage and validation to catch edge cases!

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; left a comment about maybe making the partially-updated attribute list explicit. No human will easily figure it out from the TS types :-). Not sure how feasible that is though ...

options: PartiallyUpdateRuleSavedObjectOptions = {}
): Promise<void> {
// ensure we only have the valid attributes that are not encrypted and are excluded from AAD
const attributeUpdates = omit(attributes, [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems slighty dangerous, in case we forget to include one of the attribues in the lists used below.

Feels like it might be better to have an explicit list of attributes that we ALLOW to be partially updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor

@jeramysoucy jeramysoucy Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree as well. Disclaimer - typically I would categorize something like this as "strongly not recommended", but with intimate knowledge of an ESO type and careful maintenance, this is possible to do safely.

Filtering by an explicit safe list of just the attributes that we expect to modify in this way, and also omitting any encrypted/AAD attributes (just in case), seems most prudent. Even if it requires a little bit more maintenance in the long run, this has potential to go quite poorly, so we should do what we can to prevent that possibility.

cc @azasypkin to get his 2 cents as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ymao1 Can we wait to merge this until Oleg has a chance to take a look?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeramysoucy Yes absolutely! I'll make the changes for the AAD attributes and will wait to merge

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added allowlist in this commit: 4683078

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @azasypkin and he is 👍 with an additional comment about why we're bypassing audit logging (added in 409a2fa) and a followup issue for the Core team to investigate how they might expose a more performance SO client (#194435)

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 20, 2024

Please add a risk matrix to the PR description! It's a way of documenting decisions and sharing knowledge with other teams that might end up referencing this PR as inspiration.

@TinaHeiligers I've added a risk matrix to the PR description. Thanks for flagging!

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 23, 2024

@elasticmachine merge upstream

Copy link
Contributor

@TinaHeiligers TinaHeiligers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly declaring a list of attributes that can be updated mitigates some of the risks involved in using the ES client directly.
LGTM

options: PartiallyUpdateRuleSavedObjectOptions = {}
): Promise<void> {
// ensure we only have the valid attributes that are not encrypted and are excluded from AAD
const attributeUpdates = omit(attributes, [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect!

Copy link
Contributor

@guskovaue guskovaue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally. Looks good to me!

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 30, 2024

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #85 / custom visualizations self changing vis "before all" hook for "should allow updating params via the editor"

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@ymao1 ymao1 merged commit 05926c2 into elastic:main Sep 30, 2024
41 checks passed
@ymao1 ymao1 deleted the alerting/es-partial-update branch September 30, 2024 14:40
@kibanamachine
Copy link
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11108440052

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Sep 30, 2024
…le run instead of SO client. (elastic#193341)

Resolves elastic#192397

## Summary

Updates alerting task runner end of run updates to use the ES client
update function for a true partial update instead of the saved objects
client update function that performs a GET then an update.

## To verify
Create a rule in multiple spaces and ensure they run correctly and their
execution status and monitoring history are updated at the end of each
run. Because we're performing a partial update on attributes that are
not in the AAD, the rule should continue running without any encryption
errors.

## Risk Matrix

| Risk | Probability | Severity | Mitigation/Notes |

|---------------------------|-------------|----------|-------------------------|
| Updating saved object directly using ES client will break BWC | Medium
| High | Response Ops follows an intermediate release strategy for any
changes to the rule saved object where schema changes are introduced in
an intermediate release before any changes to the saved object are
actually made in a followup release. This ensures that any rollbacks
that may be required in a release will roll back to a version that is
already aware of the new schema. The team is socialized to this strategy
as we are requiring users of the alerting framework to also follow this
strategy. This should address any backward compatibility issues that
might arise by circumventing the saved objects client update function. |
| Updating saved object directly using ES client will break AAD | Medium
| High | An explicit allowlist of non-AAD fields that are allowed to be
partially updated has been introduced and any fields not in this
allowlist will not be included in the partial update. Any updates to the
rule saved object that might break AAD would show up with > 1 execution
of a rule and we have a plethora of functional tests that rely on
multiple executions of a rule that would flag if there were issues
running due to AAD issues. |

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
(cherry picked from commit 05926c2)
@kibanamachine
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Sep 30, 2024
… of rule run instead of SO client. (#193341) (#194444)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Response Ops][Alerting] Use ES client to update rule SO at end of
rule run instead of SO client.
(#193341)](#193341)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ying
Mao","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-09-30T14:40:02Z","message":"[Response
Ops][Alerting] Use ES client to update rule SO at end of rule run
instead of SO client. (#193341)\n\nResolves
https://github.com/elastic/kibana/issues/192397\r\n\r\n##
Summary\r\n\r\nUpdates alerting task runner end of run updates to use
the ES client\r\nupdate function for a true partial update instead of
the saved objects\r\nclient update function that performs a GET then an
update.\r\n\r\n## To verify\r\nCreate a rule in multiple spaces and
ensure they run correctly and their\r\nexecution status and monitoring
history are updated at the end of each\r\nrun. Because we're performing
a partial update on attributes that are\r\nnot in the AAD, the rule
should continue running without any encryption\r\nerrors.\r\n\r\n## Risk
Matrix\r\n\r\n| Risk | Probability | Severity | Mitigation/Notes
|\r\n\r\n|---------------------------|-------------|----------|-------------------------|\r\n|
Updating saved object directly using ES client will break BWC |
Medium\r\n| High | Response Ops follows an intermediate release strategy
for any\r\nchanges to the rule saved object where schema changes are
introduced in\r\nan intermediate release before any changes to the saved
object are\r\nactually made in a followup release. This ensures that any
rollbacks\r\nthat may be required in a release will roll back to a
version that is\r\nalready aware of the new schema. The team is
socialized to this strategy\r\nas we are requiring users of the alerting
framework to also follow this\r\nstrategy. This should address any
backward compatibility issues that\r\nmight arise by circumventing the
saved objects client update function. |\r\n| Updating saved object
directly using ES client will break AAD | Medium\r\n| High | An explicit
allowlist of non-AAD fields that are allowed to be\r\npartially updated
has been introduced and any fields not in this\r\nallowlist will not be
included in the partial update. Any updates to the\r\nrule saved object
that might break AAD would show up with > 1 execution\r\nof a rule and
we have a plethora of functional tests that rely on\r\nmultiple
executions of a rule that would flag if there were issues\r\nrunning due
to AAD issues. |\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine
<[email protected]>\r\nCo-authored-by:
Elastic Machine
<[email protected]>","sha":"05926c20c57b7abc69c6c068d5733f29306f73ba","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["Feature:Alerting","release_note:skip","Team:ResponseOps","v9.0.0","backport:prev-minor","v8.16.0"],"title":"[Response
Ops][Alerting] Use ES client to update rule SO at end of rule run
instead of SO
client.","number":193341,"url":"https://github.com/elastic/kibana/pull/193341","mergeCommit":{"message":"[Response
Ops][Alerting] Use ES client to update rule SO at end of rule run
instead of SO client. (#193341)\n\nResolves
https://github.com/elastic/kibana/issues/192397\r\n\r\n##
Summary\r\n\r\nUpdates alerting task runner end of run updates to use
the ES client\r\nupdate function for a true partial update instead of
the saved objects\r\nclient update function that performs a GET then an
update.\r\n\r\n## To verify\r\nCreate a rule in multiple spaces and
ensure they run correctly and their\r\nexecution status and monitoring
history are updated at the end of each\r\nrun. Because we're performing
a partial update on attributes that are\r\nnot in the AAD, the rule
should continue running without any encryption\r\nerrors.\r\n\r\n## Risk
Matrix\r\n\r\n| Risk | Probability | Severity | Mitigation/Notes
|\r\n\r\n|---------------------------|-------------|----------|-------------------------|\r\n|
Updating saved object directly using ES client will break BWC |
Medium\r\n| High | Response Ops follows an intermediate release strategy
for any\r\nchanges to the rule saved object where schema changes are
introduced in\r\nan intermediate release before any changes to the saved
object are\r\nactually made in a followup release. This ensures that any
rollbacks\r\nthat may be required in a release will roll back to a
version that is\r\nalready aware of the new schema. The team is
socialized to this strategy\r\nas we are requiring users of the alerting
framework to also follow this\r\nstrategy. This should address any
backward compatibility issues that\r\nmight arise by circumventing the
saved objects client update function. |\r\n| Updating saved object
directly using ES client will break AAD | Medium\r\n| High | An explicit
allowlist of non-AAD fields that are allowed to be\r\npartially updated
has been introduced and any fields not in this\r\nallowlist will not be
included in the partial update. Any updates to the\r\nrule saved object
that might break AAD would show up with > 1 execution\r\nof a rule and
we have a plethora of functional tests that rely on\r\nmultiple
executions of a rule that would flag if there were issues\r\nrunning due
to AAD issues. |\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine
<[email protected]>\r\nCo-authored-by:
Elastic Machine
<[email protected]>","sha":"05926c20c57b7abc69c6c068d5733f29306f73ba"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/193341","number":193341,"mergeCommit":{"message":"[Response
Ops][Alerting] Use ES client to update rule SO at end of rule run
instead of SO client. (#193341)\n\nResolves
https://github.com/elastic/kibana/issues/192397\r\n\r\n##
Summary\r\n\r\nUpdates alerting task runner end of run updates to use
the ES client\r\nupdate function for a true partial update instead of
the saved objects\r\nclient update function that performs a GET then an
update.\r\n\r\n## To verify\r\nCreate a rule in multiple spaces and
ensure they run correctly and their\r\nexecution status and monitoring
history are updated at the end of each\r\nrun. Because we're performing
a partial update on attributes that are\r\nnot in the AAD, the rule
should continue running without any encryption\r\nerrors.\r\n\r\n## Risk
Matrix\r\n\r\n| Risk | Probability | Severity | Mitigation/Notes
|\r\n\r\n|---------------------------|-------------|----------|-------------------------|\r\n|
Updating saved object directly using ES client will break BWC |
Medium\r\n| High | Response Ops follows an intermediate release strategy
for any\r\nchanges to the rule saved object where schema changes are
introduced in\r\nan intermediate release before any changes to the saved
object are\r\nactually made in a followup release. This ensures that any
rollbacks\r\nthat may be required in a release will roll back to a
version that is\r\nalready aware of the new schema. The team is
socialized to this strategy\r\nas we are requiring users of the alerting
framework to also follow this\r\nstrategy. This should address any
backward compatibility issues that\r\nmight arise by circumventing the
saved objects client update function. |\r\n| Updating saved object
directly using ES client will break AAD | Medium\r\n| High | An explicit
allowlist of non-AAD fields that are allowed to be\r\npartially updated
has been introduced and any fields not in this\r\nallowlist will not be
included in the partial update. Any updates to the\r\nrule saved object
that might break AAD would show up with > 1 execution\r\nof a rule and
we have a plethora of functional tests that rely on\r\nmultiple
executions of a rule that would flag if there were issues\r\nrunning due
to AAD issues. |\r\n\r\n---------\r\n\r\nCo-authored-by: kibanamachine
<[email protected]>\r\nCo-authored-by:
Elastic Machine
<[email protected]>","sha":"05926c20c57b7abc69c6c068d5733f29306f73ba"}},{"branch":"8.x","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Ying Mao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.16.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Perform true partial updates when updating the alerting rule after execution
10 participants