Fix search telemetry to only update SO periodically #93130
Conversation
Pinging @elastic/kibana-app-services (Team:AppServices)
// Since search requests may have completed while the saved object was being updated, we minus
// what was just updated in the saved object rather than resetting the values to 0
collectedUsage.successCount -= attributes.successCount ?? 0;
You may have a race condition here: if `collectedUsage` gets updated while the `incrementCounter` request is ongoing, you would delete unsaved telemetry.
Should be reset from `counterFields` rather than `attributes`.
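To illustrate the fix being suggested here, the following is a minimal sketch (the `CollectedUsage` shape and the repository callback are simplified assumptions, not the PR's exact code): the counts about to be persisted are snapshotted into `counterFields` before the async call, and only that snapshot is subtracted afterwards, so searches tracked while `incrementCounter` was in flight are not lost.

```ts
// A minimal sketch, not the PR's exact code. `CollectedUsage` and the
// incrementCounter callback are simplified assumptions for illustration.
interface CollectedUsage {
  successCount: number;
  errorCount: number;
  totalDuration: number;
}

interface CounterField {
  fieldName: keyof CollectedUsage;
  incrementBy: number;
}

async function flushCollectedUsage(
  incrementCounter: (fields: CounterField[]) => Promise<void>,
  collectedUsage: CollectedUsage
): Promise<void> {
  // Snapshot what we are about to persist *before* awaiting the update.
  const counterFields: CounterField[] = (
    Object.entries(collectedUsage) as Array<[keyof CollectedUsage, number]>
  )
    .map(([fieldName, incrementBy]) => ({ fieldName, incrementBy }))
    .filter(({ incrementBy }) => incrementBy > 0);

  if (!counterFields.length) return;

  await incrementCounter(counterFields);

  // Subtract only the snapshot (`counterFields`), not what the saved object
  // returned; counts collected while the request was in flight are kept for
  // the next flush.
  counterFields.forEach(({ fieldName, incrementBy }) => {
    collectedUsage[fieldName] -= incrementBy;
  });
}
```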
@@ -64,6 +81,7 @@ export function usageProvider(core: CoreSetup): SearchUsage {
export function searchUsageObserver(logger: Logger, usage?: SearchUsage) {
  return {
    next(response: IEsSearchResponse) {
      if (!isCompleteResponse(response)) return;
👍
@elasticmachine merge upstream

# Conflicts:
#	api_docs/core.json
#	api_docs/core_http.json
#	api_docs/fleet.json
Tested; I can see the telemetry being debounced.
@@ -64,6 +82,7 @@ export function usageProvider(core: CoreSetup): SearchUsage {
export function searchUsageObserver(logger: Logger, usage?: SearchUsage) {
  return {
    next(response: IEsSearchResponse) {
      if (!isCompleteResponse(response)) return;
      logger.debug(`trackSearchStatus:next ${response.rawResponse.took}`);
      usage?.trackSuccess(response.rawResponse.took);
Probably not related to this PR, but I want to point out: when we restore a search session, this `took` is the original time that the query took. So assume:

- Doing a search the first time: the request completes in ~10ms and `took` is 10ms.
- Restoring a search session: the request completes in ~3ms and `took` is still 10ms.

I wonder if we actually want to track this `took` here.
Good point, we probably don't want to track at all if `isRestore` is set to `true`.
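As a sketch of that suggestion (the `isRestore` flag and the minimal response/usage types below are assumptions for illustration, not the PR's actual signature), the observer could bail out before tracking when the response comes from a restored session:

```ts
// A minimal sketch with simplified stand-ins for Kibana's Logger,
// IEsSearchResponse, and SearchUsage types; `isRestore` is assumed to be
// threaded in from the search options.
interface MinimalLogger {
  debug(message: string): void;
}

interface MinimalSearchResponse {
  isPartial?: boolean;
  isRunning?: boolean;
  rawResponse: { took: number };
}

interface MinimalSearchUsage {
  trackSuccess(took: number): void;
  trackError(): void;
}

export function searchUsageObserver(
  logger: MinimalLogger,
  usage?: MinimalSearchUsage,
  { isRestore = false }: { isRestore?: boolean } = {}
) {
  return {
    next(response: MinimalSearchResponse) {
      // Skip partial/in-progress responses, and skip restored sessions
      // entirely: their `took` reflects the original query, not this request.
      const isComplete = !response.isPartial && !response.isRunning;
      if (!isComplete || isRestore) return;
      logger.debug(`trackSearchStatus:next ${response.rawResponse.took}`);
      usage?.trackSuccess(response.rawResponse.took);
    },
    error() {
      usage?.trackError();
    },
  };
}
```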
  .filter(({ incrementBy }) => incrementBy > 0);

try {
  await repository.incrementCounter<CollectedUsage>(
nit: Could it be that `repository.incrementCounter` takes so long that the next request starts before it finishes? In that case, the next request would pick up incorrect stats. I assume this won't happen, as 5 seconds is a huge margin, so we can leave it as is.
Yeah, I guess that's a possible edge case. I would think Kibana would be pretty much unusable if an `incrementCounter` request takes more than 5s, so we'll leave it as is for now.
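If that edge case ever needed handling, one option (purely illustrative, not part of this PR) would be to skip a tick while a previous flush is still in flight:

```ts
// Illustrative only: a guard so a slow incrementCounter call can't overlap
// with the next scheduled flush. `flush` stands in for the hypothetical
// flushCollectedUsage helper sketched earlier.
let flushInFlight = false;

async function flushOncePerTick(flush: () => Promise<void>): Promise<void> {
  if (flushInFlight) return; // previous update still running; try again next tick
  flushInFlight = true;
  try {
    await flush();
  } finally {
    flushInFlight = false;
  }
}
```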
  }
}
const collectedUsage: CollectedUsage = {
  successCount: 0,
nit: we are dynamically mapping these fields into telemetry objects. I wonder if this dynamic mapping makes it very easy to change field names without realizing that this breaks the telemetry log structure.
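One way to make such a rename a compile-time error (an illustrative sketch, not code from this PR) is to derive the persisted field names from the `CollectedUsage` type itself:

```ts
// Illustrative sketch: tie the persisted counter field names to the
// CollectedUsage interface, so renaming a field breaks compilation instead of
// silently changing the telemetry structure.
interface CollectedUsage {
  successCount: number;
  errorCount: number;
  totalDuration: number;
}

const collectedUsageFields: Array<keyof CollectedUsage> = [
  'successCount',
  'errorCount',
  'totalDuration',
];
```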
I wonder if this might have some implications if we work in clustering mode 🤔
#68626
💚 Build Succeeded
Metrics [docs]
To update your PR or re-run it, just comment with:
cc @lukasolson
* Fix search telemetry to only update SO periodically
* Handle case when searches completed mid flight
* Fix error in resetting counters
* Update docs
* update docs
* Don't track restored searches
* Update docs
* Update docs

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: Anton Dosov <[email protected]>
Summary
Resolves #92055.
We have received a few reports that collecting telemetry for search requests (success/error/duration) can cause significant load in some high-demand clusters. Specifically, the saved object is updated every time a search request returns, which can generate a large number of update requests.
This PR updates the behavior so that we only update the saved object at most once every 5 seconds, which should not only reduce the load, but also result in fewer version conflict errors.
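For context on what such a periodic flush can look like, here is a sketch under assumed names (not the PR's actual code, which lives in the data plugin's search collectors): counters accumulate in memory on every search, and a timer persists them at most once per interval.

```ts
// Sketch of a debounced/periodic saved-object update, with assumed names.
// In-memory counters are incremented synchronously on every search, and a
// 5-second interval persists them via a flush helper like the one sketched
// earlier in this conversation.
const DEFAULT_INTERVAL_MS = 5 * 1000;

function startPeriodicUsageFlush(
  flush: () => Promise<void>,
  intervalMs: number = DEFAULT_INTERVAL_MS
): () => void {
  const timer = setInterval(() => {
    // Errors are swallowed so a failed update simply retries on the next tick.
    flush().catch(() => {});
  }, intervalMs);
  // Return a stop function so the caller can clear the interval on shutdown.
  return () => clearInterval(timer);
}
```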