
Fix search telemetry to only update SO periodically #93130

Merged
Merged 18 commits · Mar 11, 2021

Conversation

@lukasolson lukasolson (Member)

Summary

Resolves #92055.

We have received several reports that collecting telemetry for search requests (success/error/duration) can cause significant load in some high-demand clusters. Specifically, a large number of requests to update the saved object are issued as search requests return.

This PR updates the behavior so that we only update the saved object at most once every 5 seconds, which should not only reduce the load, but also result in fewer version conflict errors.
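In rough terms, the approach can be sketched as follows. This is a minimal TypeScript sketch with illustrative names (createUsageCollector, peek, flush are hypothetical), not the actual Kibana implementation: counts are buffered in memory and flushed to the saved object on a timer, and the buffer is decremented by what was written rather than zeroed, so searches completing mid-flush are not lost.

```typescript
interface CollectedUsage {
  successCount: number;
  errorCount: number;
  totalDuration: number;
}

function createUsageCollector() {
  const collected: CollectedUsage = { successCount: 0, errorCount: 0, totalDuration: 0 };
  return {
    trackSuccess(tookMs: number) {
      collected.successCount += 1;
      collected.totalDuration += tookMs;
    },
    trackError() {
      collected.errorCount += 1;
    },
    // Intended to run on a timer, e.g. setInterval(() => flush(write), 5000),
    // so the saved object is written at most once per interval.
    async flush(write: (usage: CollectedUsage) => Promise<void>) {
      const snapshot = { ...collected };
      if (snapshot.successCount + snapshot.errorCount === 0) return;
      await write(snapshot);
      // Subtract what was written instead of resetting to zero, so counts
      // tracked while the write was in flight survive to the next flush.
      collected.successCount -= snapshot.successCount;
      collected.errorCount -= snapshot.errorCount;
      collected.totalDuration -= snapshot.totalDuration;
    },
    peek(): CollectedUsage {
      return { ...collected };
    },
  };
}
```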

@lukasolson lukasolson added labels review, Feature:Search, Feature:Telemetry, v8.0.0, Team:AppServices, release_note:skip, v7.12.0 on Mar 1, 2021
@lukasolson lukasolson requested review from lizozom and Dosant March 1, 2021 21:26
@lukasolson lukasolson self-assigned this Mar 1, 2021
@lukasolson lukasolson requested a review from a team as a code owner March 1, 2021 21:26
@elasticmachine (Contributor)

Pinging @elastic/kibana-app-services (Team:AppServices)

@lukasolson lukasolson removed the review label Mar 1, 2021

// Since search requests may have completed while the saved object was being updated, we subtract
// what was just written to the saved object rather than resetting the values to 0
collectedUsage.successCount -= attributes.successCount ?? 0;
@lizozom lizozom (Contributor) Mar 2, 2021

You may have a race condition here: if collectedUsage gets updated while the incrementCounter request is ongoing, you would delete unsaved telemetry.

@lukasolson lukasolson (Member Author)

Should be reset from counterFields rather than attributes
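The fix described above can be sketched as follows (resetCollected and the Counter shape are hypothetical names for illustration). incrementCounter returns the cumulative attributes stored in the saved object, so subtracting attributes would also wipe out counts tracked while the update was in flight; subtracting the counterFields increments that were actually sent removes only what was persisted.

```typescript
interface Counter {
  fieldName: string;
  incrementBy: number;
}

function resetCollected(
  collected: Record<string, number>,
  counterFields: Counter[]
): void {
  for (const { fieldName, incrementBy } of counterFields) {
    // Subtract only the amounts just persisted; anything tracked while the
    // update was ongoing stays in the buffer for the next flush.
    collected[fieldName] -= incrementBy;
  }
}
```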

@@ -64,6 +81,7 @@ export function usageProvider(core: CoreSetup): SearchUsage {
export function searchUsageObserver(logger: Logger, usage?: SearchUsage) {
return {
next(response: IEsSearchResponse) {
if (!isCompleteResponse(response)) return;
Contributor: 👍

@lukasolson lukasolson requested a review from lizozom March 2, 2021 17:36
@lukasolson lukasolson (Member Author)

@elasticmachine merge upstream

@Dosant Dosant (Contributor) left a comment:

Tested; saw telemetry debounced.

@@ -64,6 +82,7 @@ export function usageProvider(core: CoreSetup): SearchUsage {
export function searchUsageObserver(logger: Logger, usage?: SearchUsage) {
return {
next(response: IEsSearchResponse) {
if (!isCompleteResponse(response)) return;
logger.debug(`trackSearchStatus:next ${response.rawResponse.took}`);
usage?.trackSuccess(response.rawResponse.took);
Contributor:

Probably not related to this PR, but I want to point out: when we restore a search session, this took is the original time that the query took.

So assume:

  1. Doing a search the first time - the request completes in ~10ms and took is 10ms
  2. Restoring a search session - the request completes in ~3ms but took is still 10ms

I wonder if we actually want to track this took here.

@lukasolson lukasolson (Member Author)

Good point, we probably don't want to track at all if isRestore is set to true.

.filter(({ incrementBy }) => incrementBy > 0);

try {
await repository.incrementCounter<CollectedUsage>(
@Dosant Dosant (Contributor) Mar 10, 2021
nit: Could it be that repository.incrementCounter takes so long that the next flush starts before it finishes? In that case, the next flush would pick up incorrect stats.
I assume this won't happen, since 5 seconds is a huge margin, so we can leave it as is.

@lukasolson lukasolson (Member Author)

Yeah, I guess that's a possible edge case. I would think Kibana would be pretty much unusable if an incrementCounter request takes more than 5s, so we'll leave it as is for now.

}
}
const collectedUsage: CollectedUsage = {
successCount: 0,
Contributor:

nit: we are dynamically mapping these fields into telemetry objects. I wonder if this dynamic mapping makes it too easy to change a field name without realizing that doing so breaks the telemetry document structure.
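The dynamic mapping being discussed can be illustrated with a sketch (field names are taken from the snippets above; toCounterFields and the explicit USAGE_FIELDS list are hypothetical). Keeping an explicit, typed field list is one way to make a rename a compile-time error rather than a silent telemetry break.

```typescript
interface CollectedUsage {
  successCount: number;
  errorCount: number;
  totalDuration: number;
}

// Explicit field list: renaming a CollectedUsage property without updating
// this list fails to type-check, instead of silently changing the telemetry
// document structure.
const USAGE_FIELDS = ["successCount", "errorCount", "totalDuration"] as const;

function toCounterFields(usage: CollectedUsage) {
  return USAGE_FIELDS
    .map((fieldName) => ({ fieldName, incrementBy: usage[fieldName] }))
    // Skip zero counters so no-op updates are not sent.
    .filter(({ incrementBy }) => incrementBy > 0);
}
```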

Contributor:

I wonder if this might have some implications if we work in clustering mode 🤔
#68626

@kibanamachine (Contributor)

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @lukasolson

@lukasolson lukasolson merged commit 813e16f into elastic:master Mar 11, 2021
lukasolson added a commit to lukasolson/kibana that referenced this pull request Mar 11, 2021
* Fix search telemetry to only update SO periodically

* Handle case when searches completed mid flight

* Fix error in resetting counters

* Update docs

* update docs

* Don't track restored searches

* Update docs

* Update docs

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: Anton Dosov <[email protected]>
lukasolson added a commit to lukasolson/kibana that referenced this pull request Mar 11, 2021
lukasolson added a commit that referenced this pull request Mar 11, 2021
lukasolson added a commit that referenced this pull request Mar 11, 2021
Labels
Feature:Search, Feature:Telemetry, release_note:skip, review, v7.12.0, v7.13.0, v8.0.0

Development
Successfully merging this pull request may close these issues: Improve search telemetry

5 participants