[ResponseOps][mget] Poll for tasks less frequently when the task load doesn't need it #200260

Open
wants to merge 20 commits into main
Conversation

doakalexi
Contributor

@doakalexi doakalexi commented Nov 14, 2024

Resolves #196584

Summary

This PR updates the task poll interval logic for projects using the mget strategy to optimize request loads to Elasticsearch, particularly for smaller projects with low utilization. When task manager (TM) utilization is below 25%, the poll interval will be set to 3 seconds instead of the current 500 milliseconds. This change does not affect projects utilizing update_by_query.

The existing backpressure logic remains unchanged for handling errors. The only adjustment occurs in scenarios where there are no errors, the TM utilization is below 25%, and the poll interval is less than 3 seconds. In such cases, the poll interval will increase to 3 seconds, even if the backpressure logic has not fully reset the interval to its original value.

I just chose 25%, but I am definitely open to other ideas.
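For anyone skimming, a minimal sketch of the decision described above might look like the following. The names (computeNextPollInterval, PollState) and the exact condition for dropping back to the fast interval are illustrative assumptions, not the actual task manager code:

```ts
// Hypothetical sketch of the poll-interval adjustment described in this PR.
// The constants mirror the numbers above; everything else is illustrative.
const FAST_INTERVAL_MS = 500; // current mget poll interval
const SLOW_INTERVAL_MS = 3000; // relaxed interval for low load
const LOW_UTILIZATION_THRESHOLD = 25; // percent; the value picked in this PR

interface PollState {
  currentIntervalMs: number;
  errorSeen: boolean; // errors stay on the existing backpressure path
  tmUtilization: number; // task manager utilization, 0-100
}

function computeNextPollInterval(state: PollState): number {
  if (state.errorSeen) {
    // Backpressure logic is unchanged; this sketch leaves the interval alone.
    return state.currentIntervalMs;
  }
  if (state.tmUtilization < LOW_UTILIZATION_THRESHOLD && state.currentIntervalMs < SLOW_INTERVAL_MS) {
    // Low load: relax to 3s even if backpressure hasn't fully reset yet.
    return SLOW_INTERVAL_MS;
  }
  if (state.tmUtilization >= LOW_UTILIZATION_THRESHOLD) {
    // Load picked back up: return to the fast interval.
    return FAST_INTERVAL_MS;
  }
  return state.currentIntervalMs;
}

// Example: a quiet deployment at 10% utilization relaxes from 500ms to 3000ms.
console.log(computeNextPollInterval({ currentIntervalMs: 500, errorSeen: false, tmUtilization: 10 })); // 3000
```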

Checklist

To verify

  • Start Kibana, go to http://localhost:5601/api/task_manager/_health, and verify the poll interval is 3s
  • Create some alerting rules scheduled to run every second and let them run. I created 4 rules.
  • Check http://localhost:5601/api/task_manager/_health again to verify that, with rules running, the poll interval is back to 500ms. (It may take a couple of refreshes for the health API to reflect the change; see the sketch after this list for a quick way to check.)
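For convenience, something like the following can print the effective interval while testing. The credentials and the JSON path (stats.configuration.value.poll_interval) are assumptions about a local dev setup and may differ between versions:

```ts
// Quick, hypothetical helper for the verification steps above (Node 18+ for global fetch).
// Assumes a local dev Kibana with basic auth elastic/changeme and that the health
// payload exposes the interval at stats.configuration.value.poll_interval.
async function printPollInterval(): Promise<void> {
  const res = await fetch('http://localhost:5601/api/task_manager/_health', {
    headers: { Authorization: 'Basic ' + Buffer.from('elastic:changeme').toString('base64') },
  });
  const health: any = await res.json();
  console.log('poll_interval:', health?.stats?.configuration?.value?.poll_interval);
}

printPollInterval().catch(console.error);
```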

@doakalexi doakalexi changed the title from "Updating task manager to increase poll interval when utilization is low" to "[ResponseOps][mget] Poll for tasks less frequently when the task load doesn't need it" Nov 14, 2024
@doakalexi doakalexi added the Team:ResponseOps, release_note:skip, backport:prev-minor, v8.17.0, and v9.0.0 labels Nov 21, 2024
@doakalexi doakalexi marked this pull request as ready for review November 21, 2024 17:27
@doakalexi doakalexi requested a review from a team as a code owner November 21, 2024 17:27
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

Contributor

@js-jankisalvi js-jankisalvi left a comment


Verified locally, works as expected 👍

@doakalexi doakalexi added v8.18.0 and removed v8.17.0 labels Nov 25, 2024
@pmuellr
Member

pmuellr commented Nov 26, 2024

I ran with this PR and created 10 rules running 1/s. I seemed to get 4-6 of the following messages per minute: Poll interval configuration is temporarily increased after a decrease in the task load. Sometimes 3 times within one second.

When I doubled it to 20 rules, it seemed like it was about the same volume.

I didn't look to see what's going on there, but I'm wondering if there's some race condition going on.

It also seems confusing that we don't generate a message when we switch back to "fast" mode.

Looking at the requirements from the issue #196584 :

We'll need to ensure the Kibana nodes are still running tasks evenly so one doesn't poll frequently and run all the tasks while the other doesn't poll frequently and doesn't find many tasks to run
Poll interval is 3s whenever the task load doesn't need a 500ms poll interval

Will using task utilization take care of keeping Kibana nodes running tasks evenly? Is running evenly really required anyway?

Did we consider any alternatives to using the task utilization? I was wondering if we could "read ahead" in the task claimer - usually we ask for tasks ready with run_at < ${now}, but we could ask for < ${now + 3s} instead. We wouldn't run any of those rules > ${now}, but we could use them as an indication of whether to delay the next task poll by 3s.
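To make the read-ahead idea concrete, here is a rough sketch; it is not the actual claiming query, and the task.runAt field path and the 3s constant are assumptions for illustration:

```ts
// Sketch of the "read ahead" alternative: widen the claim window by 3s and use
// the result purely as a signal for the next poll delay. Illustrative only.
const READ_AHEAD_MS = 3000;

// Query filter equivalent to run_at <= now + 3s instead of run_at <= now.
function buildReadAheadRangeFilter(now: Date) {
  return {
    range: {
      'task.runAt': { lte: new Date(now.getTime() + READ_AHEAD_MS).toISOString() },
    },
  };
}

// Tasks due now still run as usual; tasks that only matched because of the extra
// window are not run, they just tell us the fast interval is still worth it.
function nextPollDelayMs(matchedCount: number, fastMs = 500, slowMs = 3000): number {
  return matchedCount > 0 ? fastMs : slowMs;
}

console.log(buildReadAheadRangeFilter(new Date()));
console.log(nextPollDelayMs(0)); // 3000: nothing due within the next 3s
console.log(nextPollDelayMs(5)); // 500: keep polling quickly
```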

@doakalexi
Contributor Author

I didn't look to see what's going on there, but I'm wondering if there's some race condition going on.

Thanks for taking a look! I'll double-check what's going on there.

It also seems confusing that we don't generate a message when we switch back to "fast" mode.

I will add this.

Looking at the requirements from the issue #196584 :

We'll need to ensure the Kibana nodes are still running tasks evenly so one doesn't poll frequently and run all the tasks while the other doesn't poll frequently and doesn't find many tasks to run
Poll interval is 3s whenever the task load doesn't need a 500ms poll interval

Will using task utilization take care of keeping Kibana nodes running tasks evenly? Is running evenly really required anyway?

I am not sure; I didn't know the best way to ensure that the nodes run the tasks evenly. I am thinking that because the tasks are partitioned, this would not be as big of an issue?

Did we consider any alternatives to using the task utilization? I was wondering if we could "read ahead" in the task claimer - usually we ask for tasks ready with run_at < ${now}, but we could ask for < ${now + 3s} instead. We wouldn't run any of those rules > ${now}, but we could use them as an indication of whether to delay the next task poll by 3s.

I saw some ideas in the issue, but I decided to use utilization right away. I am happy to change it if you think reading ahead is better.

@mikecote
Contributor

Looking at the requirements from the issue #196584 :

We'll need to ensure the Kibana nodes are still running tasks evenly so one doesn't poll frequently and run all the tasks while the other doesn't poll frequently and doesn't find many tasks to run
Poll interval is 3s whenever the task load doesn't need a 500ms poll interval

Will using task utilization take care of keeping Kibana nodes running tasks evenly? Is running evenly really required anyway?

I am not sure; I didn't know the best way to ensure that the nodes run the tasks evenly. I am thinking that because the tasks are partitioned, this would not be as big of an issue?

I think this will be hard to detect. Perhaps, in hindsight, it's something we leave out for now.

Did we consider any alternatives to using the task utilization? I was wondering if we could "read ahead" in the task claimer - usually we ask for tasks ready with run_at < ${now}, but we could ask for < ${now + 3s} instead. We wouldn't run any of those rules > ${now}, but we could use them as an indication of whether to delay the next task poll by 3s.

I saw some ideas in the issue, but I decided to use utilization right away. I am happy to change it if you think reading ahead is better.

I think anything simple would be good; it would seem complex if we have to modify the claiming query and funnel that to the managed configuration module to save some queries per minute. If we see issues with the current implementation, whether flapping or whatnot, we can look at alternatives like only polling less frequently if we didn't claim any tasks for X amount of time (e.g. 3s).
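A rough sketch of that last alternative (only relax the interval after a quiet period with no claimed tasks) might look like this; the class and the 3s quiet period are illustrative assumptions, not an implementation proposal:

```ts
// Hypothetical "quiet period" backoff: stay on the fast interval until no tasks
// have been claimed for QUIET_PERIOD_MS, then relax to the slow interval.
const QUIET_PERIOD_MS = 3000;

class PollBackoff {
  private lastClaimAt = Date.now();

  recordClaimResult(claimedCount: number, now = Date.now()): void {
    if (claimedCount > 0) {
      this.lastClaimAt = now; // any claimed task resets the quiet period
    }
  }

  nextIntervalMs(now = Date.now(), fastMs = 500, slowMs = 3000): number {
    return now - this.lastClaimAt >= QUIET_PERIOD_MS ? slowMs : fastMs;
  }
}

// Usage: a poll that claims nothing doesn't immediately slow things down;
// only a full 3s of empty polls does.
const backoff = new PollBackoff();
backoff.recordClaimResult(0);
console.log(backoff.nextIntervalMs()); // 500 now; would become 3000 after 3s of no claims
```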

@pmuellr
Member

pmuellr commented Dec 4, 2024

I think anything simple would be good; it would seem complex if we have to modify the claiming query and funnel that to the managed configuration module to save some queries per minute.

Ya, works for me for now. I suspect at some point we're going to have to do some live analysis of "upcoming tasks", and once we do that, we could feed that info into this calculation.

@pmuellr
Member

pmuellr commented Dec 4, 2024

I tested this again: 50 rules running at 1s intervals, capacity 50. Rules take ~0.5s to run (some kind of no-op query). Here are the executions from the event log, at 1s intervals:

(screenshot: rule execution timestamps from the event log)

It's kind of interesting that there are gaps at all, but I suspect the >= 3s ones are from when it flipped from 0.5s to 3s.
I also changed the message Poll interval configuration changing from ... from debug to warn, and added tmUtilization to the message.

That is actually one change I would like to see: that message should be a warn anyway, and then we don't need the message logged after it (Poll interval configuration is temporarily increased after ...). The utilization number should be in there too. We'll want this for diagnostic purposes; otherwise it may be difficult or impossible to figure out what the actual numbers were.

Here are the messages I saw:

14:57:00.396 configuration changing from 500 to 3000 after change in the task load: 22.
14:57:10.397 configuration changing from 3000 to 500 after change in the task load: 100.
14:57:19.174 configuration changing from 500 to 3000 after change in the task load: 0.
14:57:19.174 configuration changing from 500 to 3000 after change in the task load: 0.
14:57:29.274 configuration changing from 3000 to 500 after change in the task load: 100.
14:57:29.275 configuration changing from 3000 to 500 after change in the task load: 100.
14:57:30.628 configuration changing from 500 to 3000 after change in the task load: 0.
14:57:40.628 configuration changing from 3000 to 500 after change in the task load: 100.
14:57:59.315 configuration changing from 500 to 3000 after change in the task load: 20.
14:57:59.315 configuration changing from 500 to 3000 after change in the task load: 20.
14:58:00.762 configuration changing from 500 to 3000 after change in the task load: 0.
14:58:09.332 configuration changing from 3000 to 500 after change in the task load: 100.
14:58:09.333 configuration changing from 3000 to 500 after change in the task load: 100.
14:58:10.771 configuration changing from 3000 to 500 after change in the task load: 100.
14:58:41.144 configuration changing from 500 to 3000 after change in the task load: 4.
14:58:51.159 configuration changing from 3000 to 500 after change in the task load: 100.

We presumably have some internal race condition causing the multiple messages so close together at times.

Beyond that, it seems like we're flipping it too much, presumably because it's too sensitive to quick changes in the utilization.

I wonder if there's some way we could look at this via a "bigger" window, like checking TM utilization over the last minute, or something - I assume the window it's using is much smaller, or completely dynamic?

@mikecote Would it make sense to change the utilization number itself to a wider window, or are we sensitive on that number for other reasons and would maybe want a "smoother" version of it? Thinking of ye olde system load numbers, 1M, 5M, 15M averages (or whatever). Maybe we have "current", "last minute", or such. I suspect a smoother number will yield less flip flopping ...

If we think the level of flipping reported above is OK (maybe it is?), then I think I want the changes logged in the event log instead of the Kibana log. It's just going to be too noisy and an SDH generator if we're logging this much.
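To illustrate the smoothing idea above, here is a minimal sketch of feeding the decision a windowed average instead of the instantaneous utilization; the window size and the simple moving average are assumptions for illustration, not what task manager does today:

```ts
// Hypothetical smoothed utilization: brief dips no longer flip the poll interval
// because the decision sees an average over the last N samples.
class SmoothedUtilization {
  private samples: number[] = [];
  constructor(private readonly windowSize = 60) {} // e.g. ~1 minute of per-second samples

  record(utilization: number): void {
    this.samples.push(utilization);
    if (this.samples.length > this.windowSize) {
      this.samples.shift(); // drop the oldest sample once the window is full
    }
  }

  average(): number {
    if (this.samples.length === 0) return 0;
    return this.samples.reduce((sum, u) => sum + u, 0) / this.samples.length;
  }
}

// A single quiet sample followed by full load averages to 50, so it no longer
// crosses the 25% threshold and the interval stays fast.
const smoothed = new SmoothedUtilization();
smoothed.record(0);
smoothed.record(100);
console.log(smoothed.average() < 25); // false
```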

@mikecote
Contributor

mikecote commented Dec 9, 2024

@mikecote Would it make sense to change the utilization number itself to a wider window, or are we sensitive on that number for other reasons and would maybe want a "smoother" version of it? Thinking of ye olde system load numbers, 1M, 5M, 15M averages (or whatever). Maybe we have "current", "last minute", or such. I suspect a smoother number will yield less flip flopping ...

If we think the level of flipping reported above is OK (maybe it is?), then I think I want the changes logged in the event log instead of the Kibana log. It's just going to be too noisy and an SDH generator if we're logging this much.

Just getting to this message now, sorry for the delay. Let's discuss during today's weekly if we can!

@elasticmachine
Contributor

elasticmachine commented Jan 2, 2025

⏳ Build in-progress, with failures

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #31 / Alerting builtin alertTypes circuit_breakers index threshold rule that hits max alerts circuit breaker persist existing alerts to next execution if circuit breaker is hit
  • [job] [logs] x-pack/test/alerting_api_integration/spaces_only/tests/alerting/group1/config.ts / Alerting eventLog in space default should generate expected events for flapping alerts that settle on active where notifyWhen is NOT set to "on status change"
  • [job] [logs] x-pack/test/alerting_api_integration/spaces_only/tests/alerting/group1/config.ts / Alerting eventLog in space default should generate expected events for normal operation
  • [job] [logs] Jest Tests #16 / createManagedConfiguration() pollInterval configuration default claim strategy should log a warning when the configuration changes from the starting value
  • [job] [logs] Jest Tests #16 / createManagedConfiguration() pollInterval configuration mget claim strategy should increase configuration at the next interval when an error is emitted
  • [job] [logs] Jest Tests #16 / createManagedConfiguration() pollInterval configuration mget claim strategy should log a warning when the configuration changes from the starting value
  • [job] [logs] Jest Integration Tests #2 / managed configuration managed poll interval with default claim strategy should increase poll interval when Elasticsearch returns "cannot execute [inline] scripts" error
  • [job] [logs] Jest Integration Tests #2 / managed configuration managed poll interval with default claim strategy should increase poll interval when Elasticsearch returns 429 error
  • [job] [logs] Jest Integration Tests #2 / managed configuration managed poll interval with default claim strategy should increase poll interval when Elasticsearch returns a cluster_block_exception error
  • [job] [logs] Jest Integration Tests #2 / managed configuration managed poll interval with mget claim strategy should increase poll interval when Elasticsearch returns "cannot execute [inline] scripts" error
  • [job] [logs] Jest Integration Tests #2 / managed configuration managed poll interval with mget claim strategy should increase poll interval when Elasticsearch returns 429 error
  • [job] [logs] FTR Configs #96 / task_manager health should return basic configuration of task manager
