[ResponseOps][mget] Poll for tasks less frequently when the task load doesn't need it #200260

Open
wants to merge 20 commits into main
Conversation

doakalexi
Contributor

@doakalexi doakalexi commented Nov 14, 2024

Resolves #196584

Summary

This PR updates the task poll interval logic for projects using the mget strategy to optimize request loads to Elasticsearch, particularly for smaller projects with low utilization. When task manager (TM) utilization is below 25%, the poll interval will be set to 3 seconds instead of the current 500 milliseconds. This change does not affect projects utilizing update_by_query.

The existing backpressure logic remains unchanged for handling errors. The only adjustment occurs in scenarios where there are no errors, the TM utilization is below 25%, and the poll interval is less than 3 seconds. In such cases, the poll interval will increase to 3 seconds, even if the backpressure logic has not fully reset the interval to its original value.

I just chose 25%, but I am definitely open to other ideas.
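For anyone skimming, a minimal sketch of the decision described above might look like the following. The names (computeNextPollInterval, PollState) and the exact condition for dropping back to the fast interval are illustrative assumptions, not the actual task manager code:

```ts
// Hypothetical sketch of the poll-interval adjustment described in this PR.
// The constants mirror the numbers above; everything else is illustrative.
const FAST_INTERVAL_MS = 500; // current mget poll interval
const SLOW_INTERVAL_MS = 3000; // relaxed interval for low load
const LOW_UTILIZATION_THRESHOLD = 25; // percent; the value picked in this PR

interface PollState {
  currentIntervalMs: number;
  errorSeen: boolean; // errors stay on the existing backpressure path
  tmUtilization: number; // task manager utilization, 0-100
}

function computeNextPollInterval(state: PollState): number {
  if (state.errorSeen) {
    // Backpressure logic is unchanged; this sketch leaves the interval alone.
    return state.currentIntervalMs;
  }
  if (state.tmUtilization < LOW_UTILIZATION_THRESHOLD && state.currentIntervalMs < SLOW_INTERVAL_MS) {
    // Low load: relax to 3s even if backpressure hasn't fully reset yet.
    return SLOW_INTERVAL_MS;
  }
  if (state.tmUtilization >= LOW_UTILIZATION_THRESHOLD) {
    // Load picked back up: return to the fast interval.
    return FAST_INTERVAL_MS;
  }
  return state.currentIntervalMs;
}

// Example: a quiet deployment at 10% utilization relaxes from 500ms to 3000ms.
console.log(computeNextPollInterval({ currentIntervalMs: 500, errorSeen: false, tmUtilization: 10 })); // 3000
```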

Checklist

To verify

  • Start Kibana, go to http://localhost:5601/api/task_manager/_health, and verify the poll interval is 3s
  • Create some alerting rules scheduled to run every second and let them run. I created 4 rules.
  • Check http://localhost:5601/api/task_manager/_health again to verify that, with rules running, the poll interval is back to 500ms. (It may take a couple of refreshes for the health API to reflect the change; see the sketch after this list for a quick way to check.)
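For convenience, something like the following can print the effective interval while testing. The credentials and the JSON path (stats.configuration.value.poll_interval) are assumptions about a local dev setup and may differ between versions:

```ts
// Quick, hypothetical helper for the verification steps above (Node 18+ for global fetch).
// Assumes a local dev Kibana with basic auth elastic/changeme and that the health
// payload exposes the interval at stats.configuration.value.poll_interval.
async function printPollInterval(): Promise<void> {
  const res = await fetch('http://localhost:5601/api/task_manager/_health', {
    headers: { Authorization: 'Basic ' + Buffer.from('elastic:changeme').toString('base64') },
  });
  const health: any = await res.json();
  console.log('poll_interval:', health?.stats?.configuration?.value?.poll_interval);
}

printPollInterval().catch(console.error);
```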

@doakalexi doakalexi changed the title from "Updating task manager to increase poll interval when utilization is low" to "[ResponseOps][mget] Poll for tasks less frequently when the task load doesn't need it" Nov 14, 2024
@doakalexi doakalexi added the Team:ResponseOps, release_note:skip, backport:prev-minor, v8.17.0, and v9.0.0 labels Nov 21, 2024
@doakalexi doakalexi marked this pull request as ready for review November 21, 2024 17:27
@doakalexi doakalexi requested a review from a team as a code owner November 21, 2024 17:27
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

Contributor

@js-jankisalvi js-jankisalvi left a comment


Verified locally, works as expected 👍

@doakalexi doakalexi added v8.18.0 and removed v8.17.0 labels Nov 25, 2024
@pmuellr
Member

pmuellr commented Nov 26, 2024

I ran with this PR and created 10 rules running 1/s. I seemed to get 4-6 of the following messages per minute: Poll interval configuration is temporarily increased after a decrease in the task load. Sometimes 3 times within one second.

When I doubled it to 20 rules, it seemed like it was about the same volume.

I didn't look to see what's going on there, but I'm wondering if there's some race condition going on.

It also seems confusing that we don't generate a message when we switch back to "fast" mode.

Looking at the requirements from the issue #196584 :

We'll need to ensure the Kibana nodes are still running tasks evenly so one doesn't poll frequently and run all the tasks while the other doesn't poll frequently and doesn't find many tasks to run
Poll interval is 3s whenever the task load doesn't need a 500ms poll interval

Will using task utilization take care of keeping Kibana nodes running tasks evenly? Is running evenly really required anyway?

Did we consider any alternatives to using the task utilization? I was wondering if we could "read ahead" in the task claimer - usually we ask for tasks ready with run_at < ${now}, but we could ask for < ${now + 3s} instead. We wouldn't run any of those rules > ${now}, but we could use them as an indication of whether to delay the next task poll by 3s.
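To make the read-ahead idea concrete, here is a rough sketch; it is not the actual claiming query, and the task.runAt field path and the 3s constant are assumptions for illustration:

```ts
// Sketch of the "read ahead" alternative: widen the claim window by 3s and use
// the result purely as a signal for the next poll delay. Illustrative only.
const READ_AHEAD_MS = 3000;

// Query filter equivalent to run_at <= now + 3s instead of run_at <= now.
function buildReadAheadRangeFilter(now: Date) {
  return {
    range: {
      'task.runAt': { lte: new Date(now.getTime() + READ_AHEAD_MS).toISOString() },
    },
  };
}

// Tasks due now still run as usual; tasks that only matched because of the extra
// window are not run, they just tell us the fast interval is still worth it.
function nextPollDelayMs(matchedCount: number, fastMs = 500, slowMs = 3000): number {
  return matchedCount > 0 ? fastMs : slowMs;
}

console.log(buildReadAheadRangeFilter(new Date()));
console.log(nextPollDelayMs(0)); // 3000: nothing due within the next 3s
console.log(nextPollDelayMs(5)); // 500: keep polling quickly
```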

@doakalexi
Contributor Author

I didn't look to see what's going on there, but I'm wondering if there's some race condition going on.

Thanks for taking a look! I'll double-check what's going on there.

It also seems confusing that we don't generate a message when we switch back to "fast" mode.

I will add this.

Looking at the requirements from the issue #196584 :

We'll need to ensure the Kibana nodes are still running tasks evenly so one doesn't poll frequently and run all the tasks while the other doesn't poll frequently and doesn't find many tasks to run
Poll interval is 3s whenever the task load doesn't need a 500ms poll interval

Will using task utilization take care of keeping Kibana nodes running tasks evenly? Is running evenly really required anyway?

I am not sure; I didn't know the best way to ensure that the nodes run the tasks evenly. I am thinking that because the tasks are partitioned, this would not be as big of an issue?

Did we consider any alternatives to using the task utilization? I was wondering if we could "read ahead" in the task claimer - usually we ask for tasks ready with run_at < ${now}, but we could ask for < ${now + 3s} instead. We wouldn't run any of those rules > ${now}, but we could use them as an indication of whether to delay the next task poll by 3s.

I saw some ideas in the issue, but I decided to use utilization right away. I am happy to change it if you think reading ahead is better.

@mikecote
Contributor

Looking at the requirements from the issue #196584 :

We'll need to ensure the Kibana nodes are still running tasks evenly so one doesn't poll frequently and run all the tasks while the other doesn't poll frequently and doesn't find many tasks to run
Poll interval is 3s whenever the task load doesn't need a 500ms poll interval

Will using task utilization take care of keeping Kibana nodes running tasks evenly? Is running evenly really required anyway?

I am not sure; I didn't know the best way to ensure that the nodes run the tasks evenly. I am thinking that because the tasks are partitioned, this would not be as big of an issue?

I think this will be hard to detect. Perhaps, in hindsight, it's something we leave out for now.

Did we consider any alternatives to using the task utilization? I was wondering if we could "read ahead" in the task claimer - usually we ask for tasks ready with run_at < ${now}, but we could ask for < ${now + 3s} instead. We wouldn't run any of those rules > ${now}, but we could use them as an indication of whether to delay the next task poll by 3s.

I saw some ideas in the issue, but I decided to use utilization right away. I am happy to change it if you think reading ahead is better.

I think anything simple would be good; it would seem complex if we have to modify the claiming query and funnel that to the managed configuration module to save some queries per minute. If we see issues with the current implementation, whether flapping or whatnot, we can look at alternatives like only polling less frequently if we didn't claim any tasks for X amount of time (e.g. 3s).
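A rough sketch of that last alternative (only relax the interval after a quiet period with no claimed tasks) might look like this; the class and the 3s quiet period are illustrative assumptions, not an implementation proposal:

```ts
// Hypothetical "quiet period" backoff: stay on the fast interval until no tasks
// have been claimed for QUIET_PERIOD_MS, then relax to the slow interval.
const QUIET_PERIOD_MS = 3000;

class PollBackoff {
  private lastClaimAt = Date.now();

  recordClaimResult(claimedCount: number, now = Date.now()): void {
    if (claimedCount > 0) {
      this.lastClaimAt = now; // any claimed task resets the quiet period
    }
  }

  nextIntervalMs(now = Date.now(), fastMs = 500, slowMs = 3000): number {
    return now - this.lastClaimAt >= QUIET_PERIOD_MS ? slowMs : fastMs;
  }
}

// Usage: a poll that claims nothing doesn't immediately slow things down;
// only a full 3s of empty polls does.
const backoff = new PollBackoff();
backoff.recordClaimResult(0);
console.log(backoff.nextIntervalMs()); // 500 now; would become 3000 after 3s of no claims
```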

@pmuellr
Member

pmuellr commented Dec 4, 2024

I think anything simple would be good; it would seem complex if we have to modify the claiming query and funnel that to the managed configuration module to save some queries per minute.

Ya, works for me for now. I suspect at some point we're going to have to do some live analysis of "upcoming tasks", and once we do that, we could feed that info into this calculation.

@pmuellr
Member

pmuellr commented Dec 4, 2024

I tested this again: 50 rules running at 1s intervals, capacity 50. Rules take ~0.5s to run (some kind of no-op query). Here are the executions from the event log, at 1s intervals:

(screenshot: rule execution timestamps from the event log)

It's kind of interesting that there are gaps at all, but I suspect the >= 3s ones are from when it flipped from 0.5s to 3s.
I also changed the message Poll interval configuration changing from ... from debug to warn, and added tmUtilization to the message.

That is actually one change I would like to see: that message should be a warn anyway, and then we don't need the message logged after it (Poll interval configuration is temporarily increased after ...). The utilization number should be in there too. We'll want this for diagnostic purposes; otherwise it may be difficult or impossible to figure out what the actual numbers were.

Here are the messages I saw:

14:57:00.396 configuration changing from 500 to 3000 after change in the task load: 22.
14:57:10.397 configuration changing from 3000 to 500 after change in the task load: 100.
14:57:19.174 configuration changing from 500 to 3000 after change in the task load: 0.
14:57:19.174 configuration changing from 500 to 3000 after change in the task load: 0.
14:57:29.274 configuration changing from 3000 to 500 after change in the task load: 100.
14:57:29.275 configuration changing from 3000 to 500 after change in the task load: 100.
14:57:30.628 configuration changing from 500 to 3000 after change in the task load: 0.
14:57:40.628 configuration changing from 3000 to 500 after change in the task load: 100.
14:57:59.315 configuration changing from 500 to 3000 after change in the task load: 20.
14:57:59.315 configuration changing from 500 to 3000 after change in the task load: 20.
14:58:00.762 configuration changing from 500 to 3000 after change in the task load: 0.
14:58:09.332 configuration changing from 3000 to 500 after change in the task load: 100.
14:58:09.333 configuration changing from 3000 to 500 after change in the task load: 100.
14:58:10.771 configuration changing from 3000 to 500 after change in the task load: 100.
14:58:41.144 configuration changing from 500 to 3000 after change in the task load: 4.
14:58:51.159 configuration changing from 3000 to 500 after change in the task load: 100.

We presumably have some internal race condition causing the multiple messages so close together at times.

Beyond that, it seems like we're flipping it too much, presumably because it's too sensitive to quick changes in the utilization.

I wonder if there's some way we could look at this via a "bigger" window, like checking TM utilization over the last minute, or something - I assume the window it's using is much smaller, or completely dynamic?

@mikecote Would it make sense to change the utilization number itself to a wider window, or are we sensitive on that number for other reasons and would maybe want a "smoother" version of it? Thinking of ye olde system load numbers, 1M, 5M, 15M averages (or whatever). Maybe we have "current", "last minute", or such. I suspect a smoother number will yield less flip flopping ...

If we think the level of flipping reported above is OK (maybe it is?), then I think I want the changes logged in the event log instead of the Kibana log. It's just going to be too noisy and an SDH generator if we're logging this much.
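To illustrate the smoothing idea above, here is a minimal sketch of feeding the decision a windowed average instead of the instantaneous utilization; the window size and the simple moving average are assumptions for illustration, not what task manager does today:

```ts
// Hypothetical smoothed utilization: brief dips no longer flip the poll interval
// because the decision sees an average over the last N samples.
class SmoothedUtilization {
  private samples: number[] = [];
  constructor(private readonly windowSize = 60) {} // e.g. ~1 minute of per-second samples

  record(utilization: number): void {
    this.samples.push(utilization);
    if (this.samples.length > this.windowSize) {
      this.samples.shift(); // drop the oldest sample once the window is full
    }
  }

  average(): number {
    if (this.samples.length === 0) return 0;
    return this.samples.reduce((sum, u) => sum + u, 0) / this.samples.length;
  }
}

// A single quiet sample followed by full load averages to 50, so it no longer
// crosses the 25% threshold and the interval stays fast.
const smoothed = new SmoothedUtilization();
smoothed.record(0);
smoothed.record(100);
console.log(smoothed.average() < 25); // false
```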

@mikecote
Contributor

mikecote commented Dec 9, 2024

@mikecote Would it make sense to change the utilization number itself to a wider window, or are we sensitive on that number for other reasons and would maybe want a "smoother" version of it? Thinking of ye olde system load numbers, 1M, 5M, 15M averages (or whatever). Maybe we have "current", "last minute", or such. I suspect a smoother number will yield less flip flopping ...

If we think the level of flipping reported above is OK (maybe it is?), then I think I want the changes logged in the event log instead of the Kibana log. It's just going to be too noisy and an SDH generator if we're logging this much.

Just getting to this message now, sorry for the delay. Let's discuss during today's weekly if we can!

@elasticmachine
Contributor

elasticmachine commented Jan 2, 2025

⏳ Build in-progress, with failures

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #31 / Alerting builtin alertTypes circuit_breakers index threshold rule that hits max alerts circuit breaker persist existing alerts to next execution if circuit breaker is hit
  • [job] [logs] x-pack/test/alerting_api_integration/spaces_only/tests/alerting/group1/config.ts / Alerting eventLog in space default should generate expected events for flapping alerts that settle on active where notifyWhen is NOT set to "on status change"
  • [job] [logs] x-pack/test/alerting_api_integration/spaces_only/tests/alerting/group1/config.ts / Alerting eventLog in space default should generate expected events for normal operation
  • [job] [logs] Jest Tests #16 / createManagedConfiguration() pollInterval configuration default claim strategy should log a warning when the configuration changes from the starting value
  • [job] [logs] Jest Tests #16 / createManagedConfiguration() pollInterval configuration mget claim strategy should increase configuration at the next interval when an error is emitted
  • [job] [logs] Jest Tests #16 / createManagedConfiguration() pollInterval configuration mget claim strategy should log a warning when the configuration changes from the starting value
  • [job] [logs] Jest Integration Tests #2 / managed configuration managed poll interval with default claim strategy should increase poll interval when Elasticsearch returns "cannot execute [inline] scripts" error
  • [job] [logs] Jest Integration Tests #2 / managed configuration managed poll interval with default claim strategy should increase poll interval when Elasticsearch returns 429 error
  • [job] [logs] Jest Integration Tests #2 / managed configuration managed poll interval with default claim strategy should increase poll interval when Elasticsearch returns a cluster_block_exception error
  • [job] [logs] Jest Integration Tests #2 / managed configuration managed poll interval with mget claim strategy should increase poll interval when Elasticsearch returns "cannot execute [inline] scripts" error
  • [job] [logs] Jest Integration Tests #2 / managed configuration managed poll interval with mget claim strategy should increase poll interval when Elasticsearch returns 429 error
  • [job] [logs] FTR Configs #96 / task_manager health should return basic configuration of task manager
