[ResponseOps][mget] Poll for tasks less frequently when the task load doesn't need it #200260
base: main
Conversation
Pinging @elastic/response-ops (Team:ResponseOps)
Verified locally, works as expected 👍
I ran with this PR and created 10 rules running at 1s intervals. I seemed to get 4-6 of the following messages per minute: [messages screenshot]. When I doubled it to 20 rules, it seemed like it was about the same volume. I didn't look to see what's going on there, but I wonder if there's some race condition going on. It also seems confusing that we don't generate a message when we switch back to "fast" mode. Looking at the requirements from issue #196584:
Will utilizing task utilization take care of keeping Kibana nodes running tasks evenly? Is running evenly really required anyway? Did we consider any alternatives to using the task utilization? I was wondering if we could "read ahead" in the task claimer - usually we ask for tasks ready with …
Thanks for taking a look! I can take a look to double check.
I will add this
I am not sure; I didn't know the best way to ensure that the nodes run the tasks evenly. I am thinking that because the tasks are partitioned, this would not be as big of an issue?
I saw some ideas in the issue, but I decided to use utilization right away. I am happy to change it if you think reading ahead is better.
I think this will be hard to detect. Perhaps, in hindsight, it's something we leave out for now.
I think anything simple would be good; it would seem complex if we have to modify the claiming query and funnel that into the managed configuration module just to save some queries per minute. If we see issues with the current implementation, whether flapping or whatnot, we can look at alternatives, like only polling less frequently if we didn't claim any tasks for X amount of time (e.g. 3s).
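As a rough sketch of that alternative (hypothetical names and thresholds, not code from this PR), the claim cycle could track when it last claimed anything and only fall back to the slower interval after an idle period:

```ts
// Hypothetical sketch: back off the poll interval only after an idle period
// with no claimed tasks, rather than reacting to instantaneous utilization.
const FAST_INTERVAL_MS = 500;
const SLOW_INTERVAL_MS = 3000;
const IDLE_THRESHOLD_MS = 3000; // e.g. the 3s mentioned above

let lastClaimAt = Date.now();

// Called after each claim cycle with the number of tasks claimed;
// returns the interval to use for the next poll.
function nextIntervalAfterClaim(tasksClaimed: number, now = Date.now()): number {
  if (tasksClaimed > 0) {
    lastClaimAt = now;
  }
  return now - lastClaimAt >= IDLE_THRESHOLD_MS ? SLOW_INTERVAL_MS : FAST_INTERVAL_MS;
}
```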
Ya, works for me for now. I suspect at some point we're going to have to do some live analysis of "upcoming tasks", and once we do that, we could feed that info into this calculation.
I tested this again: 50 rules running at 1s intervals, capacity 50. Rules take ~0.5s to run (some kind of no-op query). Here are the executions from the event log, at 1s intervals: [screenshot]. Kind of interesting that there are gaps at all, but I suspect the >= 3s ones are from when it flipped from 0.5s to 3s. That is actually one change I would like to see - that message should be a … Here are the messages I saw: [screenshot]
We presumably have some internal race condition causing the multiple messages so close together at times. Beyond that, it seems like we're flipping it too much, presumably because it's too sensitive to quick changes in the utilization. I wonder if there's some way we could look at this via a "bigger" window, like checking TM utilization over the last minute or something - I assume the window it's using is much smaller, or completely dynamic? @mikecote Would it make sense to change the utilization number itself to a wider window, or are we sensitive on that number for other reasons and would maybe want a "smoother" version of it? Thinking of ye olde system load numbers: 1m, 5m, 15m averages (or whatever). Maybe we have "current", "last minute", or such. I suspect a smoother number will yield less flip-flopping. If we think the level of flipping reported above is ok - maybe it is? - then I think I want the changes logged in the event log instead of the Kibana log. It's just going to be too noisy and an SDH generator if we're logging this much.
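To illustrate the "wider window" idea (purely a sketch with made-up names, not anything in this PR), utilization samples could be averaged over the last minute before being compared to the 25% threshold, which should damp the flip-flopping described above:

```ts
// Hypothetical rolling-window average for task manager utilization.
class UtilizationWindow {
  private samples: Array<{ at: number; value: number }> = [];

  constructor(private readonly windowMs = 60_000) {}

  record(value: number, at = Date.now()): void {
    this.samples.push({ at, value });
    // Keep only samples that fall inside the window.
    this.samples = this.samples.filter((s) => at - s.at <= this.windowMs);
  }

  average(): number {
    if (this.samples.length === 0) {
      return 0;
    }
    return this.samples.reduce((sum, s) => sum + s.value, 0) / this.samples.length;
  }
}

// Compare the smoothed value, not the instantaneous one, to the threshold.
const utilizationWindow = new UtilizationWindow();
utilizationWindow.record(10);
utilizationWindow.record(40);
const pollSlowly = utilizationWindow.average() < 25;
```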
Just getting to this message now, sorry for the delay. Let's discuss during today's weekly if we can!
Resolves #196584
Summary
This PR updates the task poll interval logic for projects using the mget strategy to optimize request loads to Elasticsearch, particularly for smaller projects with low utilization. When task manager (TM) utilization is below 25%, the poll interval will be set to 3 seconds instead of the current 500 milliseconds. This change does not affect projects utilizing `update_by_query`.

The existing backpressure logic remains unchanged for handling errors. The only adjustment occurs in scenarios where there are no errors, the TM utilization is below 25%, and the poll interval is less than 3 seconds. In such cases, the poll interval will increase to 3 seconds, even if the backpressure logic has not fully reset the interval to its original value.
I just chose 25%, but I am definitely open to other ideas.
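For reference, a minimal sketch of the decision described above (hypothetical names; it does not reproduce the actual managed configuration / poll interval code):

```ts
// Hypothetical sketch of the poll interval selection described in this PR.
const DEFAULT_POLL_INTERVAL_MS = 500;
const SLOW_POLL_INTERVAL_MS = 3000;
const UTILIZATION_THRESHOLD_PCT = 25; // the value chosen in this PR

function nextPollInterval(opts: {
  errorsSeen: boolean; // error handling stays with the existing backpressure logic
  utilization: number; // task manager utilization, 0-100
  currentIntervalMs: number;
}): number {
  if (opts.errorsSeen) {
    // Unchanged: backpressure logic decides the interval when there are errors.
    return opts.currentIntervalMs;
  }
  if (
    opts.utilization < UTILIZATION_THRESHOLD_PCT &&
    opts.currentIntervalMs < SLOW_POLL_INTERVAL_MS
  ) {
    // Low load: poll less frequently, even if backpressure hasn't fully reset.
    return SLOW_POLL_INTERVAL_MS;
  }
  return DEFAULT_POLL_INTERVAL_MS;
}
```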
Checklist
To verify
1. With no rules running, hit http://localhost:5601/api/task_manager/_health and verify the poll interval is 3s.
2. Create some rules, then hit http://localhost:5601/api/task_manager/_health again to verify that with rules running the poll interval is back to 500ms. (It may take a couple of refreshes for the health API to reflect the changes.)
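To script the check instead of refreshing the page, something along these lines could work (a sketch assuming Node 18+, a local Kibana with elastic/changeme basic auth, and that the value is reported at stats.configuration.value.poll_interval; verify the exact path against your health API response):

```ts
// Quick check of the Task Manager health API; run as an ES module (top-level await).
const auth = Buffer.from('elastic:changeme').toString('base64');
const res = await fetch('http://localhost:5601/api/task_manager/_health', {
  headers: { Authorization: `Basic ${auth}` },
});
const health = await res.json();
// Expecting 3000 with no rules running, 500 once rules keep utilization above 25%.
console.log('poll_interval:', health?.stats?.configuration?.value?.poll_interval);
```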