[BUG][Search Backpressure] High Heap Usage Cancellation Due to High Node-Level CPU Utilization #13295
Comments
Assigning to @kaushalmahi12 due to his prior context with query sandboxing and search backpressure.
The backpressure works as follows, and here the heap_domination threshold is a mere 0.05 percent of the total JVM memory available to the process. The same flowchart is applicable for both. There are basically three trackers which can potentially cancel a task.
@kaushalmahi12 Agreed, and that is what the issue is trying to explain. I think we should check the duress condition for each tracker as well. For example: if the node is under heap duress, then only evaluate the task for heap-based cancellation.
That's right, @sohami. I think we should increase this threshold for the search workload JVM (or remove it altogether) and separate out the corresponding trackers.
Looks like this was fixed and released in 2.15. @kaushalmahi12, please re-open if that's not correct. Thanks!
Describe the bug
With the current search backpressure cancellation logic, we've noticed that some high CPU usage search requests, such as multi-term aggregations, can be cancelled because of task-level heap usage settings even though the node still has sufficient heap memory to process the tasks.
Related component
Search:Resiliency
To Reproduce
1. Use multi_term_agg in the http_logs workload. It's often referred to as a high CPU usage search request.
2. Run the multi_term_agg operation in the http_logs workload and gradually increase the number of search clients using the below sample command.
3. Check the GET _nodes/stats/search_backpressure REST API.
Expected behavior
We need to adjust the current search backpressure cancellation logic to cancel tasks based on measurements of node-level resources. For example, if a node is under duress due to high CPU utilization, we should only consider canceling tasks based on CPU settings, rather than heap or elapsed time settings at the task level.
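The proposed behavior can be illustrated with a minimal sketch. It is not the OpenSearch implementation: the `Task` class, the threshold values, and every function name below are hypothetical and exist only to show the idea of evaluating a task-level tracker only when the matching node-level resource is under duress. The elapsed-time tracker is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Task:
    task_id: str
    cpu_time_ms: float  # CPU time consumed by the task so far
    heap_bytes: int     # heap currently attributed to the task


# Hypothetical node-level duress thresholds, as fractions of capacity.
NODE_DURESS_THRESHOLDS: Dict[str, float] = {"cpu": 0.9, "heap": 0.7}

# Hypothetical task-level limits, one per tracker.
TASK_LIMITS: Dict[str, float] = {"cpu": 15_000.0, "heap": 50 * 1024 * 1024}


def resources_in_duress(node_usage: Dict[str, float]) -> Set[str]:
    """Return the node-level resources whose usage meets the duress threshold."""
    return {
        resource
        for resource, threshold in NODE_DURESS_THRESHOLDS.items()
        if node_usage.get(resource, 0.0) >= threshold
    }


def task_usage(task: Task, resource: str) -> float:
    """Map a tracker name to the task's usage of that resource."""
    return task.cpu_time_ms if resource == "cpu" else float(task.heap_bytes)


def cancellation_candidates(
    tasks: List[Task], node_usage: Dict[str, float]
) -> List[Tuple[str, str]]:
    """Evaluate each task only against trackers whose node resource is in duress."""
    in_duress = resources_in_duress(node_usage)
    candidates = []
    for task in tasks:
        for resource in sorted(in_duress):
            if task_usage(task, resource) >= TASK_LIMITS[resource]:
                candidates.append((task.task_id, resource))
    return candidates


tasks = [
    Task("heap-heavy", cpu_time_ms=1_000.0, heap_bytes=200 * 1024 * 1024),
    Task("cpu-heavy", cpu_time_ms=60_000.0, heap_bytes=1024 * 1024),
]

# Node is under CPU duress only, so the heap-heavy task is not a candidate.
print(cancellation_candidates(tasks, {"cpu": 0.95, "heap": 0.30}))
# prints [('cpu-heavy', 'cpu')]
```

With this gating, a node under CPU duress alone never cancels a task for exceeding its heap limit, which is the case described in this issue.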
Additional Details
Host/Environment (please complete the following information):