[BUG] Deeply nested aggregations are not terminable by any mechanism and cause Out of Memory errors in data nodes. #15413
Comments
Search Meetup Triage: @jainankitk / @sgup432 Do you have some context on this? @Pigueiras2 Did you also try specifying total bucket counts (lower than the defaults) as well?
Do you mean changing
@Pigueiras / @Pigueiras2
I sent the query to my cluster at Sat Aug 31 11:45:57 PM CEST 2024, and one of the nodes crashed at 23:48:55,614 and the other one at 23:49:00,778. This is the heap usage that was reported, and these are the logs of one of the data nodes before crashing (I see zero entries about "o.o.s.b.SearchBackpressureService" in the cluster logs):
These are the current settings of the cluster (in case you see anything wrong with them):
This is what was reported by the
If you want me to add any extra information or test anything else, let me know.
This is odd. There might be node stats for search backpressure that you can also share.
The output would be something like this for both search (coordinator) and shard tasks...
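For reference, a minimal way to pull those stats (the host is a placeholder; the response shape is only summarized in the comments, not reproduced verbatim):

```sh
# Fetch search backpressure stats from every node; host is a placeholder.
curl -s "http://localhost:9200/_nodes/stats/search_backpressure?pretty"

# Per node, the response carries a "search_backpressure" section with separate
# blocks for coordinator tasks ("search_task") and shard tasks
# ("search_shard_task"), each holding resource tracker and cancellation counters.
```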
One possibility is that the task cancellation itself is self-throttling, and you will need to tinker with those values to avoid throttling (ref: https://opensearch.org/docs/2.15/tuning-your-cluster/availability-and-recovery/search-backpressure/ ). Could you capture and share the histogram dumps? Multiple samples will reveal which objects are rapidly growing in count and hogging the memory. The failed allocations at the time of the OOME are more in line with the available heap being exhausted and are not the cause. Also, I'm assuming you are not running any Painless scripts.
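As an illustration of that tinkering (the values and the choice of shard-task settings are assumptions, not recommendations), the cancellation throttling limits can be raised through the cluster settings API:

```sh
# Loosen cancellation throttling for shard tasks (illustrative values only).
curl -s -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d '{
  "persistent": {
    "search_backpressure.search_shard_task.cancellation_rate": 0.01,
    "search_backpressure.search_shard_task.cancellation_ratio": 0.2,
    "search_backpressure.search_shard_task.cancellation_burst": 15.0
  }
}'
```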
First of all, thanks a lot for taking the time to answer. It's really appreciated 😄
Yes, I’m plotting
Does the way search backpressure cancels a task differ from me calling
About tinkering with the backpressure values, I've also tried:
And I don't even see the message about the search backpressure service trying to kill a task (the node should be under duress with those values for about 30/40 seconds and the task should be killed either for time or heap usage).
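For context, a sketch of how the duress and per-task thresholds can be lowered; the values here are purely illustrative and are not the ones tried above:

```sh
# Make nodes enter duress sooner and flag long/heavy shard tasks earlier
# (illustrative values; not the configuration actually used here).
curl -s -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d '{
  "persistent": {
    "search_backpressure.mode": "enforced",
    "search_backpressure.node_duress.heap_threshold": 0.5,
    "search_backpressure.search_shard_task.elapsed_time_millis_threshold": 10000,
    "search_backpressure.search_shard_task.heap_percent_threshold": 0.002
  }
}'
```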
This one is right before crashing. Does it provide the information you were looking for?
No, the only thing running in the cluster is this query, which is basically a nested terms aggregation (with huge sizes) + a date histogram.
@Pigueiras, we'll need to dig deeper. This is a good start, but we'll need multiple histogram dumps at regular intervals to see what is growing rapidly.
@kkhatua I created this repository with a dump of the histogram and hot_threads approximately every second, from the beginning of a query until the OOME. The files are in HHMMSS.sss format, and the OOME occurs at 22:13:44.
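A capture loop along these lines (PID, host, and file names are placeholders; this is a sketch rather than the exact commands used) produces per-second class histograms and hot-threads snapshots with millisecond timestamps:

```sh
# Dump a class histogram and hot threads roughly once per second until stopped.
# Requires GNU date for %N; OS_PID and the host are placeholders.
OS_PID=12345
while true; do
  ts=$(date +%H%M%S.%3N)
  jmap -histo "$OS_PID" > "histo-${ts}.txt"
  curl -s "http://localhost:9200/_nodes/hot_threads" > "hot_threads-${ts}.txt"
  sleep 1
done
```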
This will take some time, @Pigueiras
From the few usable
Still unclear why the Search Backpressure (SBP) module isn't detecting this from the time it sees the allocations climb rapidly from
I believe you've already set this
No problem, I completely understand that I have only one issue and you have to handle many. Don't feel obligated to answer as quickly as I do 👍
I have tried with
So the nodes are always considered under duress, yet I cannot see any logs about backpressure. The only ones that appear related to SBP are the ones logged when the data node starts and changes the default values:
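One way to confirm which search backpressure settings are actually in effect (host is a placeholder) is to dump the flattened cluster settings, defaults included, and filter for the SBP keys:

```sh
# List every effective search_backpressure setting, including defaults.
curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
  | grep search_backpressure
```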
@kkhatua Did you have any time to look at this since my last comment? 👼
Describe the bug
We have a cluster with 12 data nodes and 31 GB reserved for the JVM. We were experiencing sporadic Out of Memory errors and managed to isolate the issue to some dashboards that were using nested aggregations with arbitrarily large sizes. We tried different approaches to terminate these client searches before they could crash some of the nodes in the cluster, but none of them worked (as described below).
The query running behind the scenes in Grafana/Dashboards was something similar to:
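Purely as an illustration of that shape (a date histogram wrapping nested terms aggregations with very large size values), a request along these lines is a sketch; all index and field names are hypothetical and are not taken from the actual dashboard:

```sh
# Illustrative only: hypothetical index/field names, deliberately huge sizes.
curl -s -X POST "http://localhost:9200/my-index-*/_search" \
  -H 'Content-Type: application/json' -d '{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-7d" } } },
  "aggs": {
    "per_hour": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" },
      "aggs": {
        "by_host": {
          "terms": { "field": "host.keyword", "size": 100000 },
          "aggs": {
            "by_user": { "terms": { "field": "user.keyword", "size": 100000 } }
          }
        }
      }
    }
  }
}'
```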
We tried the following settings in our cluster (a sketch of the corresponding calls follows this list):

- `default_search_timeout` and `cancel_after_time_interval` don't have any effect. You can see this in the task monitoring: for example, the query runs for 2-3 minutes before crashing the data nodes.
- If you try to kill the tasks manually with `_tasks/node:task/_cancel`, the cluster simply ignores it.
- Circuit breaker settings (`indices.breaker.request.limit`, `indices.breaker.request.overhead`, ...) are designed to prevent out-of-memory errors by estimating the memory usage of requests. However, OpenSearch does not appear to take these aggregations into account when estimating memory usage in advance, so the query is accepted even though it eventually consumes a lot of memory.
- Backpressure is triggered, but it never actually kills the problematic query. The message about "heap usage not dominated by search requests" makes me think that aggregations follow a completely different workflow in memory usage tracking in OpenSearch, which is why they are not handled by the circuit breakers or backpressure mechanisms.
- `max_buckets` doesn't seem to have an effect because it is only triggered in the reduce phase; you only hit the limit if the aggregation "size" is reasonable enough that OpenSearch can actually compute the query…

We've run out of ideas, so please let us know if there's something really missing from OpenSearch or if you have any other suggestions to try. We would appreciate it! 😄
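A rough sketch of the calls referred to in the list above (host, task ID, and values are placeholders, not the exact limits used on this cluster):

```sh
# Timeouts, bucket limit, and circuit breaker limit (illustrative values).
curl -s -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d '{
  "persistent": {
    "search.default_search_timeout": "30s",
    "search.cancel_after_time_interval": "30s",
    "search.max_buckets": 65535,
    "indices.breaker.request.limit": "40%"
  }
}'

# Manual cancellation of a running search task; the node:task ID is a placeholder.
curl -s -X POST "http://localhost:9200/_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel"
```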
Related component
Search:Aggregations
To Reproduce
Expected behavior
If `cancel_query_after_time` worked, it would be very useful. If a query takes more than 30 seconds, something is likely wrong. A query of this type was running for more than 2 minutes before it killed some data nodes in the cluster.

Additional Details
Plugins
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-skills
opensearch-sql
repository-s3
Host/Environment: