What is the bug?
I'm running a search job on a big batch file (900K records), so I'm using multisearch. The cluster has 3 data nodes and 3 master nodes.
I split the records into batches. The weird thing is that with batches of 5000 records the job takes around 200 seconds to process, and the AWS monitoring metrics show no apparent memory/CPU issue on any of the nodes.
However, if I use 10000 records per msearch call, something strange happens.
For a while the cluster is performing the search operations, and I can see active/queued threads on the thread pool API endpoint /_cat/thread_pool/search. However, after a certain point there are no more active/queued/rejected threads in the thread pool, yet the Python msearch call just hangs, and it hangs forever. I have to kill the Jupyter kernel to get out of it.
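(For reference, I was watching the pool through the client's cat API; in the line below, client is assumed to be the same opensearchpy client instance the job uses, as sketched further down.)
# Roughly the same output as GET /_cat/thread_pool/search?v, fetched through the client.
print(client.cat.thread_pool(thread_pool_patterns='search', v=True))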
How can one reproduce the bug?
I unfortunately can't share the data I'm using, and the data being searched is correlated with the number of records that makes the search hang.
In a nutshell, though, running the client's msearch call (sketched below) with a high volume of records makes the job crash, not on the OpenSearch side but on the Python client side.
It is important to note that the records query any of the 50 or so indices we have, so not all records in the msearch call go to the same index.
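For reference, the failing path looks roughly like this. This is a sketch rather than the exact job: chunks, build_query, records, endpoint and awsauth are placeholders for my own batching helper, query builder, input data, domain endpoint and SigV4 signer.
from opensearchpy import OpenSearch, RequestsHttpConnection

# Placeholder client setup for the AWS-managed domain; awsauth is a SigV4 signer object.
client = OpenSearch(
    hosts=[{'host': endpoint, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Each record targets one of the ~50 indices, so the msearch body alternates
# an {'index': ...} header line with the matching query line (build_query returns a query DSL dict).
for batch in chunks(records, 10000):        # batches of 5000 finish fine; 10000 hangs
    body = []
    for record in batch:
        body.append({'index': record['target_index']})
        body.append({'query': build_query(record)})
    responses = client.msearch(body=body)   # this call never returns on the larger batches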
However, using the requests library directly (with the aws-auth library for authentication) works perfectly:
# this works with no problem
resp = requests.post('https://' + endpoint + '/_msearch', data=msearch_query,
                     headers={'Content-Type': 'application/json'}, timeout=500)
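In full, the workaround is roughly the following sketch. I'm assuming the requests-aws4auth package for the SigV4 signer here; the credentials, region, endpoint and the two-query body are placeholders.
import requests
from requests_aws4auth import AWS4Auth

# Placeholders: real credentials, region and domain endpoint go here.
region = 'eu-west-1'
endpoint = 'my-domain.eu-west-1.es.amazonaws.com'
awsauth = AWS4Auth('ACCESS_KEY', 'SECRET_KEY', region, 'es')

# The msearch body is NDJSON: alternating {"index": ...} header lines and query lines,
# each terminated by a newline. The real job builds thousands of these pairs.
msearch_query = (
    '{"index": "index-a"}\n'
    '{"query": {"match_all": {}}}\n'
    '{"index": "index-b"}\n'
    '{"query": {"match_all": {}}}\n'
)

resp = requests.post('https://' + endpoint + '/_msearch',
                     data=msearch_query,
                     auth=awsauth,
                     headers={'Content-Type': 'application/json'},
                     timeout=500)
print(resp.status_code)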
What is the expected behavior?
The Python client should handle the request, or, if the return body from the multisearch operation is too big, raise an appropriate exception.
UPDATE: I realised that the issue still happens when using the requests library. I'm not sure why an msearch request would hang when the cluster is done with the actual search, but it is not an issue with this library.
In fact, sometimes the query succeeds but the return message is '{\n "message": "Request Timeout",\n}'. Curiously, the only queries that fail are those that take more than 300 seconds, which means this is probably related to some networking timeout setting I can't seem to find.
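For anyone else hitting this: the only client-side timeout settings I'm aware of in opensearchpy are sketched below (endpoint, awsauth and msearch_body are placeholders). Since the requests call above already used timeout=500, the 300-second cutoff does not seem to come from these.
from opensearchpy import OpenSearch, RequestsHttpConnection

# Default read timeout for every request, set when constructing the client.
client = OpenSearch(hosts=[{'host': endpoint, 'port': 443}], http_auth=awsauth,
                    use_ssl=True, connection_class=RequestsHttpConnection, timeout=600)

# The read timeout can also be raised for a single call (msearch_body is the NDJSON/list body as above).
responses = client.msearch(body=msearch_body, request_timeout=600)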
What is your host/environment?
opensearchpy 2.2.0
OS:
ProductName: macOS
ProductVersion: 14.0
BuildVersion: 23A344