
[BUG] msearch hangs when dealing with a high number of records. #517

Closed
manugarri opened this issue Sep 30, 2023 · 1 comment
Labels
bug Something isn't working untriaged Need triage

Comments


manugarri commented Sep 30, 2023

What is the bug?

I'm running a search job on a big batch file (900K records), so I'm using multi-search (`msearch`). The cluster has 3 data nodes and 3 master nodes.

I split the records into batches. The odd thing is that with batches of 5,000 records the job takes around 200 seconds to finish, and the AWS metrics show no apparent memory/CPU issue on any of the nodes.

However, with 10,000 records per `msearch` call, something strange happens.

For a while the cluster performs the search operations; I can see active/queued threads on the thread pool API endpoint `/_cat/thread_pool/search`. After a certain point, though, there are no more active/queued/rejected threads in the thread pool, yet the Python `msearch` call just hangs, and it hangs forever. I have to kill the Jupyter kernel to recover.
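For context, the batching described above looks roughly like this (a minimal sketch; `records`, the per-record index field, and the `match` query are placeholders, not the real job):

```python
import json

def build_msearch_bodies(records, batch_size=5000):
    """Split records into NDJSON _msearch payloads of at most batch_size searches.

    Each record becomes a header line (index routing) plus a query line,
    as the _msearch body format requires.
    """
    bodies = []
    for start in range(0, len(records), batch_size):
        lines = []
        for rec in records[start:start + batch_size]:
            # Records may target any of ~50 indices, so each search
            # carries its own index in the header line.
            lines.append(json.dumps({"index": rec["index"]}))
            lines.append(json.dumps({"query": {"match": {"name": rec["name"]}}}))
        # _msearch bodies must end with a trailing newline.
        bodies.append("\n".join(lines) + "\n")
    return bodies
```

Each returned body is then sent with one `search_client.msearch(...)` call; bumping `batch_size` from 5000 to 10000 is what triggers the hang.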

How can one reproduce the bug?

I can't share the data I'm using, unfortunately, and the data being searched is correlated with the number of records that makes the search hang.

But in a nutshell, running this:

msearch_result = search_client.msearch(
    msearch_query,
)

with a high volume of records makes the job hang, not on the OpenSearch side but on the Python client side.

It is important to note that the records query any of the 50 or so indices we have, so not all searches in the `msearch` call go to the same index.

However, using the `requests` library directly (with the aws-auth library for authentication) works perfectly:

# this works with no problem
resp = requests.post(
    'https://' + endpoint + '/_msearch',
    data=msearch_query,
    headers={'Content-Type': 'application/json'},
    timeout=500,
)

What is the expected behavior?

The Python client should handle the request, or, if the return body from the multi-search operation is too big, raise an appropriate exception.
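To illustrate the "raise an appropriate exception" expectation, here is a hedged sketch of a wrapper that turns the silent failure into an explicit error (the exception class and function name are made up for illustration; ideally the client would do something like this internally):

```python
class MsearchTimeoutError(Exception):
    """Raised when the gateway answers with a timeout body instead of results."""

def check_msearch_response(text):
    """Fail loudly if the response is the gateway's 'Request Timeout' body.

    The timeout body observed in this issue is not even valid JSON
    (it has a trailing comma), so a substring check is used here
    instead of json.loads.
    """
    if "Request Timeout" in text:
        raise MsearchTimeoutError("msearch timed out at the gateway: %r" % text)
    return text
```

With a wrapper like this, a hung or timed-out batch at least surfaces as a Python exception instead of a kernel that has to be killed.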

What is your host/environment?

opensearchpy 2.2.0

OS:
ProductName: macOS
ProductVersion: 14.0
BuildVersion: 23A344

@manugarri manugarri added bug Something isn't working untriaged Need triage labels Sep 30, 2023
manugarri commented Sep 30, 2023

UPDATE: I realised that the issue still happens when using the `requests` library. I'm not sure why an `msearch` request would hang after the cluster is done with the actual search, but it is not an issue with this library.

In fact, sometimes the query succeeds but the return message is `'{\n "message": "Request Timeout",\n}'`. Curiously, the only queries that fail are those that take more than 300 seconds, which means this is probably related to some networking timeout setting I can't seem to find.
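If the ~300-second ceiling really is a fixed network timeout somewhere in the path, one workaround is to shrink the batch size until each request finishes under it. A minimal sketch of that retry logic (the `send` callable is a stand-in for the real msearch request, which is assumed to raise `TimeoutError` on the timeout body):

```python
def msearch_with_backoff(records, send, batch_size=10000, min_batch=1000):
    """Send records in batches, halving the batch size whenever `send`
    signals a timeout, so each individual request stays under the limit.

    `send` takes a list of records and either returns their responses
    or raises TimeoutError.
    """
    results = []
    start = 0
    while start < len(records):
        batch = records[start:start + batch_size]
        try:
            results.extend(send(batch))
            start += len(batch)  # this slice succeeded; move on
        except TimeoutError:
            if batch_size <= min_batch:
                raise  # can't shrink further; give up
            batch_size //= 2  # retry the same slice with a smaller batch
    return results
```

This trades a few wasted timed-out requests for jobs that eventually complete, without needing to find the underlying timeout setting.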
