-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition in TextImageEmbeddingProcessor integration tests #1093
Conversation
i'm curious as to how this was able to pass before. How did you verify this was the issue? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good, please address one minor comment from my side
src/test/java/org/opensearch/neuralsearch/processor/TextImageEmbeddingProcessorIT.java
Show resolved
Hide resolved
Can you address DCO failed check, you can amend your latest commit with "-s" option |
I have the same question. Can you provide more context? |
@will-hwang @zhichao-aws I wasn't able to reproduce this locally, but my thought process was that #1075 added a 2nd integration test to this file when previously there was only 1, and I saw the TESTING.md mention that tests are run in parallel which seems configurable here. Both tests running in parallel and attempting to clean up the same resource would explain the 404 error, and it'd be good practice to ensure our integration tests can run in parallel when possible as well. Not sure why this wouldn't cause issues with other integration tests which use the same pattern though, and on a second pass I see these integration tests shouldn't run in parallel since maxParallelForks is set to 1 in the test task themselves. I also see that the integration tests passed last night and #1091 was resolved without any changes too. I'll take a deeper look, if I can find a different root cause I can close this PR and make another fix |
Signed-off-by: Andy Qin <“[email protected]”>
6076d70
to
dc703e9
Compare
@will-hwang @zhichao-aws, thanks for bringing up this question. I was the one who initially pointed out to @q-andy the possible root cause of the failure. However, after @q-andy’s explanation, it seems unlikely that a race condition is the issue, especially since there are other parts of the code following the same pattern of using the same resource name. The test appears to run sequentially. That said, I believe another possible root cause could be that the test failed before the pipeline was created. When testing locally, I intentionally threw an exception before creating an index and then attempted to delete the index in the One thing to note is that, when I try to delete non existing pipeline, it didn't throw 404 error but returned with 200 OK. Then, I created one sample pipeline and deleted it. After that, if I try to delete non existing pipeline, it throws 404 error. Seems like a bug from OpenSearch core. |
Looking at the CI test report we can see the tests are run sequentially so it's not a parallelization issue. Checking the cluster stdout at the timestamp the tests are run, the real error is a memory circuit breaker triggering while loading/registering the model
This is likely from the integration tests loading a bunch of local models sequentially which uses a lot of memory. So it's the case which @heemin32 mentioned where the test fails in loadModel and enters the finally block early, then attempts to delete resources before they're created throwing the 404. I'll close this PR and open a separate issue to refactor our integration test logic
|
Hi, I was the original author, adding myself as POC for this change if there are any future questions |
Description
The integration tests for TextImageEmbeddingProcessor create and delete resources with the same name leading to issues when running integration tests in parallel, e.g. failed Jenkins run
This PR renames the pipelines/indices per test that are created to prevent a collision where one test deletes a resource needed in another test.
Related Issues
Resolves #1091
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.