Fix race condition in TextImageEmbeddingProcessor integration tests #1093

Closed


@ghost ghost commented Jan 14, 2025

Description

The integration tests for TextImageEmbeddingProcessor create and delete resources under the same names, which can cause failures when integration tests run in parallel. For example, see this failed Jenkins run:

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.neuralsearch.processor.TextImageEmbeddingProcessorIT.testEmbeddingProcessor_whenIngestingDocumentWithOrWithoutSourceMatchingMapping_thenSuccessful" -Dtests.seed=E062306E78BCA0FB -Dtests.security.manager=false -Dtests.locale=zh-SG -Dtests.timezone=America/Vancouver -Druntime.java=21

Suite: Test class org.opensearch.neuralsearch.processor.TextImageEmbeddingProcessorIT

  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.neuralsearch.processor.TextImageEmbeddingProcessorIT.testEmbeddingProcessor_whenIngestingDocumentWithOrWithoutSourceMatchingMapping_thenSuccessful" -Dtests.seed=E062306E78BCA0FB -Dtests.security.manager=false -Dtests.locale=zh-SG -Dtests.timezone=America/Vancouver -Druntime.java=21
  2> org.opensearch.client.ResponseException: method [DELETE], host [http://localhost:9200/], URI [/_ingest/pipeline/ingest-pipeline], status line [HTTP/1.1 404 Not Found]
    {"error":{"root_cause":[{"type":"resource_not_found_exception","reason":"pipeline [ingest-pipeline] is missing"}],"type":"resource_not_found_exception","reason":"pipeline [ingest-pipeline] is missing"},"status":404}
        at __randomizedtesting.SeedInfo.seed([E062306E78BCA0FB:CD13011798778F4D]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:479)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:371)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.deletePipeline(BaseNeuralSearchIT.java:1440)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.wipeOfTestResources(BaseNeuralSearchIT.java:1509)
        at app//org.opensearch.neuralsearch.processor.TextImageEmbeddingProcessorIT.testEmbeddingProcessor_whenIngestingDocumentWithOrWithoutSourceMatchingMapping_thenSuccessful(TextImageEmbeddingProcessorIT.java:57)
  2> NOTE: leaving temporary files on disk at: /tmp/tmpb1cp4gph/neural-search/build/testrun/integTest/temp/org.opensearch.neuralsearch.processor.TextImageEmbeddingProcessorIT_E062306E78BCA0FB-001
  2> NOTE: test params are: codec=Lucene912, sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=zh-SG, timezone=America/Vancouver
  2> NOTE: Linux 6.1.109-118.189.amzn2023.aarch64 aarch64/Eclipse Adoptium 21.0.1 (64-bit)/cpus=4,threads=3,free=502566984,total=536870912
  2> NOTE: All tests run in this JVM: [NeuralSearchIT, ValidateDependentPluginInstallationIT, HybridQueryExecutorIT, NeuralQueryEnricherProcessorIT, NeuralSparseTwoPhaseProcessorIT, NormalizationProcessorIT, ScoreCombinationIT, ScoreNormalizationIT, SparseEncodingProcessIT, TextChunkingProcessorIT, TextEmbeddingProcessorIT, TextImageEmbeddingProcessorIT]

This PR gives each test its own pipeline and index names, which prevents collisions where one test deletes a resource that another test still needs.
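As a rough illustration of the per-test naming idea, here is a minimal sketch. The `TestResourceNames` helper and its `unique` method are hypothetical and not part of neural-search or of this PR's diff; they only show one way to derive a name that two tests are very unlikely to share.

```java
import java.util.Locale;
import java.util.UUID;

// Hypothetical helper (an assumption for illustration, not the PR's actual code):
// derives resource names that differ per test and per invocation.
final class TestResourceNames {
    private TestResourceNames() {}

    // Combines a resource prefix, the test name, and a short random suffix.
    // The result is lowercased because OpenSearch index names must be lowercase.
    static String unique(String prefix, String testName) {
        String suffix = UUID.randomUUID().toString().substring(0, 8);
        return (prefix + "-" + testName + "-" + suffix).toLowerCase(Locale.ROOT);
    }
}
```

A test could then call something like `TestResourceNames.unique("ingest-pipeline", "myTest")` once, keep the result in a local variable, and use that same value for both creation and cleanup.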

Related Issues

Resolves #1091

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@will-hwang
Contributor

I'm curious as to how this was able to pass before. How did you verify this was the issue?

Member

@martin-gaievski martin-gaievski left a comment

The code looks good. Please address one minor comment from my side.

@martin-gaievski
Member

Can you address the failed DCO check? You can amend your latest commit with the "-s" option.

@zhichao-aws
Member

> i'm curious as to how this was able to pass before. How did you verify this was the issue?

I have the same question. Can you provide more context?

@ghost
Author

ghost commented Jan 14, 2025

@will-hwang @zhichao-aws I wasn't able to reproduce this locally. My reasoning was that #1075 added a second integration test to this file, which previously had only one, and TESTING.md mentions that tests run in parallel, which appears to be configurable here. Two tests running in parallel and both trying to clean up the same resource would explain the 404 error. It would also be good practice to make sure our integration tests can run in parallel where possible.

I'm not sure why this wouldn't cause issues for the other integration tests that use the same pattern, though. On a second pass, I see that these integration tests shouldn't run in parallel at all, since maxParallelForks is set to 1 in the test tasks themselves. I also see that the integration tests passed last night and that #1091 was resolved without any changes. I'll take a deeper look. If I find a different root cause, I'll close this PR and open another fix.

@ghost ghost force-pushed the fix-it-race-condition branch from 6076d70 to dc703e9 on January 14, 2025 16:59
@heemin32
Collaborator

heemin32 commented Jan 14, 2025

@will-hwang @zhichao-aws, thanks for raising this question. I was the one who initially suggested this possible root cause to @q-andy. However, after @q-andy's explanation, a race condition seems unlikely, especially since other parts of the code follow the same pattern of reusing resource names. The tests appear to run sequentially.

That said, I believe another possible root cause could be that the test failed before the pipeline was created.

When testing locally, I intentionally threw an exception before creating an index and then attempted to delete the index in the finally block. The delete returned a 404 error, which masked the original failure. To address this, we should update the code so that the original error is not hidden.
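The masking behavior described above can be reproduced in plain Java. The sketch below is illustrative only: `MaskedFailureDemo` and its methods are made-up names, and the exception types merely stand in for the real test failure and the 404 from cleanup. It relies on standard Java semantics: an exception thrown from a finally block replaces the one already in flight, and `Throwable.addSuppressed` is one way to keep both.

```java
// Illustrative sketch with hypothetical names; not code from neural-search.
final class MaskedFailureDemo {
    private MaskedFailureDemo() {}

    // Cleanup in the finally block throws, so the caller sees only the
    // cleanup error and the original failure is lost.
    static void masked() {
        try {
            throw new IllegalStateException("model registration failed");
        } finally {
            deleteMissingPipeline();
        }
    }

    // The cleanup error is attached to the original failure as a suppressed
    // exception, so the original failure is what the caller sees.
    static void preserved() {
        RuntimeException primary = null;
        try {
            throw new IllegalStateException("model registration failed");
        } catch (RuntimeException e) {
            primary = e;
            throw e;
        } finally {
            try {
                deleteMissingPipeline();
            } catch (RuntimeException cleanupError) {
                if (primary == null) {
                    throw cleanupError; // nothing else failed, so surface the cleanup error
                }
                primary.addSuppressed(cleanupError);
            }
        }
    }

    // Stand-in for a delete call that fails because the resource was never created.
    static void deleteMissingPipeline() {
        throw new UnsupportedOperationException("404: pipeline [ingest-pipeline] is missing");
    }
}
```

Calling `masked()` surfaces the `UnsupportedOperationException`, while `preserved()` surfaces the `IllegalStateException` with the cleanup error listed under it as suppressed.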

One thing to note: when I first tried to delete a nonexistent pipeline, the request did not return a 404 error but 200 OK. I then created a sample pipeline and deleted it. After that, trying to delete a nonexistent pipeline returned a 404 error. This looks like a bug in OpenSearch core.

@ghost
Author

ghost commented Jan 14, 2025

Looking at the CI test report, we can see that the tests run sequentially, so this is not a parallelization issue. The cluster stdout at the timestamp of the test run shows the real error: a memory circuit breaker tripping while the model is being loaded/registered.

[2025-01-13T02:43:56,582][ERROR][o.o.m.a.r.TransportRegisterModelAction] [node_name_9200] Failed to register model
org.opensearch.transport.RemoteTransportException: [node_name_9200][127.0.0.1:9300][cluster:admin/opensearch/mlinternal/forward]
Caused by: org.opensearch.core.common.breaker.CircuitBreakingException: Memory Circuit Breaker is open, please check your resources!

This is likely caused by the integration tests loading many local models sequentially, which uses a lot of memory. So this is the case @heemin32 described: the test fails in loadModel and enters the finally block early, then tries to delete resources before they have been created, which throws the 404.

I'll close this PR and open a separate issue to refactor our integration test logic:

  • Run resource cleanup after every test rather than as part of every test, so cleanup won't directly cause test failures
  • Log errors more clearly during resource cleanup to avoid confusion
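One possible shape for the two refactoring goals above is sketched here. `CleanupRegistry` is a hypothetical class, not something that exists in neural-search: the assumption is that each test registers a cleanup step when it creates a resource, and that a teardown hook (for example a JUnit `@After` method) calls `runAll()` once the test has finished.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; the class and method names are assumptions for illustration.
final class CleanupRegistry {
    // A cleanup action that is allowed to throw, e.g. a REST delete call.
    interface Step {
        void run() throws Exception;
    }

    private static final class Entry {
        final String description;
        final Step step;

        Entry(String description, Step step) {
            this.description = description;
            this.step = step;
        }
    }

    private final Deque<Entry> steps = new ArrayDeque<>();

    // Called by a test right after it creates a resource.
    void register(String description, Step step) {
        steps.push(new Entry(description, step)); // most recent resource is cleaned up first
    }

    // Runs every registered step. Failures are logged with the resource
    // description and collected, never thrown, so cleanup cannot fail a test
    // or hide the test's own error.
    List<Exception> runAll() {
        List<Exception> failures = new ArrayList<>();
        while (!steps.isEmpty()) {
            Entry entry = steps.pop();
            try {
                entry.step.run();
            } catch (Exception e) {
                System.err.println("Cleanup failed for " + entry.description + ": " + e.getMessage());
                failures.add(e);
            }
        }
        return failures;
    }
}
```

Because only resources that were actually created get registered, a test that fails early (for example while loading a model) leaves nothing to delete, which avoids the misleading 404 in the first place.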

@q-andy

q-andy commented Jan 16, 2025

Hi, I was the original author. I'm adding myself as the point of contact for this change in case there are any future questions.

This pull request was closed.
Labels
autocut, backport 2.x, skip-changelog, v2.19.0
Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Integration Test Failed for neural-search-2.19.0
5 participants