
[BUG] Flaky behaviour of integ tests on branch 2.x #534

Closed
vibrantvarun opened this issue Jan 8, 2024 · 2 comments
Labels
bug Something isn't working untriaged

Comments

@vibrantvarun
Member

What is the bug?

The integ tests are failing with the errors below on branch 2.x:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Model not ready yet. Please run this first: POST /_plugins/_ml/models/r4oy6owBwxZRcs2gKPsE/_deploy"}],"type":"illegal_argument_exception","reason":"Model not ready yet. Please run this first: POST /_plugins/_ml/models/r4oy6owBwxZRcs2gKPsE/_deploy"},"status":400}
        at __randomizedtesting.SeedInfo.seed([BB6A57C95BCB23CB:E1834FA49423DB9A]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:376)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.search(BaseNeuralSearchIT.java:396)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.search(BaseNeuralSearchIT.java:357)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.search(BaseNeuralSearchIT.java:343)

For example, during one build run, the integ tests listed below failed.

40 tests completed, 18 failed
 - org.opensearch.neuralsearch.processor.NeuralQueryEnricherProcessorIT.testNeuralQueryEnricherProcessor_whenNoModelIdPassed_thenSuccess
 - org.opensearch.neuralsearch.processor.NormalizationProcessorIT.testResultProcessor_whenOneShardAndQueryMatches_thenSuccessful
 - org.opensearch.neuralsearch.processor.NormalizationProcessorIT.testResultProcessor_whenDefaultProcessorConfigAndQueryMatches_thenSuccessful
 - org.opensearch.neuralsearch.processor.NormalizationProcessorIT.testResultProcessor_whenMultipleShardsAndQueryMatches_thenSuccessful
 - org.opensearch.neuralsearch.processor.ScoreCombinationIT.testGeometricMeanCombination_whenOneShardAndQueryMatches_thenSuccessful
 - org.opensearch.neuralsearch.processor.ScoreCombinationIT.testHarmonicMeanCombination_whenOneShardAndQueryMatches_thenSuccessful
 - org.opensearch.neuralsearch.processor.ScoreNormalizationIT.testMinMaxNorm_whenOneShardAndQueryMatches_thenSuccessful
 - org.opensearch.neuralsearch.processor.ScoreNormalizationIT.testL2Norm_whenOneShardAndQueryMatches_thenSuccessful
 - org.opensearch.neuralsearch.processor.TextEmbeddingProcessorIT.testTextEmbeddingProcessor
 - org.opensearch.neuralsearch.processor.TextImageEmbeddingProcessorIT.testEmbeddingProcessor_whenIngestingDocumentWithSourceMatchingTextMapping_thenSuccessful
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testBoostQuery
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testRescoreQuery
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testBasicQuery
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testMultimodalQuery
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testFilterQuery
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testNestedQuery
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testBooleanQuery_withNeuralAndBM25Queries
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testBooleanQuery_withMultipleNeuralQueries

The failed job link

The major error:

[ERROR][o.o.m.a.f.TransportForwardAction] [integTest-0] deploy model failed on all nodes, model id: dCky6owBtde70h6mIEFW
    (repeated 2 times)
[ERROR][o.o.m.e.a.DLModel        ] [integTest-0] Failed to deploy model dCky6owBtde70h6mIEFW
ai.djl.engine.EngineException: Failed to load PyTorch native library
	at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:84) ~[pytorch-engine-0.21.0.jar:?]
	at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40) ~[pytorch-engine-0.21.0.jar:?]
	at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[api-0.21.0.jar:?]
	at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:185) ~[opensearch-ml-algorithms-2.12.0.0-SNAPSHOT.jar:?]
	at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:280) [opensearch-ml-algorithms-2.12.0.0-SNAPSHOT.jar:?]
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:569) [?:?]
	at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:247) [opensearch-ml-algorithms-2.12.0.0-SNAPSHOT.jar:?]
	at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:139) [opensearch-ml-algorithms-2.12.0.0-SNAPSHOT.jar:?]
	at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) [opensearch-ml-algorithms-2.12.0.0-SNAPSHOT.jar:?]
	at org.opensearch.ml.model.MLModelManager.lambda$deployModel$51(MLModelManager.java:1014) [opensearch-ml-2.12.0.0-SNAPSHOT.jar:2.12.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0-SNAPSHOT.jar:2.12.0-SNAPSHOT]
	at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$71(MLModelManager.java:1511) [opensearch-ml-2.12.0.0-SNAPSHOT.jar:2.12.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0-SNAPSHOT.jar:2.12.0-SNAPSHOT]
	at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.12.0-SNAPSHOT.jar:2.12.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911) [opensearch-2.12.0-SNAPSHOT.jar:2.12.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0-SNAPSHOT.jar:2.12.0-SNAPSHOT]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]

How can one reproduce the bug?

When a PR is raised against neural-search, there is a significant chance that the Gradle check fails due to this flaky test behavior.

What is the expected behavior?

All checks should pass without flaky behavior.

What is your host/environment?

Linux, Windows

@vibrantvarun
Member Author

The issue has also been created on ml-commons: opensearch-project/ml-commons#1843

@jmazanec15
Member

Closing as it was fixed with opensearch-project/ml-commons#1876
