
Fix flaky integ tests #487

Closed
wants to merge 2 commits

Conversation

tanqiuliu

Description

  • Fixed the flaky integration tests in [BUG] Flaky integ test testBooleanQuery_withNeuralAndBM25Queries, testBasicQuery #384. The test failures were due to the model_state not yet having changed to UNDEPLOYED after the undeploy invocation. Added a poller to wait for the state change before moving forward (a sketch of the flow follows the error output below).
  • Updated BaseNeuralSearchIT.findDeployedModels() to ensure model_state = DEPLOYED. This avoids the following exception when performing a neural search in integ tests:
org.opensearch.client.ResponseException: method [POST], host [http://[::1]:56962], URI [/test-neural-basic-index/_search?size=1], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Model not ready yet. Please run this first: POST /_plugins/_ml/models/7oMmqIsBDNjwFOltGQ_N/_deploy"}],"type":"illegal_argument_exception","reason":"Model not ready yet. Please run this first: POST /_plugins/_ml/models/7oMmqIsBDNjwFOltGQ_N/_deploy"},"status":400}

Issues Resolved

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Tanqiu Liu <[email protected]>
@navneet1v
Collaborator

navneet1v commented Nov 7, 2023

@tanqiuliu can you add an entry in the change log?

Also, can you run this command in neural-search and paste the output, since the tests failed with this seed?

gradlew ':integTest' --tests "org.opensearch.neuralsearch.query.NeuralQueryIT.testBasicQuery" -Dtests.seed=DFDA155AC6A13FB4 -Dtests.security.manager=false -Dtests.locale=ga -Dtests.timezone=America/La_Paz

@navneet1v navneet1v added the bug (Something isn't working), Maintenance (Add support for new versions of OpenSearch/Dashboards from upstream), and integ-test-failure (Integration test failures) labels on Nov 7, 2023

codecov bot commented Nov 7, 2023

Codecov Report

Merging #487 (371a9dd) into main (cda2f82) will not change coverage.
Report is 1 commit behind head on main.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main     #487   +/-   ##
=========================================
  Coverage     84.37%   84.37%           
  Complexity      498      498           
=========================================
  Files            40       40           
  Lines          1491     1491           
  Branches        228      228           
=========================================
  Hits           1258     1258           
  Misses          133      133           
  Partials        100      100           

@martin-gaievski
Member

Some workflows in CI are failing; is this the same as the initial problem, or is it related to/caused by this change?

https://github.com/opensearch-project/neural-search/actions/runs/6781161374/job/18449994428?pr=487#step:4:80

REPRODUCE WITH: gradlew ':integTest' --tests "org.opensearch.neuralsearch.query.NeuralSparseQueryIT.testRescoreQuery" -Dtests.seed=1F7B3ED0649C5765 -Dtests.security.manager=false -Dtests.locale=et -Dtests.timezone=Europe/Sofia -Druntime.java=17
org.opensearch.neuralsearch.query.NeuralSparseQueryIT > testRescoreQuery FAILED
    java.lang.AssertionError: expected:<1> but was:<0>
        at __randomizedtesting.SeedInfo.seed([1F7B3ED0649C5765:34DD958B2EADE4C7]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.neuralsearch.common.BaseNeuralSearchIT.getDeployedModelId(BaseNeuralSearchIT.java:781)
        at org.opensearch.neuralsearch.query.NeuralSparseQueryIT.testRescoreQuery(NeuralSparseQueryIT.java:135)

@tanqiuliu
Author

@navneet1v I executed that command multiple times trying to reproduce the issue in #384, but it always succeeds. It is possible that the 3-second wait time in the previous implementation is sometimes not long enough.

tanqiuliu@Tanqius-MacBook-Pro neural-search % ./gradlew ':integTest' --tests "org.opensearch.neuralsearch.query.NeuralQueryIT.testBasicQuery" -Dtests.seed=DFDA155AC6A13FB4 -Dtests.security.manager=false -Dtests.locale=ga -Dtests.timezone=America/La_Paz
Starting a Gradle Daemon (subsequent builds will be faster)
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 8.1.1
  OS Info               : Mac OS X 14.1 (x86_64)
  JDK Version           : 17 (Amazon Corretto JDK)
  JAVA_HOME             : /Library/Java/JavaVirtualMachines/amazon-corretto-17.jdk/Contents/Home
  Random Testing Seed   : DFDA155AC6A13FB4
  In FIPS 140 mode      : false
=======================================

BUILD SUCCESSFUL in 1m 23s
12 actionable tasks: 7 executed, 5 up-to-date

@navneet1v
Collaborator

> @navneet1v I executed that command multiple times trying to reproduce the issue in #384, but it always succeeds. It is possible that the 3-second wait time in the previous implementation is sometimes not long enough.


Got it. As part of reproducing the bug, running the same command that failed the tests is the best way to reproduce the issue. But I am glad that it's not a problem with the source, and it's mainly around setup.

Apart from this, can you add an entry for this in the CHANGELOG.md file so that the GH workflow can succeed?

@@ -623,6 +624,16 @@ protected void deleteModel(String modelId) {
);
}

@SneakyThrows
Collaborator

You can remove @SneakyThrows from here and add the exception to the method signature.

Author

Done.

Comment on lines 628 to 634
protected void pollForModelState(String modelId, MLModelState expectedModelState, int intervalMs, int maxAttempts) {
    for (int i = 0; i < maxAttempts; i++) {
        Thread.sleep(intervalMs);
        if (expectedModelState.equals(getModelState(modelId))) {
            return;
        }
    }
Collaborator

To improve visibility into why the tests are failing, I would say that at the end of the for loop, if the model is not in the expected state, we should fail the test with proper messaging for the failure.
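
One way that suggestion could look — a sketch only, assuming org.junit.Assert and java.util.Locale are available in the test base, with the message wording being illustrative:

protected void pollForModelState(String modelId, MLModelState expectedModelState, int intervalMs, int maxAttempts)
    throws InterruptedException {
    for (int i = 0; i < maxAttempts; i++) {
        Thread.sleep(intervalMs);
        if (expectedModelState == getModelState(modelId)) {
            return;
        }
    }
    // Fail loudly instead of returning silently, so the test report explains why the run stopped.
    Assert.fail(
        String.format(
            Locale.ROOT,
            "Model %s did not reach state %s after %d attempts (interval %d ms)",
            modelId,
            expectedModelState,
            maxAttempts,
            intervalMs
        )
    );
}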

Author

Addressed

.filter(
    hitsMap -> !Objects.isNull(hitsMap)
        && hitsMap.containsKey("model_id")
        && MLModelState.DEPLOYED.equals(getModelState(hitsMap.get("model_id").toString()))
Collaborator

I think MLModelState is an enum, so you can use == to compare two enums, which gives you a compile-time check that the values being compared are of the same type.
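
For illustration only, the compile-time and null-safety benefit that comment refers to (assuming the surrounding test-class context where modelId and getModelState are in scope):

MLModelState state = getModelState(modelId);
// state.equals("DEPLOYED") compiles and silently returns false, and equals() throws NPE if state is null;
// == only compiles when both operands are MLModelState values and never throws on a null reference.
boolean ready = state == MLModelState.DEPLOYED;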

Author

Done

Comment on lines 747 to 750
.filter(
    hitsMap -> !Objects.isNull(hitsMap)
        && hitsMap.containsKey("model_id")
        && MLModelState.DEPLOYED.equals(getModelState(hitsMap.get("model_id").toString()))
Collaborator

If no deployed models are found, what is the expectation from the tests?

Author

If no deployed model is found, the test will fail at the assertion here: https://github.com/opensearch-project/neural-search/blob/main/src/test/java/org/opensearch/neuralsearch/common/BaseNeuralSearchIT.java#L748
As I mentioned below, there seems to be a latency between the task completing and model_state updating to DEPLOYED, so I added a poll for the model state there as well to avoid the issue.

@tanqiuliu
Author

tanqiuliu commented Nov 8, 2023

@martin-gaievski This is another flaky issue. It previously appeared as "Model not ready yet.", as I mentioned in the PR description, and changed to this assertion failure once I added a filter on the model state. The root cause seems to be that during the initialization of an IT, loadModel() is invoked to deploy the model and poll for the deploy task to complete. However, when the task is marked COMPLETED, there can be a latency before model_state shows DEPLOYED, and until then the model is not usable.
I added a poll for model_state = DEPLOYED in loadModel() to avoid running into such cases.
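
A rough sketch of that ordering — deployModel and waitForTaskToComplete are assumed names for the existing test utilities, and the interval/attempt values are illustrative, not the exact code in this PR:

protected void loadModel(String modelId) throws Exception {
    String taskId = deployModel(modelId);        // assumed helper issuing POST /_plugins/_ml/models/{modelId}/_deploy
    waitForTaskToComplete(taskId);               // existing polling on the deploy task (assumed name)
    // A COMPLETED task does not guarantee the model document already shows DEPLOYED,
    // so poll the model itself before any neural query uses it.
    pollForModelState(modelId, MLModelState.DEPLOYED, 1000, 10);
}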

// after model undeploy returns, the max interval to update model status is 3s in ml-commons CronJob.
Thread.sleep(3000);
// wait for model undeploy to complete
pollForModelState(modelId, Set.of(MLModelState.UNDEPLOYED, MLModelState.DEPLOY_FAILED), 3000, 5);
Author

Sometimes the undeploy action results in a DEPLOY_FAILED state, but this does not block the model from being deleted, so I set both UNDEPLOYED and DEPLOY_FAILED as exit states.

@tanqiuliu
Author

@navneet1v Hi Navneet, I've addressed your comments. Will you be able to take a look?

@tanqiuliu
Author

All checks have passed. Will I be able to get a review here?

@@ -623,6 +626,28 @@ protected void deleteModel(String modelId) {
);
}

protected void pollForModelState(String modelId, Set<MLModelState> exitModelStates, int intervalMs, int maxAttempts)
Member

Can we make interval and maxAttempts class-level constants, like DEFAULT_<item_name>, and remove them from the method signature? Parameters can be added back later if needed, but most of the time I think the defaults will just work.
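
A possible shape for that — the constant names and values below are suggestions only, not part of the PR:

// Illustrative defaults; the real values would come from the existing call sites.
private static final int DEFAULT_POLL_INTERVAL_MS = 3000;
private static final int DEFAULT_MAX_POLL_ATTEMPTS = 5;

protected void pollForModelState(String modelId, Set<MLModelState> exitModelStates) throws InterruptedException {
    pollForModelState(modelId, exitModelStates, DEFAULT_POLL_INTERVAL_MS, DEFAULT_MAX_POLL_ATTEMPTS);
}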

@@ -623,6 +626,28 @@ protected void deleteModel(String modelId) {
);
}

protected void pollForModelState(String modelId, Set<MLModelState> exitModelStates, int intervalMs, int maxAttempts)
    throws InterruptedException {
    MLModelState currentState = null;
Member

Can we move this initialization into the loop? It's not required to have it at the top-level scope.

@@ -733,11 +758,33 @@ protected Set<String> findDeployedModels() {
List<Map<String, Object>> innerHitsMap = (List<Map<String, Object>>) hits.get("hits");
return innerHitsMap.stream()
    .map(hit -> (Map<String, Object>) hit.get("_source"))
    .filter(hitsMap -> !Objects.isNull(hitsMap) && hitsMap.containsKey("model_id"))
    .filter(
        hitsMap -> !Objects.isNull(hitsMap)
Member

Objects.nonNull()

EntityUtils.toString(getModelResponse.getEntity()),
false
);
return MLModelState.valueOf((String) getModelResponseJson.get("model_state"));
Member

Can we also check whether the "model_state" key is present in the response? I can imagine it could be missing in the case of a service or network error; we can assert on this.
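
One way to add that check — a sketch assuming org.junit.Assert is already available in the test base and getModelResponseJson is the parsed map from the snippet above:

Object modelState = getModelResponseJson.get("model_state");
// Guard against a partial or error response before parsing the enum, as suggested above.
Assert.assertNotNull("model_state is missing from the get-model response: " + getModelResponseJson, modelState);
return MLModelState.valueOf((String) modelState);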

@@ -7,6 +7,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
### Features
### Enhancements
### Bug Fixes
- Fixed flaky integration tests caused by model_state transition latency.
Member

Please include a link to the PR.

@navneet1v
Collaborator

@tanqiuliu are you still working on the PR? We have to fix these flaky tests for 2.12. Please respond if you are still working on the PR.

@tanqiuliu
Author

> @tanqiuliu are you still working on the PR? We have to fix these flaky tests for 2.12. Please respond if you are still working on the PR.

I can address the comments and update the PR when I have time, probably this weekend

@navneet1v
Collaborator

Closing this PR in favor of #559. Please refer to GH issue #384 for more details.

@navneet1v navneet1v closed this Jan 30, 2024