Fix flaky integ tests #487
Conversation
Signed-off-by: Tanqiu Liu <[email protected]>
@tanqiuliu can you add an entry in the changelog? Also, can you run this command in neural-search and paste the output, since the tests failed with this seed.
Codecov Report
@@            Coverage Diff            @@
##               main     #487   +/-  ##
=========================================
  Coverage     84.37%   84.37%
  Complexity      498      498
=========================================
  Files            40       40
  Lines          1491     1491
  Branches        228      228
=========================================
  Hits           1258     1258
  Misses          133      133
  Partials        100      100
Some workflows in CI are failing. Is this the same as the initial problem, or is it related to or caused by this change?
@navneet1v I executed that command multiple times trying to reproduce the issue in #384, but it always succeeds. It is possible that sometimes the 3-second wait time in the previous implementation is not long enough.
Got it. But as part of reproducing the bug, running the same command that failed the tests is the best way to reproduce the issue. I am glad that it's not a problem with src and is mainly around setup. Apart from this, can you add the entry for this in the changelog.md file so that the GH workflow can succeed.
@@ -623,6 +624,16 @@ protected void deleteModel(String modelId) {
        );
    }

@SneakyThrows
you can remove SneakyThrows from here and add the exception to the method signature.
Done.
protected void pollForModelState(String modelId, MLModelState expectedModelState, int intervalMs, int maxAttempts) {
    for (int i = 0; i < maxAttempts; i++) {
        Thread.sleep(intervalMs);
        if (expectedModelState.equals(getModelState(modelId))) {
            return;
        }
    }
To improve visibility into why tests are failing, I would say that at the end of the for loop, if the model is not in the correct state, we should fail the test with proper messaging for the failure.
Addressed
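The suggestion above can be sketched as follows. This is a hedged illustration only: `ModelState` and the `getModelState` stub are hypothetical stand-ins for ml-commons' `MLModelState` and the test helper of the same name, not the actual implementation.

```java
public class Main {
    // Stand-in enum for ml-commons' MLModelState.
    enum ModelState { DEPLOYED, UNDEPLOYED, DEPLOY_FAILED }

    // Stub helper: pretend the model never leaves UNDEPLOYED.
    static ModelState getModelState(String modelId) {
        return ModelState.UNDEPLOYED;
    }

    static void pollForModelState(String modelId, ModelState expected, int intervalMs, int maxAttempts)
        throws InterruptedException {
        for (int i = 0; i < maxAttempts; i++) {
            Thread.sleep(intervalMs);
            if (expected == getModelState(modelId)) {
                return;
            }
        }
        // Fail loudly with context instead of returning silently, so the
        // test logs explain why the run failed.
        throw new AssertionError(String.format(
            "Model %s did not reach state %s after %d attempts", modelId, expected, maxAttempts));
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            pollForModelState("model-123", ModelState.DEPLOYED, 1, 3);
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The key change versus a silent return is the `AssertionError` carrying the model id, the expected state, and the attempt count.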
.filter(
    hitsMap -> !Objects.isNull(hitsMap)
        && hitsMap.containsKey("model_id")
        && MLModelState.DEPLOYED.equals(getModelState(hitsMap.get("model_id").toString()))
I think MLModelState is an enum, so you can use == to compare two enum values, which gives you a compile-time type check on the values being compared.
Done
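The point about enum comparison can be illustrated with a small stand-in enum (not the real `MLModelState`): `==` is null-safe on either operand and mismatched enum types fail at compile time, while `equals()` on a null reference throws an NPE and silently accepts the wrong type.

```java
public class Main {
    // Stand-in enum for illustration only.
    enum ModelState { DEPLOYED, UNDEPLOYED }

    public static void main(String[] args) {
        ModelState state = null;
        // state.equals(...) here would throw NullPointerException;
        // == simply evaluates to false.
        System.out.println(state == ModelState.DEPLOYED);
        state = ModelState.DEPLOYED;
        System.out.println(state == ModelState.DEPLOYED);
        // Comparing against a different enum type with == would not compile,
        // whereas equals() would compile and always return false.
    }
}
```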
.filter(
    hitsMap -> !Objects.isNull(hitsMap)
        && hitsMap.containsKey("model_id")
        && MLModelState.DEPLOYED.equals(getModelState(hitsMap.get("model_id").toString()))
If no deployed models are found, what is the expectation from the tests?
If no deployed model was found, the test will fail at the assertion here: https://github.com/opensearch-project/neural-search/blob/main/src/test/java/org/opensearch/neuralsearch/common/BaseNeuralSearchIT.java#L748
As I mentioned below, there seems to be latency between task completion and the model_state update to DEPLOYED, so I added a poll for model state there as well to avoid the issue.
@martin-gaievski This is another flaky issue. It previously appears as
Signed-off-by: Tanqiu Liu <[email protected]>
// after model undeploy returns, the max interval to update model status is 3s in ml-commons CronJob.
Thread.sleep(3000);
// wait for model undeploy to complete
pollForModelState(modelId, Set.of(MLModelState.UNDEPLOYED, MLModelState.DEPLOY_FAILED), 3000, 5);
Sometimes the undeploy action results in a DEPLOY_FAILED state, but this does not block the model from being deleted, so I set both UNDEPLOYED and DEPLOY_FAILED as exit states.
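The set-of-exit-states approach described above can be sketched like this. The enum and `getModelState` stub are hypothetical stand-ins, and the intervals are shrunk so the sketch runs quickly; the real test uses 3000 ms and 5 attempts.

```java
import java.util.Set;

public class Main {
    // Stand-in enum for ml-commons' MLModelState.
    enum ModelState { DEPLOYED, UNDEPLOYED, DEPLOY_FAILED }

    // Stub: pretend undeploy left the model in DEPLOY_FAILED.
    static ModelState getModelState(String modelId) {
        return ModelState.DEPLOY_FAILED;
    }

    static boolean pollForModelState(String modelId, Set<ModelState> exitStates, int intervalMs, int maxAttempts)
        throws InterruptedException {
        for (int i = 0; i < maxAttempts; i++) {
            Thread.sleep(intervalMs);
            // Accept any terminal state, not just UNDEPLOYED, since
            // DEPLOY_FAILED also permits deletion.
            if (exitStates.contains(getModelState(modelId))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(pollForModelState(
            "model-123", Set.of(ModelState.UNDEPLOYED, ModelState.DEPLOY_FAILED), 1, 5));
    }
}
```

Using `Set.contains` keeps the exit condition open-ended: adding another acceptable terminal state later is a one-line change at the call site.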
@navneet1v Hi Navneet, I've addressed your comments. Will you be able to take a look?
All checks have passed. Will I be able to get a review here?
@@ -623,6 +626,28 @@ protected void deleteModel(String modelId) {
        );
    }

protected void pollForModelState(String modelId, Set<MLModelState> exitModelStates, int intervalMs, int maxAttempts)
can we make interval and maxAttempts class-level constants, like DEFAULT_<item_name>, and remove them from the method signature? They can be added back later if needed, but most of the time I think the defaults will just work.
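One way to read this suggestion is as a pair of class-level defaults plus an overload, roughly as below. All names here are hypothetical sketches of the idea, not code from the PR; the `poll` helper just reports its arguments so the wiring is visible.

```java
public class Main {
    // Hypothetical class-level defaults, per the reviewer's naming scheme.
    private static final int DEFAULT_POLL_INTERVAL_MS = 3000;
    private static final int DEFAULT_MAX_POLL_ATTEMPTS = 5;

    // Full signature kept for the rare test that needs custom knobs.
    static String poll(String modelId, int intervalMs, int maxAttempts) {
        return String.format("polling %s every %d ms, up to %d attempts",
            modelId, intervalMs, maxAttempts);
    }

    // Short overload using the defaults keeps most call sites simple.
    static String poll(String modelId) {
        return poll(modelId, DEFAULT_POLL_INTERVAL_MS, DEFAULT_MAX_POLL_ATTEMPTS);
    }

    public static void main(String[] args) {
        System.out.println(poll("model-123"));
    }
}
```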
@@ -623,6 +626,28 @@ protected void deleteModel(String modelId) {
        );
    }

protected void pollForModelState(String modelId, Set<MLModelState> exitModelStates, int intervalMs, int maxAttempts)
    throws InterruptedException {
    MLModelState currentState = null;
can we move this initialization into the loop? it's not required to have it in a top-level scope
@@ -733,11 +758,33 @@ protected Set<String> findDeployedModels() {
    List<Map<String, Object>> innerHitsMap = (List<Map<String, Object>>) hits.get("hits");
    return innerHitsMap.stream()
        .map(hit -> (Map<String, Object>) hit.get("_source"))
        .filter(hitsMap -> !Objects.isNull(hitsMap) && hitsMap.containsKey("model_id"))
        .filter(
            hitsMap -> !Objects.isNull(hitsMap)
Objects.nonNull()
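The suggested replacement of `!Objects.isNull(...)` reads best as a method reference in a stream filter. Below is a self-contained sketch on sample data (the hit maps are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) {
        // Sample _source maps: one valid, one null, one missing model_id.
        List<Map<String, Object>> sources = Arrays.asList(
            Map.of("model_id", "abc"),
            null,
            Map.of("name", "no-id")
        );
        // Objects::nonNull replaces the negated !Objects.isNull(...) lambda.
        List<String> ids = sources.stream()
            .filter(Objects::nonNull)
            .filter(src -> src.containsKey("model_id"))
            .map(src -> src.get("model_id").toString())
            .collect(Collectors.toList());
        System.out.println(ids);
    }
}
```

Both `Objects.isNull` and `Objects.nonNull` exist in `java.util.Objects` since Java 8 and were added precisely for use as stream predicates.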
        EntityUtils.toString(getModelResponse.getEntity()),
        false
    );
    return MLModelState.valueOf((String) getModelResponseJson.get("model_state"));
can we also check that the "model_state" key is present in the response? I can imagine it being absent in case of a service or network error; we can assert on this
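The guard the reviewer asks for might look like the sketch below. The enum and method names are hypothetical stand-ins for the real helper; the point is failing with a clear message instead of a NullPointerException when the key is missing.

```java
import java.util.Map;

public class Main {
    // Stand-in enum for ml-commons' MLModelState.
    enum ModelState { DEPLOYED, UNDEPLOYED }

    // Check the key before parsing, so a truncated or error response
    // fails with an explanatory message rather than an NPE.
    static ModelState parseModelState(Map<String, Object> responseJson) {
        Object raw = responseJson.get("model_state");
        if (raw == null) {
            throw new AssertionError("response has no model_state: " + responseJson);
        }
        return ModelState.valueOf(raw.toString());
    }

    public static void main(String[] args) {
        System.out.println(parseModelState(Map.of("model_state", "DEPLOYED")));
        try {
            parseModelState(Map.of());
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```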
@@ -7,6 +7,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
### Features
### Enhancements
### Bug Fixes
- Fixed flaky integration tests caused by model_state transition latency.
please include link to PR
@tanqiuliu are you still working on the PR? We have to fix these flaky tests for 2.12. Please respond if you are still working on the PR.
I can address the comments and update the PR when I have time, probably this weekend.
Description
- Fixed flaky integ tests caused by model_state not yet having changed to UNDEPLOYED after the undeploy invocation. Added a poller to wait for the state change before moving forward.
- Added a check in BaseNeuralSearchIT.findDeployedModels() to ensure model_state = DEPLOYED. This avoids encountering the following exception when performing a neural search in integ tests.

Issues Resolved
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.