Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds deploy model flag support for local model registration, fixes integration tests #350

Merged
merged 24 commits into from
Jan 5, 2024

Conversation

joshpalis
Copy link
Member

@joshpalis joshpalis commented Jan 2, 2024

Description

This PR achieves multiple things :

  • Enables register local model and deploy model integration test
  • Adds support for setting the deploy flag to true for local model registration
  • Modifies the AbstractRetryableWorkflowStep retryableGetMLTask to accept an action listener and moves completing the future/ simulating model deployment within the step classes themselves
  • Fixes the DeprovisionWorkflowTransportAction so that the deprovision sequence is ascertained from the resources created, rather than the provision sequence
  • Adds a new 1 step template to test out register local model and deployment
  • Mitigates ML Circuit breaker issues due to memory intensive tasks (due to multiple local / remote model registration tests). These tests have been reduced to just 2 tests for local and remote model registration. Github runners have limited memory/storage

Resources Created output for the single step register local model template :

 curl -i -XGET "localhost:9200/_plugins/_flow_framework/workflow/Cs_q0YwBbQBv2_MrXO2l/_status?all=true&pretty"
HTTP/1.1 200 OK
X-OpenSearch-Version: OpenSearch/3.0.0-SNAPSHOT (opensearch)
content-type: application/json; charset=UTF-8
content-length: 603

{
  "workflow_id" : "Cs_q0YwBbQBv2_MrXO2l",
  "state" : "COMPLETED",
  "provisioning_progress" : "DONE",
  "provision_start_time" : 1704328870019,
  "provision_end_time" : 1704328885091,
  "resources_created" : [
    {
      "workflow_step_name" : "register_local_model",
      "workflow_step_id" : "workflow_step_1",
      "resource_type" : "model_id",
      "resource_id" : "Dc_q0YwBbQBv2_MruO2z"
    },
    {
      "workflow_step_name" : "deploy_model",
      "workflow_step_id" : "workflow_step_1-deploy",
      "resource_type" : "model_id",
      "resource_id" : "Dc_q0YwBbQBv2_MruO2z"
    }
  ]
}

Issues Resolved

Fixes #345

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@joshpalis joshpalis added the backport 2.x backport PRs to 2.x branch label Jan 2, 2024
Copy link

codecov bot commented Jan 2, 2024

Codecov Report

Attention: 31 lines in your changes are missing coverage. Please review.

Comparison is base (a8e7ff5) 72.42% compared to head (46f15dc) 72.27%.

❗ Current head 46f15dc differs from pull request most recent head 669db42. Consider uploading reports for the commit 669db42 to get more accurate results

Files Patch % Lines
...flowframework/workflow/RegisterLocalModelStep.java 50.00% 17 Missing and 2 partials ⚠️
.../transport/DeprovisionWorkflowTransportAction.java 73.07% 4 Missing and 3 partials ⚠️
...mework/workflow/AbstractRetryableWorkflowStep.java 28.57% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #350      +/-   ##
============================================
- Coverage     72.42%   72.27%   -0.16%     
- Complexity      571      572       +1     
============================================
  Files            72       73       +1     
  Lines          2988     2997       +9     
  Branches        226      230       +4     
============================================
+ Hits           2164     2166       +2     
- Misses          721      727       +6     
- Partials        103      104       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@dbwiddis dbwiddis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure we keep the "retry taskId until we get a resource ID" and "update resources in index" functionality in appropriate classes.

Signed-off-by: Joshua Palis <[email protected]>
@joshpalis
Copy link
Member Author

Some flaky test failures for local model registration, curious if deploying multiple local models is the root cause of this. I'll check if undeploying the local model before proceeding with additional tests would resolve this:

 Caused by: org.opensearch.flowframework.exception.FlowFrameworkException: Deploy model failed with error : {"XNnALvoTRJO3CUYOHLYvUA":"Memory Circuit Breaker is open, please check your resources!"}
»  	at org.opensearch.flowframework.workflow.AbstractRetryableWorkflowStep.lambda$retryableGetMlTask$5(AbstractRetryableWorkflowStep.java:161) ~[?:?]
»  	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
»  	at org.opensearch.ml.client.MachineLearningNodeClient.lambda$getMLTaskResponseActionListener$9(MachineLearningNodeClient.java:302) ~[?:?]
»  	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
»  	at org.opensearch.ml.client.MachineLearningNodeClient.lambda$wrapActionListener$18(MachineLearningNodeClient.java:374) ~[?:?]
»  	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
»  	at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
»  	at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
»  	at org.opensearch.ml.action.tasks.GetTaskTransportAction.lambda$doExecute$0(GetTaskTransportAction.java:67) ~[?:?]

…e, ascertaining deprovision sequence from created resources

Signed-off-by: Joshua Palis <[email protected]>
@joshpalis
Copy link
Member Author

Still observing flaky test failures, to mitigate this I have set the native memory threshold from 90 to 100 to prevent the circuit breaker from opening (Documentation), however the logs still show that the breaker is open. Will continue to look into this :

ERROR][o.o.f.t.ProvisionWorkflowTransportAction] [integTest-0] Provisioning failed for workflow: LK-n0YwBmExBUdkP7lAh
�  java.util.concurrent.CompletionException: java.util.concurrent.ExecutionException: org.opensearch.flowframework.exception.FlowFrameworkException: Memory Circuit Breaker is open, please check your resources!
�  	at java.base/java.util.concurrent.CompletableFuture.reportJoin(CompletableFuture.java:413) ~[?:?]
�  	at java.base/java.util.concurrent.CompletableFuture.join(CompletableFuture.java:2118) ~[?:?]
�  	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) ~[?:?]

Copy link
Member

@dbwiddis dbwiddis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one strong suggestion for improvement.

Copy link
Member

@owaiskazi19 owaiskazi19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good overall with minor comments.

…ry setting instead of native memory heap setting

Signed-off-by: Joshua Palis <[email protected]>
…ith deployed flag, testing remote model registration with deploy step

Signed-off-by: Joshua Palis <[email protected]>
Copy link
Member

@dbwiddis dbwiddis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Approving with a comment that I'm not sure an id-suffix is needed any more. If it is, it's fine (or make it a constant or auto-generated).

Signed-off-by: Joshua Palis <[email protected]>
@owaiskazi19
Copy link
Member

Did the CI miss 2 checks? We have 21 in total

@joshpalis joshpalis merged commit 760177a into opensearch-project:main Jan 5, 2024
19 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/flow-framework/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/flow-framework/backport-2.x
# Create a new branch
git switch --create backport/backport-350-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 760177a82a329312ed24385a82bd4a8b21f3bb41
# Push it to GitHub
git push --set-upstream origin backport/backport-350-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/flow-framework/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-350-to-2.x.

joshpalis added a commit to joshpalis/opensearch-ai-flow-framework that referenced this pull request Jan 5, 2024
…tegration tests (opensearch-project#350)

* Fixing local model integration test

Signed-off-by: Joshua Palis <[email protected]>

* Added deploy model flag support for local model registration, added associated integration test

Signed-off-by: Joshua Palis <[email protected]>

* Fixing comment

Signed-off-by: Joshua Palis <[email protected]>

* Fixing deprovision workflow transport action, removing use of template, ascertaining deprovision sequence from created resources

Signed-off-by: Joshua Palis <[email protected]>

* Removing rest status checks for deprovision API tests

Signed-off-by: Joshua Palis <[email protected]>

* Increasing wait time for deprovision status

Signed-off-by: Joshua Palis <[email protected]>

* Removing sdeprovision status checks for model deployment tests

Signed-off-by: Joshua Palis <[email protected]>

* increasing timeout for local model registration test template

Signed-off-by: Joshua Palis <[email protected]>

* Reverting timeout increase, setting ML Commons native memory threshold to 100 to avoid opening circuit breaker

Signed-off-by: Joshua Palis <[email protected]>

* Passing an action listener to retryableGetMlTask

Signed-off-by: Joshua Palis <[email protected]>

* Addressing PR comments, preserving order of resource map

Signed-off-by: Joshua Palis <[email protected]>

* Testing if a wait time after deprovisioning will mitigate circuit breaker issues

Signed-off-by: Joshua Palis <[email protected]>

* Increasing mlconfig index creation wait time

Signed-off-by: Joshua Palis <[email protected]>

* Combining local model registration tests into one

Signed-off-by: Joshua Palis <[email protected]>

* removing resource map from deprovision workflow transport action

Signed-off-by: Joshua Palis <[email protected]>

* Fixing getResourceFromDeprovisionNOde and tests

Signed-off-by: Joshua Palis <[email protected]>

* Separating out local model registration tests, using ml jvm heap memory setting instead of native memory heap setting

Signed-off-by: Joshua Palis <[email protected]>

* Testing : removing second local model registration test

Signed-off-by: Joshua Palis <[email protected]>

* Reducing model registration tests, testing local model registration with deployed flag, testing remote model registration with deploy step

Signed-off-by: Joshua Palis <[email protected]>

* Removing suffix from simulated deploy model step

Signed-off-by: Joshua Palis <[email protected]>

---------

Signed-off-by: Joshua Palis <[email protected]>
joshpalis added a commit that referenced this pull request Jan 5, 2024
Adds deploy model flag support for local model registration, fixes integration tests (#350)

* Fixing local model integration test



* Added deploy model flag support for local model registration, added associated integration test



* Fixing comment



* Fixing deprovision workflow transport action, removing use of template, ascertaining deprovision sequence from created resources



* Removing rest status checks for deprovision API tests



* Increasing wait time for deprovision status



* Removing sdeprovision status checks for model deployment tests



* increasing timeout for local model registration test template



* Reverting timeout increase, setting ML Commons native memory threshold to 100 to avoid opening circuit breaker



* Passing an action listener to retryableGetMlTask



* Addressing PR comments, preserving order of resource map



* Testing if a wait time after deprovisioning will mitigate circuit breaker issues



* Increasing mlconfig index creation wait time



* Combining local model registration tests into one



* removing resource map from deprovision workflow transport action



* Fixing getResourceFromDeprovisionNOde and tests



* Separating out local model registration tests, using ml jvm heap memory setting instead of native memory heap setting



* Testing : removing second local model registration test



* Reducing model registration tests, testing local model registration with deployed flag, testing remote model registration with deploy step



* Removing suffix from simulated deploy model step



---------

Signed-off-by: Joshua Palis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x backport PRs to 2.x branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Complete Register Local Model Implementation
3 participants