
[ML] Trained Model: Fix start deployment with ML autoscaling and 0 active nodes #201256

Merged: 5 commits merged into elastic:main from ml-fix-autoscaling-check on Nov 26, 2024

Conversation

Contributor

@darnautov darnautov commented Nov 21, 2024

Summary

During my testing, I used the current user with all required privileges but failed to notice that, after switching to the internal `kibana_system` user, it lacked the `manage_autoscaling` privilege required for the `GET /_autoscaling/policy` API.

As a result, the `isMlAutoscalingEnabled` flag, which we rely on in the Start Deployment modal, was always set to `false`. This caused a bug in scenarios with zero active ML nodes, where falling back to deriving available processors from ML limits was not possible.
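For illustration, here is a minimal TypeScript sketch of how the flag could be derived once the privilege problem is accounted for: try the autoscaling policy as the current user and fall back to the lazy ML node count (the `getLazyMlNodeCount` helper referenced in the review below). The policy name, types, and signatures here are assumptions, not the exact PR code.

```ts
import type { ElasticsearchClient } from '@kbn/core/server';

// Hypothetical declaration for the helper referenced in this PR; the real signature may differ.
declare function getLazyMlNodeCount(client: ElasticsearchClient): Promise<number>;

export async function getIsMlAutoscalingEnabled(client: ElasticsearchClient): Promise<boolean> {
  try {
    // Requires the manage_autoscaling privilege, so this must run as the current user,
    // not as the internal kibana_system user. The 'ml' policy name is an assumption.
    const policy = await client.autoscaling.getAutoscalingPolicy({ name: 'ml' });
    if (policy) return true;
  } catch (e) {
    // Policy missing or insufficient privileges: fall through to the lazy node check.
  }
  // A non-zero lazy ML node count means ML nodes can still be added on demand.
  return (await getLazyMlNodeCount(client)) > 0;
}
```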

You can check the created deployment; it correctly identifies ML autoscaling:

(screenshot: deployment details showing ML autoscaling correctly detected)

Also fixes restoring vCPU levels from the API deployment params.
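As a rough illustration of that restore step, the vCPU level can be derived from the number of allocations and threads per allocation returned by the API. The thresholds below are placeholders, not the modal's actual boundaries.

```ts
type VCpuLevel = 'low' | 'medium' | 'high';

// Hypothetical reverse mapping from the API deployment params to the modal's vCPU level.
// Boundary values are placeholders for illustration only.
function restoreVCpuLevel(numberOfAllocations: number, threadsPerAllocation: number): VCpuLevel {
  const vCpus = numberOfAllocations * threadsPerAllocation;
  if (vCpus <= 2) return 'low'; // placeholder boundary
  if (vCpus <= 16) return 'medium'; // placeholder boundary
  return 'high';
}
```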

Checklist

Check that the PR satisfies the following conditions.

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

@darnautov darnautov self-assigned this Nov 21, 2024
@darnautov darnautov added the Team:ML, backport:version, and v8.17.0 labels Nov 21, 2024
@darnautov darnautov added the v8.18.0, v8.16.2, and ci:cloud-deploy labels Nov 21, 2024
@darnautov darnautov marked this pull request as ready for review November 21, 2024 17:49
@darnautov darnautov requested a review from a team as a code owner November 21, 2024 17:49
@elasticmachine
Contributor

Pinging @elastic/ml-ui (:ml)

Member

@jgowdyelastic jgowdyelastic left a comment


LGTM
I've added a comment about the possible redundancy of the getAutoscalingPolicy call.
If using the lazy node count is reliable, then I think we could just use that for setting isMlAutoscalingEnabled

// If it doesn't exist, then keep the false value
// If ml autoscaling policy doesn't exist or the user does not have privileges to fetch it,
// check the number of lazy ml nodes to determine if autoscaling is enabled.
const lazyMlNodeCount = await getLazyMlNodeCount(client);
Member


It seems like this check will always work, so we don't really need to attempt the getAutoscalingPolicy call as they'll produce the same results.

Contributor Author


There are cases where this check won't work, e.g.:

  • ML autoscaling may be disabled in the tier
  • We've hit the autoscaling limit
  • On-prem users can set the settings, but it won't scale

So checking the autoscaling policy as the current user is still worth trying.

Contributor

@peteharverson peteharverson left a comment


Testing against your cloud deployment, the check for auto-scaling looks good.

As discussed offline, when I update a deployment created at the low or medium vCPU level, it comes up saying it was 'high'.

(screenshot: update deployment modal showing the 'high' vCPU level)

@darnautov
Contributor Author

@peteharverson the vCPU levels issue is fixed in 37ee935

@darnautov
Contributor Author

@elasticmachine merge upstream

@darnautov
Contributor Author

@elasticmachine merge upstream

@elasticmachine
Contributor

elasticmachine commented Nov 25, 2024

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| ml | 4.7MB | 4.7MB | +154.0B |


cc @darnautov

@darnautov darnautov added the ci:cloud-redeploy label Nov 25, 2024
Contributor

@peteharverson peteharverson left a comment


Tested latest changes in your cloud instance and LGTM.
As discussed offline, we should aim to fix the lack of error messaging when a second deployment fails to start in a separate PR.

@darnautov darnautov merged commit 9827a07 into elastic:main Nov 26, 2024
24 checks passed
@kibanamachine
Contributor

Starting backport for target branches: 8.16, 8.17, 8.x

https://github.com/elastic/kibana/actions/runs/12028725291

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 26, 2024
…tive nodes (elastic#201256)

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 26, 2024
…tive nodes (elastic#201256)

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 26, 2024
…tive nodes (elastic#201256)

@kibanamachine
Contributor

💚 All backports created successfully

Branches: 8.16, 8.17, 8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Nov 26, 2024
…nd 0 active nodes (#201256) (#201746)

# Backport

This will backport the following commits from `main` to `8.16`:
- [[ML] Trained Model: Fix start deployment with ML autoscaling and 0
active nodes (#201256)](#201256)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Dima Arnautov <[email protected]>
kibanamachine added a commit that referenced this pull request Nov 26, 2024
…nd 0 active nodes (#201256) (#201747)

# Backport

This will backport the following commits from `main` to `8.17`:
- [[ML] Trained Model: Fix start deployment with ML autoscaling and 0
active nodes (#201256)](#201256)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Dima Arnautov <[email protected]>
kibanamachine added a commit that referenced this pull request Nov 26, 2024
…d 0 active nodes (#201256) (#201748)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[ML] Trained Model: Fix start deployment with ML autoscaling and 0
active nodes (#201256)](#201256)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Dima Arnautov <[email protected]>
paulinashakirova pushed a commit to paulinashakirova/kibana that referenced this pull request Nov 26, 2024
…tive nodes (elastic#201256)

@darnautov darnautov deleted the ml-fix-autoscaling-check branch December 6, 2024 08:38
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Dec 12, 2024
…tive nodes (elastic#201256)

Labels
backport:version, ci:cloud-deploy, ci:cloud-redeploy, Feature:3rd Party Models, :ml, release_note:fix, Team:ML, v8.16.2, v8.17.0, v8.18.0, v9.0.0
5 participants