[8.17] [ML] Trained Model: Fix start deployment with ML autoscaling and 0 active nodes (#201256) (#201747)

# Backport

This will backport the following commits from `main` to `8.17`:
- [[ML] Trained Model: Fix start deployment with ML autoscaling and 0
active nodes (#201256)](#201256)

<!--- Backport version: 9.4.3 -->

### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

## Summary

During my testing, I used the current user with all required privileges but failed to notice that, after switching to the internal `kibana_system` user, it lacked the `manage_autoscaling` privilege required for the `GET /_autoscaling/policy` API.

As a result, the `isMlAutoscalingEnabled` flag, which we rely on in the Start Deployment modal, was always set to false. This caused a bug in scenarios with zero active ML nodes, where falling back to deriving available processors from ML limits was not possible.

You can check the created deployment; it correctly identifies ML autoscaling:

<img width="670" alt="image" src="https://github.com/user-attachments/assets/ff1f835e-2b90-4b73-bea8-a49da8846fbd">

Also fixes restoring vCPU levels from the API deployment params.

### Checklist

Check that the PR satisfies the following conditions.

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: Dima Arnautov <[email protected]>
kibanamachine and darnautov authored Nov 26, 2024
1 parent 62c1d11 commit 5f5667a
Showing 4 changed files with 315 additions and 11 deletions.
@@ -627,6 +627,297 @@ describe('DeploymentParamsMapper', () => {
},
});
});

describe('mapApiToUiDeploymentParams', () => {
it('should map API params to UI correctly', () => {
// Optimized for search
expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 16,
number_of_allocations: 2,
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: false,
vCPUUsage: 'medium',
});

// Lower value
expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 16,
number_of_allocations: 1,
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: false,
vCPUUsage: 'medium',
});

expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 8,
number_of_allocations: 2,
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: false,
vCPUUsage: 'medium',
});

expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 2,
number_of_allocations: 1,
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: false,
vCPUUsage: 'low',
});

// Exact match
expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 16,
number_of_allocations: 8,
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: false,
vCPUUsage: 'high',
});

// Higher value
expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 16,
number_of_allocations: 12,
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: false,
vCPUUsage: 'high',
});

// Lower value
expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 16,
number_of_allocations: 5,
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: false,
vCPUUsage: 'high',
});

expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 16,
number_of_allocations: 6,
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: false,
vCPUUsage: 'high',
});

// Optimized for ingest
expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 1,
number_of_allocations: 1,
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForIngest',
adaptiveResources: false,
vCPUUsage: 'low',
});

expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 1,
number_of_allocations: 2,
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForIngest',
adaptiveResources: false,
vCPUUsage: 'low',
});

expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 1,
number_of_allocations: 6,
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForIngest',
adaptiveResources: false,
vCPUUsage: 'medium',
});
});

it('should map API params to UI correctly with adaptive resources', () => {
expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 8,
adaptive_allocations: {
enabled: true,
min_number_of_allocations: 2,
max_number_of_allocations: 2,
},
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: true,
vCPUUsage: 'medium',
});

expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 2,
adaptive_allocations: {
enabled: true,
min_number_of_allocations: 2,
max_number_of_allocations: 2,
},
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: true,
vCPUUsage: 'medium',
});

expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 1,
adaptive_allocations: {
enabled: true,
min_number_of_allocations: 1,
max_number_of_allocations: 1,
},
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForIngest',
adaptiveResources: true,
vCPUUsage: 'low',
});

expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 2,
adaptive_allocations: {
enabled: true,
min_number_of_allocations: 0,
max_number_of_allocations: 1,
},
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: true,
vCPUUsage: 'low',
});

expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 1,
adaptive_allocations: {
enabled: true,
min_number_of_allocations: 0,
max_number_of_allocations: 64,
},
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForIngest',
adaptiveResources: true,
vCPUUsage: 'high',
});

expect(
mapper.mapApiToUiDeploymentParams({
model_id: modelId,
deployment_id: 'test-deployment',
priority: 'normal',
threads_per_allocation: 16,
adaptive_allocations: {
enabled: true,
min_number_of_allocations: 0,
max_number_of_allocations: 12,
},
} as unknown as MlTrainedModelAssignmentTaskParametersAdaptive)
).toEqual({
deploymentId: 'test-deployment',
optimized: 'optimizedForSearch',
adaptiveResources: true,
vCPUUsage: 'high',
});
});
});
});
});
});
@@ -25,7 +25,7 @@ type VCPUBreakpoints = Record<
     max: number;
     /**
      * Static value is used for the number of vCPUs when the adaptive resources are disabled.
-     * Not allowed in certain environments.
+     * Not allowed in certain environments, Obs and Security serverless projects.
      */
     static?: number;
   }
@@ -89,6 +89,7 @@ export class DeploymentParamsMapper {
   ) {
     /**
      * Initial value can be different for serverless and ESS with autoscaling.
+     * Also not available with 0 ML active nodes.
      */
     const maxSingleMlNodeProcessors = this.mlServerLimits.max_single_ml_node_processors;
 
@@ -236,18 +237,25 @@ export class DeploymentParamsMapper {
       ? input.adaptive_allocations!.max_number_of_allocations!
       : input.number_of_allocations);
 
+    // The deployment can be created via API with a number of allocations that do not exactly match our vCPU ranges.
+    // In this case, we should find the closest vCPU range that does not exceed the max or static value of the range.
     const [vCPUUsage] = Object.entries(this.vCpuBreakpoints)
       .reverse()
-      .find(([key, val]) => vCPUs >= val.min) as [
-      DeploymentParamsUI['vCPUUsage'],
-      { min: number; max: number }
-    ];
+      .filter(([, range]) => vCPUs <= (adaptiveResources ? range.max : range.static!))
+      .reduce(
+        (prev, curr) => {
+          const prevValue = adaptiveResources ? prev[1].max : prev[1].static!;
+          const currValue = adaptiveResources ? curr[1].max : curr[1].static!;
+          return Math.abs(vCPUs - prevValue) <= Math.abs(vCPUs - currValue) ? prev : curr;
+        },
+        // in case allocation params exceed the max value of the high range
+        ['high', this.vCpuBreakpoints.high]
+      );
 
     return {
       deploymentId: input.deployment_id,
       optimized,
       adaptiveResources,
-      vCPUUsage,
+      vCPUUsage: vCPUUsage as DeploymentParamsUI['vCPUUsage'],
     };
   }
 }
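The closest-range selection introduced in the hunk above can be sketched in isolation. The breakpoint values below are made up for illustration (the real `vCpuBreakpoints` values live elsewhere in `DeploymentParamsMapper`), but the filter/reduce logic mirrors the diff: keep only ranges whose cap can hold the requested vCPUs, then pick the range whose cap is closest, falling back to `high` when the request exceeds every range.

```typescript
type VCpuRange = { min: number; max: number; static?: number };

// Hypothetical breakpoint values, for illustration only.
const vCpuBreakpoints: Record<'low' | 'medium' | 'high', VCpuRange> = {
  low: { min: 0, max: 2, static: 2 },
  medium: { min: 3, max: 16, static: 16 },
  high: { min: 17, max: 128, static: 32 },
};

function closestVCpuUsage(vCPUs: number, adaptiveResources: boolean): string {
  const [vCPUUsage] = Object.entries(vCpuBreakpoints)
    .reverse()
    // Keep only ranges whose cap (max, or static when adaptive resources are off)
    // can accommodate the requested number of vCPUs.
    .filter(([, range]) => vCPUs <= (adaptiveResources ? range.max : range.static!))
    // Of the remaining ranges, pick the one whose cap is closest to the request.
    .reduce(
      (prev, curr) => {
        const prevValue = adaptiveResources ? prev[1].max : prev[1].static!;
        const currValue = adaptiveResources ? curr[1].max : curr[1].static!;
        return Math.abs(vCPUs - prevValue) <= Math.abs(vCPUs - currValue) ? prev : curr;
      },
      // Fall back to "high" when the request exceeds every range.
      ['high', vCpuBreakpoints.high] as [string, VCpuRange]
    );
  return vCPUUsage;
}

console.log(closestVCpuUsage(4, true)); // "medium"
console.log(closestVCpuUsage(200, true)); // "high" (reduce falls through to the seed)
```

Note how this differs from the removed `.find(([key, val]) => vCPUs >= val.min)`: the old code returned the first range whose minimum was satisfied, which mis-bucketed API-created deployments whose allocation counts sat between breakpoints.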
2 changes: 1 addition & 1 deletion x-pack/plugins/ml/server/lib/node_utils.ts
@@ -33,7 +33,7 @@ export async function getMlNodeCount(client: IScopedClusterClient): Promise<MlNo
   return { count, lazyNodeCount };
 }
 
-export async function getLazyMlNodeCount(client: IScopedClusterClient) {
+export async function getLazyMlNodeCount(client: IScopedClusterClient): Promise<number> {
   const body = await client.asInternalUser.cluster.getSettings(
     {
       include_defaults: true,
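For context, a standalone sketch of what `getLazyMlNodeCount` derives from the cluster settings response. The diff only shows the `cluster.getSettings({ include_defaults: true })` call; the settings key and the flattened response shape below are assumptions for illustration, not the helper's actual implementation.

```typescript
// Simplified shape of a cluster settings response with flat keys (assumed).
interface ClusterSettingsLike {
  persistent?: Record<string, string>;
  defaults?: Record<string, string>;
}

// A lazy ML node is capacity the orchestrator can add on demand; a non-zero
// maximum implies some form of ML autoscaling is available.
function countLazyMlNodes(settings: ClusterSettingsLike): number {
  const raw =
    settings.persistent?.['xpack.ml.max_lazy_ml_nodes'] ??
    settings.defaults?.['xpack.ml.max_lazy_ml_nodes'] ??
    '0';
  return Number.parseInt(raw, 10);
}
```

The explicit `: Promise<number>` return type added in the hunk above matters because the route code that now consumes this helper compares the result with `> 0`.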
11 changes: 8 additions & 3 deletions x-pack/plugins/ml/server/routes/system.ts
@@ -14,7 +14,7 @@ import { mlLog } from '../lib/log';
 import { capabilitiesProvider } from '../lib/capabilities';
 import { spacesUtilsProvider } from '../lib/spaces_utils';
 import type { RouteInitialization, SystemRouteDeps } from '../types';
-import { getMlNodeCount } from '../lib/node_utils';
+import { getLazyMlNodeCount, getMlNodeCount } from '../lib/node_utils';
 
 /**
  * System routes
@@ -187,10 +187,15 @@ export function systemRoutes(
 
       let isMlAutoscalingEnabled = false;
       try {
-        await client.asInternalUser.autoscaling.getAutoscalingPolicy({ name: 'ml' });
+        // kibana_system user does not have the manage_autoscaling cluster privilege.
+        // perform this check as a current user.
+        await client.asCurrentUser.autoscaling.getAutoscalingPolicy({ name: 'ml' });
         isMlAutoscalingEnabled = true;
       } catch (e) {
-        // If doesn't exist, then keep the false
+        // If ml autoscaling policy doesn't exist or the user does not have privileges to fetch it,
+        // check the number of lazy ml nodes to determine if autoscaling is enabled.
+        const lazyMlNodeCount = await getLazyMlNodeCount(client);
+        isMlAutoscalingEnabled = lazyMlNodeCount > 0;
       }
 
       return response.ok({
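The control flow of the fix can be sketched against a stubbed client. The real route uses the Elasticsearch JS client's `autoscaling.getAutoscalingPolicy` and the `getLazyMlNodeCount` helper from the diff above; the `StubClient` interface here is a stand-in so the flow is runnable in isolation.

```typescript
// Stand-in for the scoped ES client used by the route (assumed shape).
interface StubClient {
  getAutoscalingPolicy(name: string): Promise<unknown>;
  getLazyMlNodeCount(): Promise<number>;
}

async function resolveMlAutoscalingEnabled(client: StubClient): Promise<boolean> {
  try {
    // Run the policy check as the current user; the internal kibana_system
    // user lacks the manage_autoscaling privilege, so it would always throw.
    await client.getAutoscalingPolicy('ml');
    return true;
  } catch {
    // Policy missing, or the user cannot read it: infer autoscaling from the
    // lazy ML node capacity instead of silently reporting false.
    const lazyMlNodeCount = await client.getLazyMlNodeCount();
    return lazyMlNodeCount > 0;
  }
}

// A user without the manage_autoscaling privilege, on a cluster with lazy ML capacity:
const demo: StubClient = {
  getAutoscalingPolicy: async () => {
    throw new Error('security_exception');
  },
  getLazyMlNodeCount: async () => 2,
};
resolveMlAutoscalingEnabled(demo).then((enabled) => console.log(enabled)); // true
```

This is the behavioral change the PR summary describes: before the fix, the privilege error made `isMlAutoscalingEnabled` unconditionally false, so the Start Deployment modal could not fall back to deriving processors from ML limits when zero ML nodes were active.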
