diff --git a/docs/reference/inference/service-elasticsearch.asciidoc b/docs/reference/inference/service-elasticsearch.asciidoc
index 6fb0b4a38d0ef..99fd41ee2db65 100644
--- a/docs/reference/inference/service-elasticsearch.asciidoc
+++ b/docs/reference/inference/service-elasticsearch.asciidoc
@@ -51,6 +51,22 @@ include::inference-shared.asciidoc[tag=service-settings]
 These settings are specific to the `elasticsearch` service.
 --
+`adaptive_allocations`:::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`::::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `model_id`:::
 (Required, string)
 The name of the model to use for the {infer} task.
@@ -59,7 +75,9 @@ It can be the ID of either a built-in model (for example, `.multilingual-e5-smal
 `num_allocations`:::
 (Required, integer)
-The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput.
+The total number of allocations this model is assigned across machine learning nodes.
+Increasing this value generally increases the throughput.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 
 `num_threads`:::
 (Required, integer)
@@ -137,3 +155,31 @@ PUT _inference/text_embedding/my-msmarco-minilm-model <1>
 <1> Provide an unique identifier for the inference endpoint. The `inference_id` must be unique and must not match the `model_id`.
 <2> The `model_id` must be the ID of a text embedding model which has already been {ml-docs}/ml-nlp-import-model.html#ml-nlp-import-script[uploaded through Eland].
+
+[discrete]
+[[inference-example-adaptive-allocation]]
+==== Setting adaptive allocations for E5 via the `elasticsearch` service
+
+The following example shows how to create an {infer} endpoint called
+`my-e5-model` to perform a `text_embedding` task type and configure adaptive
+allocations.
+
+The API request below will automatically download the E5 model if it isn't
+already downloaded and then deploy the model.
+
+[source,console]
+------------------------------------------------------------
+PUT _inference/text_embedding/my-e5-model
+{
+  "service": "elasticsearch",
+  "service_settings": {
+    "adaptive_allocations": {
+      "enabled": true,
+      "min_number_of_allocations": 3,
+      "max_number_of_allocations": 10
+    },
+    "model_id": ".multilingual-e5-small"
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
\ No newline at end of file
diff --git a/docs/reference/inference/service-elser.asciidoc b/docs/reference/inference/service-elser.asciidoc
index 34c0f7d0a9c53..fdce94901984b 100644
--- a/docs/reference/inference/service-elser.asciidoc
+++ b/docs/reference/inference/service-elser.asciidoc
@@ -48,9 +48,27 @@ include::inference-shared.asciidoc[tag=service-settings]
 These settings are specific to the `elser` service.
 --
+`adaptive_allocations`:::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`::::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `num_allocations`:::
 (Required, integer)
-The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput.
+The total number of allocations this model is assigned across machine learning nodes.
+Increasing this value generally increases the throughput.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 
 `num_threads`:::
 (Required, integer)
@@ -107,3 +125,30 @@ This error usually just reflects a timeout, while the model downloads in the bac
 You can check the download progress in the {ml-app} UI.
 If using the Python client, you can set the `timeout` parameter to a higher value.
 ====
+
+[discrete]
+[[inference-example-elser-adaptive-allocation]]
+==== Setting adaptive allocations for the ELSER service
+
+The following example shows how to create an {infer} endpoint called
+`my-elser-model` to perform a `sparse_embedding` task type and configure
+adaptive allocations.
+
+The request below will automatically download the ELSER model if it isn't
+already downloaded and then deploy the model.
+
+[source,console]
+------------------------------------------------------------
+PUT _inference/sparse_embedding/my-elser-model
+{
+  "service": "elser",
+  "service_settings": {
+    "adaptive_allocations": {
+      "enabled": true,
+      "min_number_of_allocations": 3,
+      "max_number_of_allocations": 10
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
\ No newline at end of file
diff --git a/docs/reference/ml/ml-shared.asciidoc b/docs/reference/ml/ml-shared.asciidoc
index a69fd2f1812e9..15a994115c88c 100644
--- a/docs/reference/ml/ml-shared.asciidoc
+++ b/docs/reference/ml/ml-shared.asciidoc
@@ -1,3 +1,27 @@
+tag::adaptive-allocation[]
+Adaptive allocations configuration object.
+If enabled, the number of allocations of the model is set based on the current load of the process.
+When the load is high, a new model allocation is automatically created (respecting the value of `max_number_of_allocations` if it's set).
+When the load is low, a model allocation is automatically removed (respecting the value of `min_number_of_allocations` if it's set).
+The number of model allocations cannot be scaled down to less than `1` this way.
+If `adaptive_allocations` is enabled, do not set the number of allocations manually.
+end::adaptive-allocation[]
+
+tag::adaptive-allocation-enabled[]
+If `true`, `adaptive_allocations` is enabled.
+Defaults to `false`.
+end::adaptive-allocation-enabled[]
+
+tag::adaptive-allocation-max-number[]
+Specifies the maximum number of allocations to scale to.
+If set, it must be greater than or equal to `min_number_of_allocations`.
+end::adaptive-allocation-max-number[]
+
+tag::adaptive-allocation-min-number[]
+Specifies the minimum number of allocations to scale to.
+If set, it must be greater than or equal to `1`.
+end::adaptive-allocation-min-number[]
+
 tag::aggregations[]
 If set, the {dfeed} performs aggregation searches. Support for aggregations is
 limited and should be used only with low cardinality data. For more information,
diff --git a/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc b/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
index f1b3fffb8a9a2..6f7e2a4d9f988 100644
--- a/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
+++ b/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
@@ -30,7 +30,10 @@ must be unique and should not match any other deployment ID or model ID, unless
 it is the same as the ID of the model being deployed.
 If `deployment_id` is not set, it defaults to the `model_id`.
-Scaling inference performance can be achieved by setting the parameters
+You can enable adaptive allocations to automatically scale model allocations up
+and down based on the actual resource requirements of the processes.
+
+Manually scaling inference performance can be achieved by setting the parameters
 `number_of_allocations` and `threads_per_allocation`.
 
 Increasing `threads_per_allocation` means more threads are used when an
@@ -58,6 +61,46 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=model-id]
 [[start-trained-model-deployment-query-params]]
 == {api-query-parms-title}
 
+`deployment_id`::
+(Optional, string)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
++
+--
+Defaults to `model_id`.
+--
+
+`timeout`::
+(Optional, time)
+Controls the amount of time to wait for the model to deploy. Defaults to 30
+seconds.
+
+`wait_for`::
+(Optional, string)
+Specifies the allocation status to wait for before returning. Defaults to
+`started`. The value `starting` indicates deployment is starting but not yet on
+any node. The value `started` indicates the model has started on at least one
+node. The value `fully_allocated` indicates the deployment has started on all
+valid nodes.
+
+[[start-trained-model-deployment-request-body]]
+== {api-request-body-title}
+
+`adaptive_allocations`::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`:::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `cache_size`::
 (Optional, <>)
 The inference cache size (in memory outside the JVM heap) per node for the
@@ -65,15 +108,11 @@ model. In serverless, the cache is disabled by default. Otherwise, the default v
 `model_size_bytes` field in the <>. To disable the cache,
 `0b` can be provided.
 
-`deployment_id`::
-(Optional, string)
-include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
-Defaults to `model_id`.
-
 `number_of_allocations`::
 (Optional, integer)
 The total number of allocations this model is assigned across {ml} nodes.
-Increasing this value generally increases the throughput. Defaults to 1.
+Increasing this value generally increases the throughput. Defaults to `1`.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 
 `priority`::
 (Optional, string)
@@ -110,18 +149,6 @@ compute-bound process; `threads_per_allocations` must not exceed the number of
 available allocated processors per node. Defaults to 1. Must be a power of 2.
 Max allowed value is 32.
 
-`timeout`::
-(Optional, time)
-Controls the amount of time to wait for the model to deploy. Defaults to 30
-seconds.
-
-`wait_for`::
-(Optional, string)
-Specifies the allocation status to wait for before returning. Defaults to
-`started`. The value `starting` indicates deployment is starting but not yet on
-any node. The value `started` indicates the model has started on at least one
-node. The value `fully_allocated` indicates the deployment has started on all
-valid nodes.
 
 [[start-trained-model-deployment-example]]
 == {api-examples-title}
@@ -182,3 +209,24 @@ The `my_model` trained model can be deployed again with a different ID:
 POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search
 --------------------------------------------------
 // TEST[skip:TBD]
+
+
+[[start-trained-model-deployment-adaptive-allocation-example]]
+=== Setting adaptive allocations
+
+The following example starts a new deployment of the `my_model` trained model
+with the ID `my_model_for_search` and enables adaptive allocations with a
+minimum of 3 and a maximum of 10 allocations.
+
+[source,console]
+--------------------------------------------------
+POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search
+{
+  "adaptive_allocations": {
+    "enabled": true,
+    "min_number_of_allocations": 3,
+    "max_number_of_allocations": 10
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
\ No newline at end of file
diff --git a/docs/reference/ml/trained-models/apis/update-trained-model-deployment.asciidoc b/docs/reference/ml/trained-models/apis/update-trained-model-deployment.asciidoc
index ea5508fac26dd..d49ee3c6e872c 100644
--- a/docs/reference/ml/trained-models/apis/update-trained-model-deployment.asciidoc
+++ b/docs/reference/ml/trained-models/apis/update-trained-model-deployment.asciidoc
@@ -25,7 +25,11 @@ Requires the `manage_ml` cluster privilege. This privilege is included in the
 == {api-description-title}
 
 You can update a trained model deployment whose `assignment_state` is `started`.
-You can either increase or decrease the number of allocations of such a deployment.
+You can enable adaptive allocations to automatically scale model allocations up
+and down based on the actual resource requirements of the processes.
+Alternatively, you can manually increase or decrease the number of allocations
+of a model deployment.
+
 [[update-trained-model-deployments-path-parms]]
 == {api-path-parms-title}
@@ -37,17 +41,34 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
 [[update-trained-model-deployment-request-body]]
 == {api-request-body-title}
 
+`adaptive_allocations`::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`:::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `number_of_allocations`::
 (Optional, integer)
 The total number of allocations this model is assigned across {ml} nodes.
 Increasing this value generally increases the throughput.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 [[update-trained-model-deployment-example]]
 == {api-examples-title}
 
 The following example updates the deployment for a
- `elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to have 4 allocations:
+`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to have 4 allocations:
 
 [source,console]
 --------------------------------------------------
@@ -84,3 +105,21 @@ The API returns the following results:
   }
 }
 ----
+
+The following example updates the deployment for an
+`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to
+enable adaptive allocations with a minimum of 3 and a
+maximum of 10 allocations:
+
+[source,console]
+--------------------------------------------------
+POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_update
+{
+  "adaptive_allocations": {
+    "enabled": true,
+    "min_number_of_allocations": 3,
+    "max_number_of_allocations": 10
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
\ No newline at end of file
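Reviewer note: the scaling rule that the new `adaptive-allocation` shared tag describes (add an allocation under high load up to `max_number_of_allocations`, remove one under low load down to `min_number_of_allocations`, never below one) can be sketched as a small decision function. This is an illustrative sketch only, not Elasticsearch source code; the function and parameter names are hypothetical.

```python
# Illustrative sketch of the adaptive_allocations rule documented above.
# NOT Elasticsearch's implementation; names here are hypothetical.
from typing import Optional


def next_allocation_count(
    current: int,
    load_is_high: bool,
    load_is_low: bool,
    min_allocations: Optional[int] = None,
    max_allocations: Optional[int] = None,
) -> int:
    """Return the allocation count after one adaptive scaling decision."""
    # Scale up under high load, down under low load, otherwise stay put.
    target = current + 1 if load_is_high else current - 1 if load_is_low else current
    if max_allocations is not None:
        target = min(target, max_allocations)  # honor the upper bound if set
    if min_allocations is not None:
        target = max(target, min_allocations)  # honor the lower bound if set
    return max(target, 1)  # allocations are never scaled below 1
```

With the bounds used in the examples (min 3, max 10), a deployment already at 10 allocations stays at 10 under high load, and one at 3 stays at 3 under low load.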