[DOCS] Adds adaptive_allocations to inference and trained model API d…
szabosteve authored Aug 1, 2024
1 parent 5b88264 commit d6c5321
Showing 5 changed files with 225 additions and 23 deletions.
48 changes: 47 additions & 1 deletion docs/reference/inference/service-elasticsearch.asciidoc
@@ -51,6 +51,22 @@ include::inference-shared.asciidoc[tag=service-settings]
These settings are specific to the `elasticsearch` service.
--

`adaptive_allocations`:::
(Optional, object)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]

`enabled`::::
(Optional, Boolean)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]

`max_number_of_allocations`::::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]

`min_number_of_allocations`::::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]

`model_id`:::
(Required, string)
The name of the model to use for the {infer} task.
@@ -59,7 +75,9 @@ It can be the ID of either a built-in model (for example, `.multilingual-e5-smal

`num_allocations`:::
(Required, integer)
The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput.
The total number of allocations this model is assigned across machine learning nodes.
Increasing this value generally increases the throughput.
If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.

`num_threads`:::
(Required, integer)
@@ -137,3 +155,31 @@ PUT _inference/text_embedding/my-msmarco-minilm-model <1>
<1> Provide a unique identifier for the inference endpoint. The `inference_id` must be unique and must not match the `model_id`.
<2> The `model_id` must be the ID of a text embedding model which has already been
{ml-docs}/ml-nlp-import-model.html#ml-nlp-import-script[uploaded through Eland].

[discrete]
[[inference-example-adaptive-allocation]]
==== Setting adaptive allocations for E5 via the `elasticsearch` service

The following example shows how to create an {infer} endpoint called
`my-e5-model` to perform a `text_embedding` task type and configure adaptive
allocations.

The API request below will automatically download the E5 model if it isn't
already downloaded and then deploy the model.

[source,console]
------------------------------------------------------------
PUT _inference/text_embedding/my-e5-model
{
"service": "elasticsearch",
"service_settings": {
"adaptive_allocations": {
"enabled": true,
"min_number_of_allocations": 3,
"max_number_of_allocations": 10
},
"model_id": ".multilingual-e5-small"
}
}
------------------------------------------------------------
// TEST[skip:TBD]
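
The settings reference above says not to set `num_allocations` when `adaptive_allocations` is enabled. The following Python sketch shows one way a client could assemble the request body under that rule. The helper name `build_service_settings` and the choice to raise an error when both settings are supplied are assumptions made for illustration; they are not part of any Elasticsearch client.

```python
import json
from typing import Any, Dict, Optional


def build_service_settings(
    model_id: str,
    adaptive: Optional[Dict[str, Any]] = None,
    num_allocations: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble a service_settings object for the `elasticsearch` service."""
    settings: Dict[str, Any] = {"model_id": model_id}
    if adaptive and adaptive.get("enabled"):
        # Assumption: treat supplying both settings as a caller error,
        # because the allocation count is set automatically in this mode.
        if num_allocations is not None:
            raise ValueError(
                "do not set num_allocations when adaptive_allocations is enabled"
            )
        settings["adaptive_allocations"] = adaptive
    elif num_allocations is not None:
        # Manual scaling: pass the fixed allocation count through unchanged.
        settings["num_allocations"] = num_allocations
    return settings


# Mirrors the request body of the E5 example above.
body = {
    "service": "elasticsearch",
    "service_settings": build_service_settings(
        ".multilingual-e5-small",
        adaptive={
            "enabled": True,
            "min_number_of_allocations": 3,
            "max_number_of_allocations": 10,
        },
    ),
}
print(json.dumps(body, indent=2))
```

The resulting `body` can be sent as the payload of `PUT _inference/text_embedding/my-e5-model` with any HTTP client.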
47 changes: 46 additions & 1 deletion docs/reference/inference/service-elser.asciidoc
@@ -48,9 +48,27 @@ include::inference-shared.asciidoc[tag=service-settings]
These settings are specific to the `elser` service.
--

`adaptive_allocations`:::
(Optional, object)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]

`enabled`::::
(Optional, Boolean)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]

`max_number_of_allocations`::::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]

`min_number_of_allocations`::::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]

`num_allocations`:::
(Required, integer)
The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput.
The total number of allocations this model is assigned across machine learning nodes.
Increasing this value generally increases the throughput.
If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.

`num_threads`:::
(Required, integer)
@@ -107,3 +125,30 @@ This error usually just reflects a timeout, while the model downloads in the bac
You can check the download progress in the {ml-app} UI.
If using the Python client, you can set the `timeout` parameter to a higher value.
====

[discrete]
[[inference-example-elser-adaptive-allocation]]
==== Setting adaptive allocations for the ELSER service

The following example shows how to create an {infer} endpoint called
`my-elser-model` to perform a `sparse_embedding` task type and configure
adaptive allocations.

The request below will automatically download the ELSER model if it isn't
already downloaded and then deploy the model.

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-model
{
"service": "elser",
"service_settings": {
"adaptive_allocations": {
"enabled": true,
"min_number_of_allocations": 3,
"max_number_of_allocations": 10
}
}
}
------------------------------------------------------------
// TEST[skip:TBD]
24 changes: 24 additions & 0 deletions docs/reference/ml/ml-shared.asciidoc
@@ -1,3 +1,27 @@
tag::adaptive-allocation[]
Adaptive allocations configuration object.
If enabled, the number of model allocations is set automatically based on the current load on the process.
When the load is high, a new model allocation is automatically created (respecting the value of `max_number_of_allocations` if it's set).
When the load is low, a model allocation is automatically removed (respecting the value of `min_number_of_allocations` if it's set).
The number of model allocations cannot be scaled down to less than `1` this way.
If `adaptive_allocations` is enabled, do not set the number of allocations manually.
end::adaptive-allocation[]
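
The scaling rule described in the `adaptive-allocation` tag can be pictured as a clamped step. The following Python sketch is a toy model, not the actual Elasticsearch implementation: the function name `next_allocation_count`, the boolean `load_is_high` signal, and the one-allocation-at-a-time step are all assumptions made for illustration.

```python
from typing import Optional


def next_allocation_count(
    current: int,
    load_is_high: bool,
    min_allocations: Optional[int] = None,
    max_allocations: Optional[int] = None,
) -> int:
    """Return the allocation count after one hypothetical scaling step."""
    # Documented floor: never scale below 1, even when no minimum is set.
    floor = max(1, min_allocations or 1)
    # Assumption: step up by one under high load, down by one under low load.
    proposed = current + 1 if load_is_high else current - 1
    # Respect max_number_of_allocations only when it is set.
    if max_allocations is not None:
        proposed = min(proposed, max_allocations)
    return max(proposed, floor)
```

With `min_allocations=3` and `max_allocations=10`, a deployment already at 10 allocations stays at 10 under high load, and one at 3 stays at 3 under low load.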

tag::adaptive-allocation-enabled[]
If `true`, `adaptive_allocations` is enabled.
Defaults to `false`.
end::adaptive-allocation-enabled[]

tag::adaptive-allocation-max-number[]
Specifies the maximum number of allocations to scale to.
If set, it must be greater than or equal to `min_number_of_allocations`.
end::adaptive-allocation-max-number[]

tag::adaptive-allocation-min-number[]
Specifies the minimum number of allocations to scale to.
If set, it must be greater than or equal to `1`.
end::adaptive-allocation-min-number[]
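
The two constraints above (`min_number_of_allocations` at least `1`, and `max_number_of_allocations` at least `min_number_of_allocations`) can be checked on the client side before a request is sent. The following is a minimal Python sketch; the function name `validate_adaptive_allocations` and the wording of the messages are hypothetical, and the server performs its own validation regardless.

```python
from typing import Any, Dict, List


def validate_adaptive_allocations(settings: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an adaptive_allocations object."""
    problems: List[str] = []
    min_alloc = settings.get("min_number_of_allocations")
    max_alloc = settings.get("max_number_of_allocations")
    # If set, min_number_of_allocations must be greater than or equal to 1.
    if min_alloc is not None and min_alloc < 1:
        problems.append("min_number_of_allocations must be >= 1")
    # If both are set, max must be greater than or equal to min.
    if min_alloc is not None and max_alloc is not None and max_alloc < min_alloc:
        problems.append(
            "max_number_of_allocations must be >= min_number_of_allocations"
        )
    return problems
```

An empty list means the object satisfies both documented constraints; both fields are optional, so an object without them also passes.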

tag::aggregations[]
If set, the {dfeed} performs aggregation searches. Support for aggregations is
limited and should be used only with low cardinality data. For more information,
Original file line number Diff line number Diff line change
@@ -30,7 +30,10 @@ must be unique and should not match any other deployment ID or model ID, unless
it is the same as the ID of the model being deployed. If `deployment_id` is not
set, it defaults to the `model_id`.

Scaling inference performance can be achieved by setting the parameters
You can enable adaptive allocations to automatically scale model allocations up
and down based on the actual resource requirement of the processes.

Manually scaling inference performance can be achieved by setting the parameters
`number_of_allocations` and `threads_per_allocation`.

Increasing `threads_per_allocation` means more threads are used when an
@@ -58,22 +61,58 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=model-id]
[[start-trained-model-deployment-query-params]]
== {api-query-parms-title}

`deployment_id`::
(Optional, string)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
+
--
Defaults to `model_id`.
--

`timeout`::
(Optional, time)
Controls the amount of time to wait for the model to deploy. Defaults to 30
seconds.

`wait_for`::
(Optional, string)
Specifies the allocation status to wait for before returning. Defaults to
`started`. The value `starting` indicates deployment is starting but not yet on
any node. The value `started` indicates the model has started on at least one
node. The value `fully_allocated` indicates the deployment has started on all
valid nodes.

[[start-trained-model-deployment-request-body]]
== {api-request-body-title}

`adaptive_allocations`::
(Optional, object)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]

`enabled`:::
(Optional, Boolean)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]

`max_number_of_allocations`:::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]

`min_number_of_allocations`:::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]

`cache_size`::
(Optional, <<byte-units,byte value>>)
The inference cache size (in memory outside the JVM heap) per node for the
model. In serverless, the cache is disabled by default. Otherwise, the default value is the size of the model as reported by the
`model_size_bytes` field in the <<get-trained-models-stats>>. To disable the
cache, `0b` can be provided.

`deployment_id`::
(Optional, string)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
Defaults to `model_id`.

`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
Increasing this value generally increases the throughput. Defaults to 1.
Increasing this value generally increases the throughput. Defaults to `1`.
If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.

`priority`::
(Optional, string)
@@ -110,18 +149,6 @@ compute-bound process; `threads_per_allocations` must not exceed the number of
available allocated processors per node. Defaults to 1. Must be a power of 2.
Max allowed value is 32.

`timeout`::
(Optional, time)
Controls the amount of time to wait for the model to deploy. Defaults to 30
seconds.

`wait_for`::
(Optional, string)
Specifies the allocation status to wait for before returning. Defaults to
`started`. The value `starting` indicates deployment is starting but not yet on
any node. The value `started` indicates the model has started on at least one
node. The value `fully_allocated` indicates the deployment has started on all
valid nodes.

[[start-trained-model-deployment-example]]
== {api-examples-title}
@@ -182,3 +209,24 @@ The `my_model` trained model can be deployed again with a different ID:
POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search
--------------------------------------------------
// TEST[skip:TBD]


[[start-trained-model-deployment-adaptive-allocation-example]]
=== Setting adaptive allocations

The following example starts a new deployment of the `my_model` trained model
with the ID `my_model_for_search` and enables adaptive allocations with a
minimum of 3 allocations and a maximum of 10.

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search
{
"adaptive_allocations": {
"enabled": true,
"min_number_of_allocations": 3,
"max_number_of_allocations": 10
}
}
--------------------------------------------------
// TEST[skip:TBD]
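
In this API, `deployment_id`, `timeout`, and `wait_for` are passed as query parameters, while `adaptive_allocations` belongs in the request body. The following Python sketch shows how a client might build the request path; the helper name `start_deployment_path` is hypothetical and the sketch uses only the standard library.

```python
from urllib.parse import quote, urlencode


def start_deployment_path(model_id: str, **query: str) -> str:
    """Build the path and query string for the start deployment API."""
    # Percent-encode the model ID so it is safe as a single path segment.
    path = f"_ml/trained_models/{quote(model_id, safe='')}/deployment/_start"
    # Append query parameters such as deployment_id, timeout, or wait_for.
    return f"{path}?{urlencode(query)}" if query else path


path = start_deployment_path("my_model", deployment_id="my_model_for_search")
body = {
    "adaptive_allocations": {
        "enabled": True,
        "min_number_of_allocations": 3,
        "max_number_of_allocations": 10,
    }
}
```

Here `path` and `body` correspond to the `POST` request shown in the example above.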
Original file line number Diff line number Diff line change
@@ -25,7 +25,11 @@ Requires the `manage_ml` cluster privilege. This privilege is included in the
== {api-description-title}

You can update a trained model deployment whose `assignment_state` is `started`.
You can either increase or decrease the number of allocations of such a deployment.
You can enable adaptive allocations to automatically scale model allocations up
and down based on the actual resource requirement of the processes.
Alternatively, you can manually increase or decrease the number of allocations
of a model deployment.


[[update-trained-model-deployments-path-parms]]
== {api-path-parms-title}
@@ -37,17 +41,34 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
[[update-trained-model-deployment-request-body]]
== {api-request-body-title}

`adaptive_allocations`::
(Optional, object)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]

`enabled`:::
(Optional, Boolean)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]

`max_number_of_allocations`:::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]

`min_number_of_allocations`:::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]

`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
Increasing this value generally increases the throughput.
If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.


[[update-trained-model-deployment-example]]
== {api-examples-title}

The following example updates the deployment for an
`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to have 4 allocations:
`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to have 4 allocations:

[source,console]
--------------------------------------------------
@@ -84,3 +105,21 @@ The API returns the following results:
}
}
----

The following example updates the deployment for an
`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to
enable adaptive allocations with a minimum of 3 allocations and a maximum
of 10:

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_update
{
"adaptive_allocations": {
"enabled": true,
"min_number_of_allocations": 3,
"max_number_of_allocations": 10
}
}
--------------------------------------------------
// TEST[skip:TBD]
