diff --git a/docs/reference/inference/service-elasticsearch.asciidoc b/docs/reference/inference/service-elasticsearch.asciidoc
index 6fb0b4a38d0ef..99fd41ee2db65 100644
--- a/docs/reference/inference/service-elasticsearch.asciidoc
+++ b/docs/reference/inference/service-elasticsearch.asciidoc
@@ -51,6 +51,22 @@ include::inference-shared.asciidoc[tag=service-settings]
 These settings are specific to the `elasticsearch` service.
 --
+`adaptive_allocations`:::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`::::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `model_id`:::
 (Required, string)
 The name of the model to use for the {infer} task.
@@ -59,7 +75,9 @@ It can be the ID of either a built-in model (for example, `.multilingual-e5-smal
 `num_allocations`:::
 (Required, integer)
-The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput.
+The total number of allocations this model is assigned across machine learning nodes.
+Increasing this value generally increases the throughput.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 
 `num_threads`:::
 (Required, integer)
@@ -137,3 +155,31 @@ PUT _inference/text_embedding/my-msmarco-minilm-model <1>
 <1> Provide an unique identifier for the inference endpoint. The `inference_id` must be unique and must not match the `model_id`.
 <2> The `model_id` must be the ID of a text embedding model which has already been {ml-docs}/ml-nlp-import-model.html#ml-nlp-import-script[uploaded through Eland].
+
+[discrete]
+[[inference-example-adaptive-allocation]]
+==== Setting adaptive allocations for E5 via the `elasticsearch` service
+
+The following example shows how to create an {infer} endpoint called
+`my-e5-model` to perform a `text_embedding` task type and configure adaptive
+allocations.
+
+The API request below will automatically download the E5 model if it isn't
+already downloaded and then deploy the model.
+
+[source,console]
+------------------------------------------------------------
+PUT _inference/text_embedding/my-e5-model
+{
+  "service": "elasticsearch",
+  "service_settings": {
+    "adaptive_allocations": {
+      "enabled": true,
+      "min_number_of_allocations": 3,
+      "max_number_of_allocations": 10
+    },
+    "model_id": ".multilingual-e5-small"
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
\ No newline at end of file
diff --git a/docs/reference/inference/service-elser.asciidoc b/docs/reference/inference/service-elser.asciidoc
index 34c0f7d0a9c53..fdce94901984b 100644
--- a/docs/reference/inference/service-elser.asciidoc
+++ b/docs/reference/inference/service-elser.asciidoc
@@ -48,9 +48,27 @@ include::inference-shared.asciidoc[tag=service-settings]
 These settings are specific to the `elser` service.
 --
+`adaptive_allocations`:::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`::::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `num_allocations`:::
 (Required, integer)
-The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput.
+The total number of allocations this model is assigned across machine learning nodes.
+Increasing this value generally increases the throughput.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 
 `num_threads`:::
 (Required, integer)
@@ -107,3 +125,30 @@ This error usually just reflects a timeout, while the model downloads in the bac
 You can check the download progress in the {ml-app} UI.
 If using the Python client, you can set the `timeout` parameter to a higher value.
 ====
+
+[discrete]
+[[inference-example-elser-adaptive-allocation]]
+==== Setting adaptive allocations for the ELSER service
+
+The following example shows how to create an {infer} endpoint called
+`my-elser-model` to perform a `sparse_embedding` task type and configure
+adaptive allocations.
+
+The request below will automatically download the ELSER model if it isn't
+already downloaded and then deploy the model.
+
+[source,console]
+------------------------------------------------------------
+PUT _inference/sparse_embedding/my-elser-model
+{
+  "service": "elser",
+  "service_settings": {
+    "adaptive_allocations": {
+      "enabled": true,
+      "min_number_of_allocations": 3,
+      "max_number_of_allocations": 10
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
\ No newline at end of file
diff --git a/docs/reference/ml/ml-shared.asciidoc b/docs/reference/ml/ml-shared.asciidoc
index a69fd2f1812e9..15a994115c88c 100644
--- a/docs/reference/ml/ml-shared.asciidoc
+++ b/docs/reference/ml/ml-shared.asciidoc
@@ -1,3 +1,27 @@
+tag::adaptive-allocation[]
+Adaptive allocations configuration object.
+If enabled, the number of allocations of the model is set based on the current load of the process.
+When the load is high, a new model allocation is automatically created (respecting the value of `max_number_of_allocations` if it's set).
+When the load is low, a model allocation is automatically removed (respecting the value of `min_number_of_allocations` if it's set).
+The number of model allocations cannot be scaled down to less than `1` this way.
+If `adaptive_allocations` is enabled, do not set the number of allocations manually.
+end::adaptive-allocation[]
+
+tag::adaptive-allocation-enabled[]
+If `true`, `adaptive_allocations` is enabled.
+Defaults to `false`.
+end::adaptive-allocation-enabled[]
+
+tag::adaptive-allocation-max-number[]
+Specifies the maximum number of allocations to scale to.
+If set, it must be greater than or equal to `min_number_of_allocations`.
+end::adaptive-allocation-max-number[]
+
+tag::adaptive-allocation-min-number[]
+Specifies the minimum number of allocations to scale to.
+If set, it must be greater than or equal to `1`.
+end::adaptive-allocation-min-number[]
+
 tag::aggregations[]
 If set, the {dfeed} performs aggregation searches. Support for aggregations is
 limited and should be used only with low cardinality data. For more information,
diff --git a/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc b/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
index f1b3fffb8a9a2..6f7e2a4d9f988 100644
--- a/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
+++ b/docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc
@@ -30,7 +30,10 @@ must be unique and should not match any other deployment ID or model ID, unless
 it is the same as the ID of the model being deployed.
 If `deployment_id` is not set, it defaults to the `model_id`.
-Scaling inference performance can be achieved by setting the parameters
+You can enable adaptive allocations to automatically scale model allocations up
+and down based on the actual resource requirements of the processes.
+
+Manually scaling inference performance can be achieved by setting the parameters
 `number_of_allocations` and `threads_per_allocation`.
 
 Increasing `threads_per_allocation` means more threads are used when an
@@ -58,6 +61,46 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=model-id]
 [[start-trained-model-deployment-query-params]]
 == {api-query-parms-title}
 
+`deployment_id`::
+(Optional, string)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
++
+--
+Defaults to `model_id`.
+--
+
+`timeout`::
+(Optional, time)
+Controls the amount of time to wait for the model to deploy. Defaults to 30
+seconds.
+
+`wait_for`::
+(Optional, string)
+Specifies the allocation status to wait for before returning. Defaults to
+`started`. The value `starting` indicates deployment is starting but not yet on
+any node. The value `started` indicates the model has started on at least one
+node. The value `fully_allocated` indicates the deployment has started on all
+valid nodes.
+
+[[start-trained-model-deployment-request-body]]
+== {api-request-body-title}
+
+`adaptive_allocations`::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`:::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `cache_size`::
 (Optional, <>)
 The inference cache size (in memory outside the JVM heap) per node for the
@@ -65,15 +108,11 @@ model. In serverless, the cache is disabled by default. Otherwise, the default v
 `model_size_bytes` field in the <>. To disable the cache,
 `0b` can be provided.
 
-`deployment_id`::
-(Optional, string)
-include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
-Defaults to `model_id`.
-
 `number_of_allocations`::
 (Optional, integer)
 The total number of allocations this model is assigned across {ml} nodes.
-Increasing this value generally increases the throughput. Defaults to 1.
+Increasing this value generally increases the throughput. Defaults to `1`.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 
 `priority`::
 (Optional, string)
@@ -110,18 +149,6 @@ compute-bound process; `threads_per_allocations` must not exceed the number of
 available allocated processors per node. Defaults to 1. Must be a power of 2.
 Max allowed value is 32.
 
-`timeout`::
-(Optional, time)
-Controls the amount of time to wait for the model to deploy. Defaults to 30
-seconds.
-
-`wait_for`::
-(Optional, string)
-Specifies the allocation status to wait for before returning. Defaults to
-`started`. The value `starting` indicates deployment is starting but not yet on
-any node. The value `started` indicates the model has started on at least one
-node. The value `fully_allocated` indicates the deployment has started on all
-valid nodes.
 
 [[start-trained-model-deployment-example]]
 == {api-examples-title}
@@ -182,3 +209,24 @@ The `my_model` trained model can be deployed again with a different ID:
 POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search
 --------------------------------------------------
 // TEST[skip:TBD]
+
+
+[[start-trained-model-deployment-adaptive-allocation-example]]
+=== Setting adaptive allocations
+
+The following example starts a new deployment of the `my_model` trained model
+with the ID `my_model_for_search` and enables adaptive allocations with a
+minimum of 3 and a maximum of 10 allocations.
+
+[source,console]
+--------------------------------------------------
+POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search
+{
+  "adaptive_allocations": {
+    "enabled": true,
+    "min_number_of_allocations": 3,
+    "max_number_of_allocations": 10
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
\ No newline at end of file
diff --git a/docs/reference/ml/trained-models/apis/update-trained-model-deployment.asciidoc b/docs/reference/ml/trained-models/apis/update-trained-model-deployment.asciidoc
index ea5508fac26dd..d49ee3c6e872c 100644
--- a/docs/reference/ml/trained-models/apis/update-trained-model-deployment.asciidoc
+++ b/docs/reference/ml/trained-models/apis/update-trained-model-deployment.asciidoc
@@ -25,7 +25,11 @@ Requires the `manage_ml` cluster privilege. This privilege is included in the
 == {api-description-title}
 
 You can update a trained model deployment whose `assignment_state` is `started`.
-You can either increase or decrease the number of allocations of such a deployment.
+You can enable adaptive allocations to automatically scale model allocations up
+and down based on the actual resource requirements of the processes.
+Alternatively, you can manually increase or decrease the number of allocations
+of a model deployment.
+
 [[update-trained-model-deployments-path-parms]]
 == {api-path-parms-title}
@@ -37,17 +41,34 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
 [[update-trained-model-deployment-request-body]]
 == {api-request-body-title}
 
+`adaptive_allocations`::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`:::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `number_of_allocations`::
 (Optional, integer)
 The total number of allocations this model is assigned across {ml} nodes.
 Increasing this value generally increases the throughput.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 [[update-trained-model-deployment-example]]
 == {api-examples-title}
 
 The following example updates the deployment for a
- `elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to have 4 allocations:
+`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to have 4 allocations:
 
 [source,console]
 --------------------------------------------------
@@ -84,3 +105,21 @@ The API returns the following results:
   }
 }
 ----
+
+The following example updates the deployment for an
+`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to
+enable adaptive allocations with a minimum of 3 and a
+maximum of 10 allocations:
+
+[source,console]
+--------------------------------------------------
+POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_update
+{
+  "adaptive_allocations": {
+    "enabled": true,
+    "min_number_of_allocations": 3,
+    "max_number_of_allocations": 10
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
\ No newline at end of file
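Reviewer note: the scaling rule that the new `adaptive-allocation` shared tag describes (add an allocation under high load up to `max_number_of_allocations`, remove one under low load down to `min_number_of_allocations`, never below one) can be sketched as a small decision function. This is an illustrative sketch only, not Elasticsearch source code; the function and parameter names are hypothetical.

```python
# Illustrative sketch of the adaptive_allocations rule documented above.
# NOT Elasticsearch's implementation; names here are hypothetical.
from typing import Optional


def next_allocation_count(
    current: int,
    load_is_high: bool,
    load_is_low: bool,
    min_allocations: Optional[int] = None,
    max_allocations: Optional[int] = None,
) -> int:
    """Return the allocation count after one adaptive scaling decision."""
    # Scale up under high load, down under low load, otherwise stay put.
    target = current + 1 if load_is_high else current - 1 if load_is_low else current
    if max_allocations is not None:
        target = min(target, max_allocations)  # honor the upper bound if set
    if min_allocations is not None:
        target = max(target, min_allocations)  # honor the lower bound if set
    return max(target, 1)  # allocations are never scaled below 1
```

With the bounds used in the examples (min 3, max 10), a deployment already at 10 allocations stays at 10 under high load, and one at 3 stays at 3 under low load.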