Commit

Improves trained model autoscaling docs.
szabosteve committed Oct 17, 2024
1 parent a52fc2a commit 6232894
Showing 2 changed files with 13 additions and 10 deletions.
15 changes: 9 additions & 6 deletions docs/en/stack/ml/nlp/ml-nlp-autoscaling.asciidoc
@@ -2,14 +2,14 @@
= Trained model autoscaling

You can enable autoscaling for each of your trained model deployments.
-Autoscaling allows {es} to automatically adjust the resources the deployment can use based on the workload demand.
+Autoscaling allows {es} to automatically adjust the resources the model deployment can use based on the workload demand.

There are two ways to enable autoscaling:

* through APIs by enabling adaptive allocations
* in {kib} by enabling adaptive resources

-IMPORTANT: To fully leverage model autoscaling, it is highly recommended to enable {cloud}/ec-autoscaling.html[deployment autoscaling].
+IMPORTANT: To fully leverage model autoscaling, it is highly recommended to enable {cloud}/ec-autoscaling.html[{es} deployment autoscaling].


[discrete]
@@ -25,6 +25,7 @@ This can help you to manage performance and cost more easily.
When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.
When the load is high, a new model allocation is automatically created.
When the load is low, a model allocation is automatically removed.
+You must explicitly set the minimum and maximum number of allocations; autoscaling will occur within these limits.

You can enable adaptive allocations by using:

@@ -35,7 +36,7 @@ If the new allocations fit on the current {ml} nodes, they are immediately start
If creating new model allocations requires more resource capacity and {ml} autoscaling is enabled, your {ml} node is scaled up to provide enough resources for the new allocation.
The number of model allocations can be scaled down to 0.
They cannot be scaled up to more than 32 allocations unless you explicitly set the maximum number of allocations to a higher value.
-Adaptive allocations must be set up independently for each deployment and {infer} endpoint.
+Adaptive allocations must be set up independently for each deployment and {ref}/put-inference-api.html[{infer} endpoint].
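
As a rough illustration of the limits described in this hunk, the following Python sketch builds an adaptive allocations settings object with explicit bounds and clamps a desired allocation count to that range. The key names `enabled`, `min_number_of_allocations`, and `max_number_of_allocations` are assumed here, and both helper functions are hypothetical; check the API reference for your {es} version before relying on them.

```python
def adaptive_allocations_settings(min_allocations, max_allocations):
    """Build a settings object with explicit lower and upper bounds.

    The key names are assumptions made for illustration.
    """
    if min_allocations < 0:
        raise ValueError("min_allocations must be 0 or greater")
    if max_allocations < min_allocations:
        raise ValueError("max_allocations must be >= min_allocations")
    return {
        "enabled": True,
        "min_number_of_allocations": min_allocations,
        "max_number_of_allocations": max_allocations,
    }


def clamp_allocations(desired, settings):
    """Return the allocation count that stays within the configured limits."""
    low = settings["min_number_of_allocations"]
    high = settings["max_number_of_allocations"]
    return max(low, min(desired, high))
```

With a minimum of `0` and a maximum of `4`, a load that would call for ten allocations is capped at four, and an idle deployment can drop to zero.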


[discrete]
@@ -62,7 +63,8 @@ When adaptive resources are enabled, the number of vCPUs that the model deployme
When the load is high, the number of vCPUs that the process can use is automatically increased.
When the load is low, the number of vCPUs that the process can use is automatically decreased.

-You can choose from three levels of resource usage for your trained model deployment.
+You can choose from three levels of resource usage for your trained model deployment; autoscaling will occur within the selected level's range.
+
Refer to the tables in the <<auto-scaling-matrix>> section to find out the settings for the level you selected.


@@ -78,13 +80,14 @@ The used resources for trained model deployments depend on three factors:

* your cluster environment (Serverless, Cloud, or on-premises)
* the use case you optimize the model deployment for (ingest or search)
-* whether adaptive resources are enabled or disabled (dynamic or static resources)
+* whether model autoscaling is enabled with adaptive allocations/resources to have dynamic resources, or disabled for static resources

If you use {es} on-premises, vCPUs level ranges are derived from the `total_ml_processors` and `max_single_ml_node_processors` values.
Use the {ref}/get-ml-info.html[get {ml} info API] to check these values.
The following tables show you the number of allocations, threads, and vCPUs available in Cloud when adaptive resources are enabled or disabled.
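
A minimal sketch of reading the two on-premises values from a get {ml} info API response that has already been parsed into a Python dictionary. It assumes both values sit under a top-level `limits` object; the sample response and the helper name `ml_processor_limits` are made up for illustration.

```python
def ml_processor_limits(ml_info):
    """Extract the processor limits that on-premises vCPU ranges derive from.

    Assumes both values live under a top-level "limits" object.
    Missing values are returned as None.
    """
    limits = ml_info.get("limits", {})
    return {
        "total_ml_processors": limits.get("total_ml_processors"),
        "max_single_ml_node_processors": limits.get("max_single_ml_node_processors"),
    }


# Hypothetical, trimmed-down response body.
sample_response = {
    "limits": {
        "total_ml_processors": 16,
        "max_single_ml_node_processors": 8,
    }
}

print(ml_processor_limits(sample_response))
```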

-NOTE: For Observability and Security projects on Serverless, adaptive allocations are automatically enabled, and the "Adaptive resources" control is not displayed in {kib}.
+NOTE: On Serverless, adaptive allocations are automatically enabled for all project types.
+However, the "Adaptive resources" control is not displayed in {kib} for Observability and Security projects.


[discrete]
8 changes: 4 additions & 4 deletions docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
@@ -459,10 +459,10 @@ To gain the biggest value out of ELSER trained models, consider to follow this l
* Setting `min_allocations` to `0` can save on costs for non-critical use cases or testing environments.
* Enabling <<ml-nlp-auto-scale,autoscaling>> through adaptive allocations or adaptive resources makes it possible for {es} to scale up or down the available resources of your ELSER deployment based on the load on the process.

-* Use two ELSER {infer} endpoints: one optimized for ingest and one optimized for search.
-** In {kib}, you can select for which case you want to optimize your ELSER deployment.
-** If you use the {infer} API and want to optimize your ELSER endpoint for ingest, set the number of threads to `1` (`"num_threads": 1`).
-** If you use the {infer} API and want to optimize your ELSER endpoint for search, set the number of threads to greater than `1`.
+* Use dedicated, optimized ELSER {infer} endpoints for ingest and search use cases.
+** When deploying a trained model in {kib}, you can select for which case you want to optimize your ELSER deployment.
+** If you use the trained model or {infer} APIs and want to optimize your ELSER trained model deployment or {infer} endpoint for ingest, set the number of threads to `1` (`"num_threads": 1`).
+** If you use the trained model or {infer} APIs and want to optimize your ELSER trained model deployment or {infer} endpoint for search, set the number of threads to greater than `1`.
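
The thread recommendation in the list above can be sketched as a small Python helper. Only the `num_threads` key and the "1 for ingest, greater than 1 for search" rule come from the text; the function name and the default of `2` search threads are assumptions for illustration.

```python
def elser_thread_settings(use_case, search_threads=2):
    """Return thread settings for an ELSER deployment or endpoint.

    Ingest uses exactly one thread; search uses more than one.
    The default of 2 search threads is an arbitrary illustrative value.
    """
    if use_case == "ingest":
        return {"num_threads": 1}
    if use_case == "search":
        if search_threads <= 1:
            raise ValueError("search_threads must be greater than 1")
        return {"num_threads": search_threads}
    raise ValueError("use_case must be 'ingest' or 'search'")
```

Keeping the two use cases behind one helper makes it harder to create a search endpoint with a single thread by accident.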


[discrete]
