inference scale

Defines scaling policies for an Inference Endpoint.

You can define a minimum and a maximum number of replicas, setting the lower and upper bounds on how many replicas can be created for an Inference Endpoint.

The lower bound can be set to 0, in which case the Inference Endpoint scales down to zero and does not run while it receives no requests.
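
For instance, assuming a hypothetical Inference Endpoint named chat-assistant, the following lets it sit at zero replicas when idle while allowing it to scale up to 4:

Terminal window
flexai inference scale chat-assistant --min-replicas 0 --max-replicas 4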

Note: --max-replicas and --min-replicas can be set to the same value (as long as it is greater than 0), which effectively disables auto-scaling and pins the number of replicas at that value.
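
For example, to pin a hypothetical Inference Endpoint named embeddings-svc at exactly 3 replicas:

Terminal window
flexai inference scale embeddings-svc --min-replicas 3 --max-replicas 3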

Terminal window
flexai inference scale <inference_endpoint_name> --min-replicas <min_replicas> --max-replicas <max_replicas>
--max-replicas
Integer

The maximum number of replicas for the Inference Endpoint.

Examples
  • 1
  • 2
  • 6
  • 8
--min-replicas
Integer

The minimum number of replicas for the Inference Endpoint.

Examples
  • 0
  • 1
  • 3
  • 8

Allow the mistral7b-i Inference Endpoint to scale down to 0 and up to 8 replicas:

Terminal window
flexai inference scale mistral7b-i --min-replicas 0 --max-replicas 8

Set the Qwen3-Coder-480B Inference Endpoint to have a minimum of 2 replicas and a maximum of 4 replicas:

Terminal window
flexai inference scale Qwen3-Coder-480B --min-replicas 2 --max-replicas 4

Make the DeepSeek-V3 Inference Endpoint run with exactly 6 replicas at all times:

Terminal window
flexai inference scale DeepSeek-V3 --min-replicas 6 --max-replicas 6