Allows for the definition of scaling policies for an Inference Endpoint.
A minimum and maximum number of replicas can be defined, allowing you to set the lower and upper bounds for the number of replicas that can be created for an Inference Endpoint.
The lower bound can be set to 0, which means that the Inference Endpoint will not be running when there are no requests.
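For instance, a scale-to-zero policy that also caps an endpoint at 4 replicas could look like the following (the endpoint name my-endpoint is illustrative; the flags are the ones documented below):
flexai inference scale my-endpoint --min-replicas 0 --max-replicas 4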
Note: --max-replicas and --min-replicas can be set to the same value (as long as it is greater than 0), which effectively disables auto-scaling and keeps the number of replicas locked at that value.
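For example, to pin a hypothetical endpoint named my-endpoint at exactly 3 replicas:
flexai inference scale my-endpoint --min-replicas 3 --max-replicas 3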
flexai inference scale <inference_endpoint_name> --min-replicas <min_replicas> --max-replicas <max_replicas>
--max-replicas
The maximum number of replicas for the Inference Endpoint.
Examples
- 1
- 2
- 6
- 8
--min-replicas
The minimum number of replicas for the Inference Endpoint.
Examples
- 0
- 1
- 3
- 8
Examples
Set the mistral7b-i Inference Endpoint to be able to scale down to 0 and go up to 8 replicas:
flexai inference scale mistral7b-i --min-replicas 0 --max-replicas 8
Set the Qwen3-Coder-480B Inference Endpoint to have a minimum of 2 replicas and a maximum of 4 replicas:
flexai inference scale Qwen3-Coder-480B --min-replicas 2 --max-replicas 4
Make the DeepSeek-V3 Inference Endpoint run with 6 replicas all the time:
flexai inference scale DeepSeek-V3 --min-replicas 6 --max-replicas 6