inference scale

Defines scaling policies for an Inference Endpoint.

You can define a minimum and a maximum number of replicas, setting the lower and upper bounds on how many replicas can be created for an Inference Endpoint.

The lower bound can be set to 0, in which case the Inference Endpoint scales down to zero and does not run while it receives no requests.
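
For instance, assuming a hypothetical Inference Endpoint named chat-assistant, the following lets it sit at zero replicas when idle while allowing it to scale up to 4:

Terminal window
flexai inference scale chat-assistant --min-replicas 0 --max-replicas 4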

Note: --max-replicas and --min-replicas can be set to the same value (as long as it is greater than 0), which effectively disables auto-scaling and pins the number of replicas at that value.
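
For example, to pin a hypothetical Inference Endpoint named embeddings-svc at exactly 3 replicas:

Terminal window
flexai inference scale embeddings-svc --min-replicas 3 --max-replicas 3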

Terminal window
flexai inference scale <inference_endpoint_name> --min-replicas <min_replicas> --max-replicas <max_replicas>
--max-replicas
Integer

The maximum number of replicas for the Inference Endpoint.

Examples
  • 1
  • 2
  • 6
  • 8
--min-replicas
Integer

The minimum number of replicas for the Inference Endpoint.

Examples
  • 0
  • 1
  • 3
  • 8

Allow the mistral7b-i Inference Endpoint to scale down to 0 and up to 8 replicas:

Terminal window
flexai inference scale mistral7b-i --min-replicas 0 --max-replicas 8

Set the Qwen3-Coder-480B Inference Endpoint to have a minimum of 2 replicas and a maximum of 4 replicas:

Terminal window
flexai inference scale Qwen3-Coder-480B --min-replicas 2 --max-replicas 4

Make the DeepSeek-V3 Inference Endpoint run with exactly 6 replicas at all times:

Terminal window
flexai inference scale DeepSeek-V3 --min-replicas 6 --max-replicas 6