FlexAI Inference Endpoints support automatic scaling to handle varying traffic loads efficiently. Autoscaling allows you to define minimum and maximum replica counts, ensuring your endpoints can scale up during high demand and scale down during low usage to optimize costs.
Autoscaling in FlexAI provides several key benefits: endpoints scale up automatically to absorb traffic spikes, scale down (to zero replicas, if configured) during idle periods to reduce costs, and do so without manual intervention.
Two settings control autoscaling:

- Min Replicas (also referred to as Min Accels): The minimum number of replicas for your Inference Endpoint. This value determines the baseline capacity that will always be available.
- Max Replicas (also referred to as Max Accels): The maximum number of replicas that can be created for your Inference Endpoint. This acts as an upper bound to control resource usage and costs.
Use the `flexai inference scale` command to configure autoscaling for your Inference Endpoints:

`flexai inference scale <inference_endpoint_name> --min-replicas <min_replicas> --max-replicas <max_replicas>`

- `--min-replicas`: The minimum number of replicas for the Inference Endpoint. Examples: 0, 1, 3, 8
- `--max-replicas`: The maximum number of replicas for the Inference Endpoint. Examples: 1, 2, 6, 8

For development or low-traffic endpoints where cost optimization is prioritized:
| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 0 |
| Max Accels | 2 |
`flexai inference scale my_dev_endpoint --min-replicas 0 --max-replicas 2`

This configuration allows the endpoint to scale down to zero replicas when it is idle and to scale up to two replicas to handle light traffic.
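Because this configuration can scale all the way down to zero, the first request after an idle period may fail or time out while a replica starts. Below is a minimal sketch of a client-side retry loop; the endpoint URL and request body are placeholders rather than FlexAI-specific values.

```bash
#!/usr/bin/env bash
# Minimal sketch: retry a request while a scaled-to-zero endpoint warms up.
# ENDPOINT_URL and the JSON payload are placeholders for your own endpoint.
ENDPOINT_URL="https://your-inference-endpoint.example.com/v1/completions"

for attempt in $(seq 1 10); do
  if curl --silent --fail --max-time 30 \
       -H "Content-Type: application/json" \
       -d '{"prompt": "ping"}' \
       "$ENDPOINT_URL" > /dev/null; then
    echo "Endpoint is serving traffic."
    break
  fi
  echo "Attempt ${attempt} failed; waiting for a replica to start..." >&2
  sleep 15   # give the autoscaler time to bring up the first replica
done
```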
For production endpoints where consistent performance is crucial:
| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 2 |
| Max Accels | 10 |
`flexai inference scale my_prod_endpoint --min-replicas 2 --max-replicas 10`

This configuration ensures that two replicas are always available to serve baseline traffic, while up to ten replicas can be added during demand peaks.
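To sanity-check that a performance-optimized endpoint actually scales up under load, you can send a burst of concurrent requests and watch latency while the autoscaler adds replicas. The sketch below assumes a placeholder URL and payload; adjust the request to match your model's API.

```bash
#!/usr/bin/env bash
# Rough sketch: send a burst of concurrent requests so you can observe the
# endpoint scaling from its 2-replica baseline toward the 10-replica ceiling.
# ENDPOINT_URL and the payload are placeholders, not FlexAI-specific values.
ENDPOINT_URL="https://your-inference-endpoint.example.com/v1/completions"

for i in $(seq 1 200); do
  curl --silent --output /dev/null --max-time 60 \
       -H "Content-Type: application/json" \
       -d '{"prompt": "load test"}' \
       "$ENDPOINT_URL" &
  # Keep at most 20 requests in flight at a time.
  if (( i % 20 == 0 )); then
    wait
  fi
done
wait
```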
For endpoints requiring consistent resources or predictable performance:
| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 4 |
| Max Accels | 4 |
`flexai inference scale my_stable_endpoint --min-replicas 4 --max-replicas 4`

This configuration pins the endpoint at exactly four replicas, providing predictable capacity and performance regardless of traffic.
Testing and development workloads
| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 0 |
| Max Accels | 1 |
`flexai inference scale mistral7b-dev --min-replicas 0 --max-replicas 1`

Scales to zero when not in use, with a single-replica maximum for cost control.
High-availability production service
| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 3 |
| Max Accels | 15 |
`flexai inference scale my_prod_inference_endpoint --min-replicas 3 --max-replicas 15`

Maintains 3 replicas for consistent availability and scales up to 15 replicas during peak traffic.
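If peak hours are predictable, the same `flexai inference scale` command can be used to raise the replica floor ahead of time and lower it again afterwards. The values below (a pre-warm floor of 6) are illustrative assumptions, not recommendations.

```bash
# Before the expected peak (e.g. triggered by a scheduler): raise the floor so
# extra replicas are already warm. The value 6 is an illustrative assumption.
flexai inference scale my_prod_inference_endpoint --min-replicas 6 --max-replicas 15

# After the peak: return to the normal baseline shown in the table above.
flexai inference scale my_prod_inference_endpoint --min-replicas 3 --max-replicas 15
```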
Predictable batch workloads
| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 5 |
| Max Accels | 5 |
`flexai inference scale my_batch_inference_endpoint --min-replicas 5 --max-replicas 5`

Fixed scaling provides consistent processing capacity and prevents resource contention.
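If the batch job runs only periodically, a small wrapper script can reserve the fixed pool just before the run and release it afterwards. This is a sketch that assumes you prefer to free capacity between runs; `run_batch_job.sh` is a placeholder for your own workload.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Reserve a fixed pool of 5 replicas for the duration of the batch run.
flexai inference scale my_batch_inference_endpoint --min-replicas 5 --max-replicas 5

# run_batch_job.sh is a placeholder for your own batch workload.
./run_batch_job.sh

# Release the capacity once the run is finished.
flexai inference scale my_batch_inference_endpoint --min-replicas 0 --max-replicas 1
```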
Slow Response Times: If your endpoint takes too long to respond after being idle, increase `--min-replicas` to maintain warm replicas.

High Costs: If your endpoint is more expensive than expected, lower `--max-replicas` to limit peak resource usage, or set `--min-replicas` to 0 for non-critical endpoints.

Capacity Issues: If your endpoint can’t handle traffic spikes, increase `--max-replicas` to allow for more scaling headroom.
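As a concrete example, the adjustments above map onto commands like the following; the endpoint name and replica values are illustrative.

```bash
# Slow response times after idle: keep at least one replica warm.
flexai inference scale my_endpoint --min-replicas 1 --max-replicas 8

# High costs: lower the ceiling and let non-critical endpoints scale to zero.
flexai inference scale my_endpoint --min-replicas 0 --max-replicas 2

# Capacity issues: raise the ceiling for more scaling headroom.
flexai inference scale my_endpoint --min-replicas 2 --max-replicas 15
```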