Overview
FlexAI Inference Endpoints support automatic scaling to handle varying traffic loads efficiently. Autoscaling lets you define minimum and maximum accel counts, so your endpoints can scale up during high demand and scale down during low usage to optimize costs.

Autoscaling in FlexAI provides several key benefits:
- Cost Optimization: Scale down to zero accels when there’s no traffic
- Performance: Automatically scale up to handle increased demand
- Resource Management: Set upper bounds to control maximum resource usage
- Fixed Scaling: Lock accels to a specific count by setting min and max to the same value
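The behaviors above can be sketched as a simple control rule: compute how many accels the current load calls for, then clamp the result to the [min, max] range. This is an illustrative sketch of the general idea, not FlexAI's actual scaling algorithm; the function name and the `capacity_per_accel` parameter are assumptions for the example.

```python
import math

def desired_accels(requests_per_sec: float, capacity_per_accel: float,
                   min_accels: int, max_accels: int) -> int:
    """Illustrative autoscaling rule: scale with load, clamped to [min, max].

    A sketch of the general pattern, not FlexAI's actual algorithm.
    """
    raw = math.ceil(requests_per_sec / capacity_per_accel)
    return max(min_accels, min(max_accels, raw))

# No traffic with min=0: scale down to zero (no costs)
print(desired_accels(0, 10, min_accels=0, max_accels=2))    # 0
# Traffic spike: capped by the upper bound
print(desired_accels(500, 10, min_accels=0, max_accels=2))  # 2
```

Note how setting min and max to the same value makes the clamp collapse to a constant, which is exactly the fixed-scaling case.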
Configuration Options
Minimum Accels
The minimum number of accels for your Inference Endpoint. This value determines the baseline capacity that is always available.
- Setting to 0: Allows the endpoint to scale down completely when there are no requests, optimizing costs
- Setting above 0: Ensures a baseline level of availability and faster response times for initial requests
Maximum Accels
The maximum number of accels that can be created for your Inference Endpoint. This acts as an upper bound to control resource usage and costs.
- Should be greater than or equal to the minimum accels
- Setting too high: May lead to unexpected costs during traffic spikes
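The two constraints above (min at least 0, max at least min) can be captured in a small validation check. This is a hypothetical helper for illustration, not part of any FlexAI SDK.

```python
def validate_scaling_policy(min_accels: int, max_accels: int) -> None:
    """Check the autoscaling constraints described above.

    Hypothetical helper for illustration, not a FlexAI API.
    """
    if min_accels < 0:
        raise ValueError("Min Accels must be >= 0")
    if max_accels < min_accels:
        raise ValueError("Max Accels must be >= Min Accels")

validate_scaling_policy(0, 2)   # cost-optimized policy: valid
validate_scaling_policy(4, 4)   # fixed scaling: valid
try:
    validate_scaling_policy(3, 1)
except ValueError as e:
    print(e)  # Max Accels must be >= Min Accels
```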
Managing Autoscaling
Setting Scaling Policies
1. Navigate to the Inference section in the FlexAI Console and select the "+ New" button, or open the form directly.
2. Fill out the form with the required values.
3. Select the "Enable Auto-scaling" toggle to enable this section of the form.
4. Adjust the Min Accels and Max Accels values.
Common Scaling Scenarios
Cost-Optimized Scaling
For development or low-traffic endpoints where cost optimization is prioritized:

| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 0 |
| Max Accels | 2 |
- Scale down to zero when there’s no traffic (no costs)
- Scale up to a maximum of 2 replicas during usage
Performance-Oriented Scaling
For production endpoints where consistent performance is crucial:

| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 2 |
| Max Accels | 10 |
- Always maintains 2 replicas for immediate response
- Can scale up to 10 replicas during high demand
Fixed Scaling
For endpoints requiring consistent resources or predictable performance:

| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 4 |
| Max Accels | 4 |
- Disables autoscaling by setting min and max to the same value
- Always maintains exactly 4 replicas
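Because scaling decisions are clamped between min and max, setting both to the same value pins the replica count regardless of demand. A minimal sketch of this effect (the function name is an assumption for the example, not a FlexAI API):

```python
def effective_accels(demand: int, min_accels: int, max_accels: int) -> int:
    # Clamp the demanded replica count to the configured [min, max] range.
    # With min == max, the clamp collapses to a constant:
    # autoscaling is effectively disabled.
    return max(min_accels, min(max_accels, demand))

for demand in (0, 3, 100):
    print(effective_accels(demand, 4, 4))  # 4 every time
```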
Other Example Scenarios
Development Environment
Testing and development workloads:

| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 0 |
| Max Accels | 1 |
Production API
High-availability production service:

| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 3 |
| Max Accels | 15 |
Batch Processing
Predictable batch workloads:

| Auto-scaling Policy | Value |
|---|---|
| Min Accels | 5 |
| Max Accels | 5 |
Best Practices
Choosing Minimum Replicas
- Set to 0 for development, testing, or infrequently used endpoints
- Set to 1-2 for production endpoints that need quick response times
- Set higher for high-availability services or when you need guaranteed capacity
Choosing Maximum Replicas
- Consider your infrastructure limits and budget constraints
- Monitor typical traffic patterns to set appropriate upper bounds
- Leave room for traffic spikes (set 20-50% above normal peak usage)
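The headroom rule above is simple arithmetic: take your observed peak accel usage and add 20-50% on top. A small sketch (the function name and default headroom are assumptions for the example):

```python
import math

def max_accels_with_headroom(peak_accels: int, headroom: float = 0.3) -> int:
    """Suggest a Max Accels value 20-50% above observed peak usage.

    `headroom` should be in the 0.2-0.5 range per the guidance above.
    Illustrative helper, not a FlexAI API.
    """
    return math.ceil(peak_accels * (1 + headroom))

print(max_accels_with_headroom(10))       # 13 (30% headroom)
print(max_accels_with_headroom(10, 0.5))  # 15 (50% headroom)
```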
Monitoring and Optimization
- Track response times and error rates during scaling events
- Monitor costs and adjust min/max values based on usage patterns
- Use gradual changes when adjusting scaling policies in production
Troubleshooting
Common Issues
Slow Response Times: If your endpoint takes too long to respond after being idle:
- Increase Min Accels to maintain warm replicas
- Consider the cold-start time of your model

High Costs: If the endpoint is consuming more resources than expected:
- Reduce Max Accels to limit peak resource usage
- Set Min Accels to 0 for non-critical endpoints

Insufficient Capacity: If the endpoint cannot keep up with peak demand:
- Increase Max Accels to allow for more scaling headroom
- Monitor scaling metrics to understand demand patterns