FlexAI Inference Autoscaling

FlexAI Inference Endpoints support automatic scaling to handle varying traffic loads efficiently. Autoscaling allows you to define minimum and maximum accel counts, ensuring your endpoints can scale up during high demand and scale down during low usage to optimize costs.

Autoscaling in FlexAI provides several key benefits:

  • Cost Optimization: Scale down to zero accels when there’s no traffic
  • Performance: Automatically scale up to handle increased demand
  • Resource Management: Set upper bounds to control maximum resource usage
  • Fixed Scaling: Lock accels to a specific count by setting min and max to the same value
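The rules above can be captured in a small sketch. This is a hypothetical model, not a FlexAI API: a policy is valid when 0 ≤ min ≤ max, min = 0 enables scale-to-zero, and min = max pins the endpoint to a fixed size.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AutoscalingPolicy:
    """Hypothetical model of an autoscaling policy (not an official FlexAI type)."""
    min_accels: int
    max_accels: int

    def __post_init__(self):
        # Min must be non-negative; max must be at least min.
        if self.min_accels < 0:
            raise ValueError("min_accels must be >= 0")
        if self.max_accels < self.min_accels:
            raise ValueError("max_accels must be >= min_accels")

    @property
    def scales_to_zero(self) -> bool:
        return self.min_accels == 0

    @property
    def is_fixed(self) -> bool:
        # Setting min and max to the same value locks the accel count.
        return self.min_accels == self.max_accels

print(AutoscalingPolicy(0, 2).scales_to_zero)  # True
print(AutoscalingPolicy(4, 4).is_fixed)        # True
```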

Min Accels

The minimum number of accels for your Inference Endpoint. This value sets the baseline capacity that is always available.

  • Setting to 0: Allows the endpoint to scale down completely when there are no requests, optimizing costs
  • Setting above 0: Ensures a baseline level of availability and faster response times for initial requests

Max Accels

The maximum number of accels that can be created for your Inference Endpoint. This acts as an upper bound on resource usage and costs.

  • Should be greater than or equal to the minimum accels
  • Setting too high: May lead to unexpected costs during traffic spikes
Enabling Autoscaling

  1. Navigate to the Inference section in the FlexAI Console and select the “+ New” button, or open the Launch a new inference endpoint form directly
  2. Fill out the form with the required values
  3. Select the “Enable Auto-scaling” toggle to enable this section of the form
  4. Adjust the Min Accels and Max Accels values

Cost-Optimized Scaling

For development or low-traffic endpoints where cost optimization is prioritized:

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 0 |
| Max Accels | 2 |

This configuration allows the endpoint to:

  • Scale down to zero when there’s no traffic (no costs)
  • Scale up to a maximum of 2 replicas during usage
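Back-of-envelope arithmetic shows why min = 0 matters here. The hourly rate and usage figures below are made-up illustrations, not FlexAI pricing:

```python
HOURS_PER_MONTH = 730
RATE_PER_ACCEL_HOUR = 2.50  # hypothetical price, not FlexAI's

def monthly_cost(min_accels: int, busy_accel_hours: float) -> float:
    """Baseline accels bill around the clock; accel-hours above the
    baseline bill only while the endpoint is actually scaled up."""
    baseline = min_accels * HOURS_PER_MONTH * RATE_PER_ACCEL_HOUR
    return baseline + busy_accel_hours * RATE_PER_ACCEL_HOUR

# A dev endpoint that is busy ~40 accel-hours per month:
print(monthly_cost(0, 40))  # 100.0  -- scale-to-zero: pay only for busy hours
print(monthly_cost(1, 40))  # 1925.0 -- min=1: the always-on accel dominates
```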

Performance-Oriented Scaling

For production endpoints where consistent performance is crucial:

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 2 |
| Max Accels | 10 |

This configuration:

  • Always maintains 2 replicas for immediate response
  • Scales up to 10 replicas during high demand
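One way to reason about a 2→10 policy is in terms of throughput ceilings. The per-accel request rate below is an assumed figure for illustration, not a FlexAI benchmark:

```python
def burst_capacity(accels: int, reqs_per_accel: float) -> float:
    # Peak throughput the endpoint can sustain at a given accel count.
    return accels * reqs_per_accel

# Assuming each accel handles ~50 req/s (illustrative only):
print(burst_capacity(2, 50))   # 100 -- floor that Min Accels always provides
print(burst_capacity(10, 50))  # 500 -- ceiling once fully scaled out
```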

Fixed Scaling

For endpoints requiring consistent resources or predictable performance:

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 4 |
| Max Accels | 4 |

This configuration:

  • Disables autoscaling by setting min and max to the same value
  • Always maintains exactly 4 replicas

Other Example Scenarios

Development Environment

Testing and development workloads

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 0 |
| Max Accels | 1 |

Scales to zero when not in use, with a single-replica cap for cost control.

Production API

High-availability production service

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 3 |
| Max Accels | 15 |

Maintains 3 replicas for consistent availability, scaling up to 15 replicas during peak traffic.

Batch Processing

Predictable batch workloads

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 5 |
| Max Accels | 5 |

Fixed scaling provides consistent processing capacity and prevents resource contention.
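The scenarios above can be collected into a lookup. The (min, max) pairs are copied from the tables; the preset names themselves are ours, not FlexAI terminology:

```python
# (min_accels, max_accels) per scenario, taken from the tables above.
PRESETS = {
    "cost-optimized": (0, 2),
    "performance": (2, 10),
    "fixed": (4, 4),
    "development": (0, 1),
    "production-api": (3, 15),
    "batch": (5, 5),
}

def preset(name: str) -> tuple[int, int]:
    lo, hi = PRESETS[name]
    # Every preset must satisfy the basic invariant 0 <= min <= max.
    assert 0 <= lo <= hi
    return lo, hi

print(preset("production-api"))  # (3, 15)
```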

Best Practices

Choosing Minimum Replicas

  • Set to 0 for development, testing, or infrequently used endpoints
  • Set to 1-2 for production endpoints that need quick response times
  • Set higher for high-availability services or when you need guaranteed capacity

Choosing Maximum Replicas

  • Consider your infrastructure limits and budget constraints
  • Monitor typical traffic patterns to set appropriate upper bounds
  • Leave room for traffic spikes (set 20-50% above normal peak usage)
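The 20-50% headroom rule can be made concrete. The peak-usage figure below is an example input, not a measured value:

```python
import math

def recommended_max(normal_peak_accels: float, headroom: float = 0.3) -> int:
    """Max accels = normal peak plus 20-50% headroom, rounded up."""
    if not 0.2 <= headroom <= 0.5:
        raise ValueError("headroom should be between 0.2 and 0.5")
    return math.ceil(normal_peak_accels * (1 + headroom))

# If normal peak usage is 8 accels:
print(recommended_max(8))       # 11 -- 30% headroom
print(recommended_max(8, 0.5))  # 12 -- 50% headroom
```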

Monitoring and Optimization

  • Track response times and error rates during scaling events
  • Monitor costs and adjust min/max values based on usage patterns
  • Use gradual changes when adjusting scaling policies in production
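"Gradual changes" might look like stepping a scaling value one accel per deploy toward a target rather than jumping directly. This is a sketch of the practice, not FlexAI behavior:

```python
def step_toward(current: int, target: int, max_step: int = 1) -> int:
    """Move a configured value toward target by at most max_step per change."""
    delta = target - current
    if delta > 0:
        return current + min(delta, max_step)
    return current - min(-delta, max_step)

# Raising Min Accels from 1 to 4 one deploy at a time:
value, path = 1, []
while value != 4:
    value = step_toward(value, 4)
    path.append(value)
print(path)  # [2, 3, 4]
```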

Troubleshooting

Common Issues

Slow Response Times: If your endpoint takes too long to respond after being idle:

  • Increase Min Accels to maintain warm replicas
  • Consider the cold-start time of your model

High Costs: If your endpoint is more expensive than expected:

  • Reduce Max Accels to limit peak resource usage
  • Set Min Accels to 0 for non-critical endpoints

Capacity Issues: If your endpoint can’t handle traffic spikes:

  • Increase Max Accels to allow for more scaling headroom
  • Monitor scaling metrics to understand demand patterns