FlexAI Inference Autoscaling

FlexAI Inference Endpoints support automatic scaling to handle varying traffic loads efficiently. Autoscaling allows you to define minimum and maximum accel counts, ensuring your endpoints can scale up during high demand and scale down during low usage to optimize costs.

Autoscaling in FlexAI provides several key benefits:

  • Cost Optimization: Scale down to zero accels when there’s no traffic
  • Performance: Automatically scale up to handle increased demand
  • Resource Management: Set upper bounds to control maximum resource usage
  • Fixed Scaling: Lock accels to a specific count by setting min and max to the same value
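The rules above can be captured in a small sketch. This is a hypothetical model, not a FlexAI API: a policy is valid when 0 ≤ min ≤ max, min = 0 enables scale-to-zero, and min = max pins the endpoint to a fixed size.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AutoscalingPolicy:
    """Hypothetical model of an autoscaling policy (not an official FlexAI type)."""
    min_accels: int
    max_accels: int

    def __post_init__(self):
        # Min must be non-negative; max must be at least min.
        if self.min_accels < 0:
            raise ValueError("min_accels must be >= 0")
        if self.max_accels < self.min_accels:
            raise ValueError("max_accels must be >= min_accels")

    @property
    def scales_to_zero(self) -> bool:
        return self.min_accels == 0

    @property
    def is_fixed(self) -> bool:
        # Setting min and max to the same value locks the accel count.
        return self.min_accels == self.max_accels

print(AutoscalingPolicy(0, 2).scales_to_zero)  # True
print(AutoscalingPolicy(4, 4).is_fixed)        # True
```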

Min Accels

The minimum number of accels for your Inference Endpoint. This value sets the baseline capacity that is always available.

  • Setting to 0: Allows the endpoint to scale down completely when there are no requests, optimizing costs
  • Setting above 0: Ensures a baseline level of availability and faster response times for initial requests

Max Accels

The maximum number of accels that can be created for your Inference Endpoint. This acts as an upper bound on resource usage and costs.

  • Should be greater than or equal to the minimum accels
  • Setting too high: May lead to unexpected costs during traffic spikes
Enabling Autoscaling

  1. Navigate to the Inference section in the FlexAI Console and select the “+ New” button, or open the Launch a new inference endpoint form directly
  2. Fill out the form with the required values
  3. Select the “Enable Auto-scaling” toggle to enable this section of the form
  4. Adjust the Min Accels and Max Accels values

Cost-Optimized Scaling

For development or low-traffic endpoints where cost optimization is prioritized:

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 0 |
| Max Accels | 2 |

This configuration allows the endpoint to:

  • Scale down to zero when there’s no traffic (no costs)
  • Scale up to a maximum of 2 replicas during usage
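Back-of-envelope arithmetic shows why min = 0 matters here. The hourly rate and usage figures below are made-up illustrations, not FlexAI pricing:

```python
HOURS_PER_MONTH = 730
RATE_PER_ACCEL_HOUR = 2.50  # hypothetical price, not FlexAI's

def monthly_cost(min_accels: int, busy_accel_hours: float) -> float:
    """Baseline accels bill around the clock; accel-hours above the
    baseline bill only while the endpoint is actually scaled up."""
    baseline = min_accels * HOURS_PER_MONTH * RATE_PER_ACCEL_HOUR
    return baseline + busy_accel_hours * RATE_PER_ACCEL_HOUR

# A dev endpoint that is busy ~40 accel-hours per month:
print(monthly_cost(0, 40))  # 100.0  -- scale-to-zero: pay only for busy hours
print(monthly_cost(1, 40))  # 1925.0 -- min=1: the always-on accel dominates
```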

Performance-Oriented Scaling

For production endpoints where consistent performance is crucial:

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 2 |
| Max Accels | 10 |

This configuration:

  • Always maintains 2 replicas for immediate response
  • Scales up to 10 replicas during high demand
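One way to reason about a 2→10 policy is in terms of throughput ceilings. The per-accel request rate below is an assumed figure for illustration, not a FlexAI benchmark:

```python
def burst_capacity(accels: int, reqs_per_accel: float) -> float:
    # Peak throughput the endpoint can sustain at a given accel count.
    return accels * reqs_per_accel

# Assuming each accel handles ~50 req/s (illustrative only):
print(burst_capacity(2, 50))   # 100 -- floor that Min Accels always provides
print(burst_capacity(10, 50))  # 500 -- ceiling once fully scaled out
```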

Fixed Scaling

For endpoints requiring consistent resources or predictable performance:

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 4 |
| Max Accels | 4 |

This configuration:

  • Disables autoscaling by setting min and max to the same value
  • Always maintains exactly 4 replicas

Other Example Scenarios

Development Environment

Testing and development workloads

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 0 |
| Max Accels | 1 |

Scales to zero when not in use, with a single-replica cap for cost control.

Production API

High-availability production service

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 3 |
| Max Accels | 15 |

Maintains 3 replicas for consistent availability, scaling up to 15 replicas during peak traffic.

Batch Processing

Predictable batch workloads

| Auto-scaling Policy | Value |
| --- | --- |
| Min Accels | 5 |
| Max Accels | 5 |

Fixed scaling provides consistent processing capacity and prevents resource contention.
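The scenarios above can be collected into a lookup. The (min, max) pairs are copied from the tables; the preset names themselves are ours, not FlexAI terminology:

```python
# (min_accels, max_accels) per scenario, taken from the tables above.
PRESETS = {
    "cost-optimized": (0, 2),
    "performance": (2, 10),
    "fixed": (4, 4),
    "development": (0, 1),
    "production-api": (3, 15),
    "batch": (5, 5),
}

def preset(name: str) -> tuple[int, int]:
    lo, hi = PRESETS[name]
    # Every preset must satisfy the basic invariant 0 <= min <= max.
    assert 0 <= lo <= hi
    return lo, hi

print(preset("production-api"))  # (3, 15)
```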

Best Practices

Choosing Minimum Replicas

  • Set to 0 for development, testing, or infrequently used endpoints
  • Set to 1-2 for production endpoints that need quick response times
  • Set higher for high-availability services or when you need guaranteed capacity

Choosing Maximum Replicas

  • Consider your infrastructure limits and budget constraints
  • Monitor typical traffic patterns to set appropriate upper bounds
  • Leave room for traffic spikes (set 20-50% above normal peak usage)
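The 20-50% headroom rule can be made concrete. The peak-usage figure below is an example input, not a measured value:

```python
import math

def recommended_max(normal_peak_accels: float, headroom: float = 0.3) -> int:
    """Max accels = normal peak plus 20-50% headroom, rounded up."""
    if not 0.2 <= headroom <= 0.5:
        raise ValueError("headroom should be between 0.2 and 0.5")
    return math.ceil(normal_peak_accels * (1 + headroom))

# If normal peak usage is 8 accels:
print(recommended_max(8))       # 11 -- 30% headroom
print(recommended_max(8, 0.5))  # 12 -- 50% headroom
```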

Monitoring and Optimization

  • Track response times and error rates during scaling events
  • Monitor costs and adjust min/max values based on usage patterns
  • Use gradual changes when adjusting scaling policies in production
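"Gradual changes" might look like stepping a scaling value one accel per deploy toward a target rather than jumping directly. This is a sketch of the practice, not FlexAI behavior:

```python
def step_toward(current: int, target: int, max_step: int = 1) -> int:
    """Move a configured value toward target by at most max_step per change."""
    delta = target - current
    if delta > 0:
        return current + min(delta, max_step)
    return current - min(-delta, max_step)

# Raising Min Accels from 1 to 4 one deploy at a time:
value, path = 1, []
while value != 4:
    value = step_toward(value, 4)
    path.append(value)
print(path)  # [2, 3, 4]
```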

Troubleshooting

Common Issues

Slow Response Times: If your endpoint takes too long to respond after being idle:

  • Increase Min Accels to maintain warm replicas
  • Consider the cold-start time of your model

High Costs: If your endpoint is more expensive than expected:

  • Reduce Max Accels to limit peak resource usage
  • Set Min Accels to 0 for non-critical endpoints

Capacity Issues: If your endpoint can’t handle traffic spikes:

  • Increase Max Accels to allow for more scaling headroom
  • Monitor scaling metrics to understand demand patterns