# FlexAI Inference Autoscaling

> Set up and manage autoscaling rules for FlexAI Inference Endpoints

## Overview

<Tabs>
  <Tab title="Using the FlexAI Console">
    FlexAI Inference Endpoints support automatic scaling to handle varying traffic loads efficiently. Autoscaling allows you to define minimum and maximum accel counts, ensuring your endpoints can scale up during high demand and scale down during low usage to optimize costs.

    Autoscaling in FlexAI provides several key benefits:

    * **Cost Optimization**: Scale down to zero accels when there's no traffic
    * **Performance**: Automatically scale up to handle increased demand
    * **Resource Management**: Set upper bounds to control maximum resource usage
    * **Fixed Scaling**: Lock accels to a specific count by setting min and max to the same value
  </Tab>

  <Tab title="Using the FlexAI CLI">
    FlexAI Inference Endpoints support automatic scaling to handle varying traffic loads efficiently. Autoscaling allows you to define minimum and maximum replica counts, ensuring your endpoints can scale up during high demand and scale down during low usage to optimize costs.

    Autoscaling in FlexAI provides several key benefits:

    * **Cost Optimization**: Scale down to zero replicas when there's no traffic
    * **Performance**: Automatically scale up to handle increased demand
    * **Resource Management**: Set upper bounds to control maximum resource usage
    * **Fixed Scaling**: Lock replicas to a specific count by setting min and max to the same value
  </Tab>
</Tabs>

## Configuration Options

<Tabs>
  <Tab title="Using the FlexAI Console">
    #### Minimum Accels

    The minimum number of accels for your Inference Endpoint. This value determines the baseline capacity that will always be available.

    * **Setting to 0**: Allows the endpoint to scale down completely when there are no requests, optimizing costs
    * **Setting above 0**: Ensures a baseline level of availability and faster response times for initial requests

    #### Maximum Accels

    The maximum number of accels that can be created for your Inference Endpoint. This acts as an upper bound to control resource usage and costs.

    * **Constraint**: Should be greater than or equal to the minimum accels
    * **Setting too high**: May lead to unexpected costs during traffic spikes
  </Tab>

  <Tab title="Using the FlexAI CLI">
    #### Minimum Replicas

    The minimum number of replicas for your Inference Endpoint. This value determines the baseline capacity that will always be available.

    * **Setting to 0**: Allows the endpoint to scale down completely when there are no requests, optimizing costs
    * **Setting above 0**: Ensures a baseline level of availability and faster response times for initial requests

    #### Maximum Replicas

    The maximum number of replicas that can be created for your Inference Endpoint. This acts as an upper bound to control resource usage and costs.

    * **Constraint**: Should be greater than or equal to the minimum replicas
    * **Setting too high**: May lead to unexpected costs during traffic spikes
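
    The relationship between the two bounds can be checked locally before calling the CLI. The following is a minimal sketch, not part of the FlexAI CLI: `validate_replica_bounds` is a hypothetical helper name, and it only assumes that both values are non-negative integers with the maximum at least as large as the minimum. The example prints the command rather than running it.

    ```bash
    # Hypothetical helper (not a FlexAI command): checks replica bounds locally.
    validate_replica_bounds() {
      local min="$1" max="$2"
      # Both values must be non-negative integers.
      case "$min" in ''|*[!0-9]*) echo "min must be a non-negative integer" >&2; return 1 ;; esac
      case "$max" in ''|*[!0-9]*) echo "max must be a non-negative integer" >&2; return 1 ;; esac
      # The maximum should not be lower than the minimum.
      if [ "$max" -lt "$min" ]; then
        echo "max ($max) is lower than min ($min)" >&2
        return 1
      fi
      return 0
    }

    # Print the scale command only when the bounds are consistent.
    if validate_replica_bounds 0 2; then
      echo "flexai inference scale my_dev_endpoint --min-replicas 0 --max-replicas 2"
    fi
    ```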
  </Tab>
</Tabs>

## Managing Autoscaling

### Setting Scaling Policies

<Tabs>
  <Tab title="Using the FlexAI Console">
    <Steps>
      <Step title="Navigate to Inference">
        Navigate to the **Inference** section in the FlexAI Console and select the "+ New" button or open the form directly
      </Step>

      <Step title="Fill out the form">
        Fill out the form with the required values
      </Step>

      <Step title="Enable Auto-scaling">
        Select the "Enable Auto-scaling" toggle element to enable this section of the form
      </Step>

      <Step title="Adjust scaling values">
        Adjust the **Min Accels** and **Max Accels** values
      </Step>
    </Steps>
  </Tab>

  <Tab title="Using the FlexAI CLI">
    Use the `flexai inference scale` command to configure autoscaling for your Inference Endpoints:

    ```bash theme={null}
    flexai inference scale <inference_endpoint_name> --min-replicas <min_replicas> --max-replicas <max_replicas>
    ```

    ##### Flags

    | Flag             | Type    | Description                                                |
    | ---------------- | ------- | ---------------------------------------------------------- |
    | `--max-replicas` | integer | The maximum number of replicas for the Inference Endpoint. |
    | `--min-replicas` | integer | The minimum number of replicas for the Inference Endpoint. |
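
    When several endpoints need new bounds, it can help to review the exact commands before running them. The sketch below is a dry run under stated assumptions: `print_scale_commands` is a hypothetical helper, the `<endpoint> <min> <max>` input format is invented for illustration, and the endpoint names are placeholders. It only prints `flexai inference scale` commands using the two flags documented above.

    ```bash
    # Hypothetical helper: reads "<endpoint> <min> <max>" lines from stdin
    # and prints one scale command per line without executing anything.
    print_scale_commands() {
      local endpoint min max
      while read -r endpoint min max; do
        # Skip blank lines.
        if [ -z "$endpoint" ]; then
          continue
        fi
        printf 'flexai inference scale %s --min-replicas %s --max-replicas %s\n' "$endpoint" "$min" "$max"
      done
    }

    # Placeholder endpoint names and bounds.
    printf '%s\n' 'my_dev_endpoint 0 2' 'my_prod_endpoint 2 10' | print_scale_commands
    ```

    Once the printed commands look right, they can be run one at a time.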
  </Tab>
</Tabs>

## Common Scaling Scenarios

<Tabs>
  <Tab title="Using the FlexAI Console">
    ### Cost-Optimized Scaling

    For development or low-traffic endpoints where cost optimization is prioritized:

    | Auto-scaling Policy | Value |
    | :------------------ | ----: |
    | Min Accels          |     0 |
    | Max Accels          |     2 |

    This configuration allows the endpoint to:

    * Scale down to zero when there's no traffic (no costs)
    * Scale up to a maximum of 2 accels during usage

    ### Performance-Oriented Scaling

    For production endpoints where consistent performance is crucial:

    | Auto-scaling Policy | Value |
    | :------------------ | ----: |
    | Min Accels          |     2 |
    | Max Accels          |    10 |

    This configuration:

    * Always maintains 2 accels for immediate response
    * Can scale up to 10 accels during high demand

    ### Fixed Scaling

    For endpoints requiring consistent resources or predictable performance:

    | Auto-scaling Policy | Value |
    | :------------------ | ----: |
    | Min Accels          |     4 |
    | Max Accels          |     4 |

    This configuration:

    * Disables autoscaling by setting min and max to the same value
    * Always maintains exactly 4 accels

    ***

    ## Other Example Scenarios

    ### Development Environment

    For testing and development workloads:

    | Auto-scaling Policy | Value |
    | :------------------ | ----: |
    | Min Accels          |     0 |
    | Max Accels          |     1 |

    Scales to zero when not in use and caps at a single accel for cost control.

    ### Production API

    For a high-availability production service:

    | Auto-scaling Policy | Value |
    | :------------------ | ----: |
    | Min Accels          |     3 |
    | Max Accels          |    15 |

    Maintains 3 accels for consistent availability and scales up to 15 accels during peak traffic.

    ### Batch Processing

    For predictable batch workloads:

    | Auto-scaling Policy | Value |
    | :------------------ | ----: |
    | Min Accels          |     5 |
    | Max Accels          |     5 |

    Uses fixed scaling for consistent processing capacity, which helps avoid resource contention.

    ## Best Practices

    ### Choosing Minimum Replicas

    * **Set to 0** for development, testing, or infrequently used endpoints
    * **Set to 1-2** for production endpoints that need quick response times
    * **Set higher** for high-availability services or when you need guaranteed capacity

    ### Choosing Maximum Replicas

    * Consider your infrastructure limits and budget constraints
    * Monitor typical traffic patterns to set appropriate upper bounds
    * Leave room for traffic spikes (for example, set the maximum 20-50% above normal peak usage)

    ### Monitoring and Optimization

    <Tip>
      Monitor your endpoint [logs and life cycles](/core-services/inference/quickstart/create-public/#checking-the-status-of-your-inference-endpoint) as well as the [Infrastructure Monitor](/platform-services/observability/infrastructure-monitor/) to understand usage patterns and optimize your scaling configuration over time.
    </Tip>

    * Track response times and error rates during scaling events
    * Monitor costs and adjust min/max values based on usage patterns
    * Use gradual changes when adjusting scaling policies in production

    ## Troubleshooting

    ### Common Issues

    **Slow Response Times**: If your endpoint takes too long to respond after being idle:

    * Increase `Min Accels` to maintain warm replicas
    * Consider the cold-start time of your model

    **High Costs**: If your endpoint is more expensive than expected:

    * Reduce `Max Accels` to limit peak resource usage
    * Set `Min Accels` to 0 for non-critical endpoints

    **Capacity Issues**: If your endpoint can't handle traffic spikes:

    * Increase `Max Accels` to allow for more scaling headroom
    * Monitor scaling metrics to understand demand patterns
  </Tab>

  <Tab title="Using the FlexAI CLI">
    ### Cost-Optimized Scaling

    For development or low-traffic endpoints where cost optimization is prioritized:

    ```bash theme={null}
    flexai inference scale my_dev_endpoint --min-replicas 0 --max-replicas 2
    ```

    This configuration allows the endpoint to:

    * Scale down to zero when there's no traffic (no costs)
    * Scale up to a maximum of 2 replicas during usage

    ### Performance-Oriented Scaling

    For production endpoints where consistent performance is crucial:

    ```bash theme={null}
    flexai inference scale my_prod_endpoint --min-replicas 2 --max-replicas 10
    ```

    This configuration:

    * Always maintains 2 replicas for immediate response
    * Can scale up to 10 replicas during high demand

    ### Fixed Scaling

    For endpoints requiring consistent resources or predictable performance:

    ```bash theme={null}
    flexai inference scale my_stable_endpoint --min-replicas 4 --max-replicas 4
    ```

    This configuration:

    * Disables autoscaling by setting min and max to the same value
    * Always maintains exactly 4 replicas

    ***

    ## Other Example Scenarios

    ### Development Environment

    For testing and development workloads:

    ```bash theme={null}
    flexai inference scale mistral7b-dev --min-replicas 0 --max-replicas 1
    ```

    Scales to zero when not in use and caps at a single replica for cost control.

    ### Production API

    For a high-availability production service:

    ```bash theme={null}
    flexai inference scale my_prod_inference_endpoint --min-replicas 3 --max-replicas 15
    ```

    Maintains 3 replicas for consistent availability and scales up to 15 replicas during peak traffic.

    ### Batch Processing

    For predictable batch workloads:

    ```bash theme={null}
    flexai inference scale my_batch_inference_endpoint --min-replicas 5 --max-replicas 5
    ```

    Uses fixed scaling for consistent processing capacity, which helps avoid resource contention.
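
    The three scenarios above can be kept in one place as named presets. This is a minimal sketch rather than a FlexAI feature: `scaling_preset` and the scenario names `dev`, `prod`, and `batch` are hypothetical, and the bounds are copied from the examples in this section. The script prints the resulting command instead of running it.

    ```bash
    # Hypothetical helper: maps a scenario name to a "<min> <max>" pair
    # taken from the example scenarios in this section.
    scaling_preset() {
      case "$1" in
        dev)   echo "0 1" ;;
        prod)  echo "3 15" ;;
        batch) echo "5 5" ;;
        *)     echo "unknown scenario: $1" >&2; return 1 ;;
      esac
    }

    # Split the pair and print the command for the development preset.
    bounds="$(scaling_preset dev)"
    min="${bounds% *}"   # text before the space
    max="${bounds#* }"   # text after the space
    echo "flexai inference scale mistral7b-dev --min-replicas $min --max-replicas $max"
    ```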

    ## Best Practices

    ### Choosing Minimum Replicas

    * **Set to 0** for development, testing, or infrequently used endpoints
    * **Set to 1-2** for production endpoints that need quick response times
    * **Set higher** for high-availability services or when you need guaranteed capacity

    ### Choosing Maximum Replicas

    * Consider your infrastructure limits and budget constraints
    * Monitor typical traffic patterns to set appropriate upper bounds
    * Leave room for traffic spikes (for example, set the maximum 20-50% above normal peak usage)
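
    The headroom guideline is simple arithmetic, and a small sketch can make it concrete. `max_with_headroom` is a hypothetical helper, and the peak replica counts are illustrative numbers, not measurements. It multiplies the observed peak by `(100 + headroom) / 100` and rounds up to the next whole replica.

    ```bash
    # Hypothetical helper: peak replicas plus a headroom percentage,
    # rounded up to a whole replica using integer arithmetic.
    max_with_headroom() {
      local peak="$1" pct="$2"
      echo $(( (peak * (100 + pct) + 99) / 100 ))
    }

    max_with_headroom 10 20   # prints 12
    max_with_headroom 10 50   # prints 15
    ```

    The result is a starting point for `--max-replicas`. Budget and infrastructure limits may call for a lower value.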

    ### Monitoring and Optimization

    <Tip>
      Monitor your endpoint [logs and life cycles](/core-services/inference/quickstart/create-public/#checking-the-status-of-your-inference-endpoint) as well as the [Infrastructure Monitor](/platform-services/observability/infrastructure-monitor/) to understand usage patterns and optimize your scaling configuration over time.
    </Tip>

    * Track response times and error rates during scaling events
    * Monitor costs and adjust min/max values based on usage patterns
    * Use gradual changes when adjusting scaling policies in production
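
    One way to make gradual changes is to move a bound toward its target in fixed increments and review the endpoint between steps. The sketch below assumes a hypothetical helper, `gradual_steps`, and illustrative values (raising the maximum from 4 to 10 in steps of 2). It prints the commands for each step rather than running them.

    ```bash
    # Hypothetical helper: prints each intermediate value from the current
    # setting to the target, moving by at most <step> each time.
    gradual_steps() {
      local current="$1" target="$2" step="$3"
      if [ "$step" -le 0 ]; then
        echo "step must be positive" >&2
        return 1
      fi
      while [ "$current" -ne "$target" ]; do
        if [ "$current" -lt "$target" ]; then
          current=$(( current + step ))
          if [ "$current" -gt "$target" ]; then current="$target"; fi
        else
          current=$(( current - step ))
          if [ "$current" -lt "$target" ]; then current="$target"; fi
        fi
        echo "$current"
      done
    }

    # Print one command per step: max replicas 6, then 8, then 10.
    for max in $(gradual_steps 4 10 2); do
      echo "flexai inference scale my_prod_endpoint --min-replicas 2 --max-replicas $max"
    done
    ```

    Between steps, check response times and error rates before applying the next command.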

    ## Troubleshooting

    ### Common Issues

    **Slow Response Times**: If your endpoint takes too long to respond after being idle:

    * Increase `--min-replicas` to maintain warm replicas
    * Consider the cold-start time of your model

    **High Costs**: If your endpoint is more expensive than expected:

    * Reduce `--max-replicas` to limit peak resource usage
    * Set `--min-replicas` to 0 for non-critical endpoints

    **Capacity Issues**: If your endpoint can't handle traffic spikes:

    * Increase `--max-replicas` to allow for more scaling headroom
    * Monitor scaling metrics to understand demand patterns
  </Tab>
</Tabs>
