> ## Documentation Index
> Fetch the complete documentation index at: https://docs.flex.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# FlexAI Inference Endpoints

> Deploy and manage AI models for inference with FlexAI

Deploy and manage AI models as HTTP API endpoints that you can securely query from your applications.

You can deploy public and private models hosted on the Hugging Face Model Hub or deploy models you've fine-tuned using FlexAI.

## Key Features

<CardGroup>
  <Card title="Fast Deployment" icon="rocket">
    Get models running in minutes
  </Card>

  <Card title="Auto-scaling" icon="random">
    Automatic scaling based on demand
  </Card>

  <Card title="Multiple Models" icon="puzzle">
    Support for multiple vLLM supported models
  </Card>

  <Card title="Cost Optimization" icon="star">
    Pay only for what you use
  </Card>

  <Card title="Monitoring" icon="information">
    Built-in metrics and logging
  </Card>
</CardGroup>

## Anatomy of an Inference Endpoint

A FlexAI Inference Endpoint consists of the following components:

1. **Model to Serve**: The model you want to serve through a FlexAI Inference Endpoint.
   * A Checkpoint generated by a Training or Fine-tuning Job ran with FlexAI.
2. **Authentication Key**: The secret key required for querying the Inference Endpoint.
3. ***Hugging Face Token*'s Secret Name**: The name of the Secret stored in the FlexAI Secret Manager that holds a Hugging Face Token with access to the model you want to serve.
4. **Endpoint Configuration**: Settings that define how the endpoint behaves, such as scaling policies, and vLLM-specific arguments.

### Model to Serve

The model served by your Inference Endpoint can be:

* A public or private model from the Hugging Face Model Hub
* A Checkpoint you manually pushed to the FlexAI Checkpoint Manager
* A Checkpoint generated by a FlexAI Training or Fine-tuning Job

This flexibility allows you to deploy both open-source and custom models as scalable, production-ready endpoints.

### Authentication Key

Each Inference Endpoint is protected by an authentication key (secret key). This key must be provided with every request to the endpoint, ensuring that only authorized users can access your deployed models. You can create your own authentication key by assigning it to a ***Secret*** in the FlexAI Secret Manager.

If an authentication key is not provided, FlexAI will create one for your Inference Endpoint and associate it with a ***Secret*** with the name of your Inference Endpoint: `<inference_endpoint_name>-api-key`.

This enables you to update the Secret in case you want to refresh the authentication key.

### Hugging Face Token's Secret Name

If your model is gated or private on Hugging Face, you will need to provide a Hugging Face Access Token. Store this token as a Secret in the FlexAI Secret Manager, and reference its name in your endpoint configuration. FlexAI will inject this token securely at runtime, allowing the endpoint to pull the model as needed.

### Endpoint Configuration

Endpoint configuration defines how your Inference Endpoint behaves. This includes:

* Scaling policies (auto-scaling, min/max replicas)
* [vLLM-specific arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html) (e.g., batch size, max tokens)

These settings allow you to optimize performance, cost, and reliability for your specific use case.

## Key Concepts

### Endpoint Authentication

FlexAI Inference Endpoints are protected by an authentication token that you can provide during setup. This ensures that only authorized users can access your models.

### Auto-scaling

FlexAI Inference Endpoints can automatically scale based on demand. This means that during periods of high traffic, additional resources will be allocated to handle requests, ensuring consistent performance, and during low traffic periods, resources will be scaled back down to optimize efficient resource utilization.

## vLLM

FlexAI Inference Endpoints leverage the vLLM library for efficient model serving. vLLM is designed to optimize the performance of large language models, making it easier to deploy and scale them in production environments.

You can learn more about vLLM and how to take advantage of its features in the [vLLM documentation](https://docs.vllm.ai/en/stable/usage/index.html).

## CLI Reference

The FlexAI CLI's [`inference` family of commands](/cli/reference/inference) allows you to manage your inference endpoints directly from the command line.

## Getting Started

FlexAI Inference Endpoints can be deployed in a few steps. The Quickstart guide will walk you through the process of setting up your first endpoint. Here's a brief overview of the steps involved:

1. Deploying a FlexAI Inference Endpoint from a model hosted in the Hugging Face Hub.

2. Deploying a FlexAI Inference Endpoint from a model fine-tuned with FlexAI.

3. Querying a FlexAI Inference Endpoint.

The button below will lead you to the FlexAI Inference Endpoints Quickstart guide's overview, where you'll find more details.