Key Features
- Fast Deployment: Get models running in minutes.
- Auto-scaling: Automatic scaling based on demand.
- Multiple Models: Support for multiple vLLM-supported models.
- Cost Optimization: Pay only for what you use.
- Monitoring: Built-in metrics and logging.
Anatomy of an Inference Endpoint
A FlexAI Inference Endpoint consists of the following components:
- Model to Serve: The model you want to serve through a FlexAI Inference Endpoint, for example a Checkpoint generated by a Training or Fine-tuning Job run with FlexAI.
- Authentication Key: The secret key required for querying the Inference Endpoint.
- Hugging Face Token’s Secret Name: The name of the Secret stored in the FlexAI Secret Manager that holds a Hugging Face Token with access to the model you want to serve.
- Endpoint Configuration: Settings that define how the endpoint behaves, such as scaling policies and vLLM-specific arguments.
Model to Serve
The model served by your Inference Endpoint can be:
- A public or private model from the Hugging Face Model Hub
- A Checkpoint you manually pushed to the FlexAI Checkpoint Manager
- A Checkpoint generated by a FlexAI Training or Fine-tuning Job
Authentication Key
Each Inference Endpoint is protected by an authentication key (secret key). This key must be provided with every request to the endpoint, ensuring that only authorized users can access your deployed models. You can create your own authentication key by assigning it to a Secret in the FlexAI Secret Manager. If an authentication key is not provided, FlexAI will create one for your Inference Endpoint and associate it with a Secret named after your Inference Endpoint: <inference_endpoint_name>-api-key.
This enables you to update the Secret in case you want to refresh the authentication key.
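As an illustration, a request to a deployed endpoint might pass the authentication key as a bearer token. The sketch below is not lifted from FlexAI documentation: it assumes the endpoint exposes vLLM's OpenAI-compatible chat completions route, and the URL, key, and model name are placeholders; check your endpoint's details for the exact URL and authentication scheme.

```python
import requests

# Hypothetical endpoint URL and authentication key -- replace with the values
# shown for your own FlexAI Inference Endpoint.
ENDPOINT_URL = "https://<your-inference-endpoint>/v1/chat/completions"
API_KEY = "<your-authentication-key>"

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},  # the key sent with every request
    json={
        "model": "<served-model-name>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```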
Hugging Face Token’s Secret Name
If your model is gated or private on Hugging Face, you will need to provide a Hugging Face Access Token. Store this token as a Secret in the FlexAI Secret Manager and reference its name in your endpoint configuration. FlexAI will inject this token securely at runtime, allowing the endpoint to pull the model as needed.
Endpoint Configuration
Endpoint configuration defines how your Inference Endpoint behaves. This includes:
- Scaling policies (auto-scaling, min/max replicas)
- vLLM-specific arguments (e.g., batch size, max tokens)
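For reference, the sketch below lists a few vLLM engine arguments that are commonly tuned when serving a model. The argument names come from vLLM itself; whether and how FlexAI exposes each of them is defined by your Endpoint Configuration, so treat the values as illustrative rather than recommended defaults.

```python
# Illustrative vLLM serving arguments (names are standard vLLM engine options;
# availability and defaults on FlexAI depend on your Endpoint Configuration).
vllm_args = {
    "max-model-len": 4096,           # maximum context length (prompt + generated tokens)
    "max-num-seqs": 256,             # upper bound on sequences batched together
    "gpu-memory-utilization": 0.90,  # fraction of GPU memory used for weights + KV cache
    "dtype": "bfloat16",             # weight/activation precision
}
```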
Key Concepts
Endpoint Authentication
FlexAI Inference Endpoints are protected by an authentication token that you can provide during setup. This ensures that only authorized users can access your models.
Auto-scaling
FlexAI Inference Endpoints can automatically scale based on demand. During periods of high traffic, additional resources are allocated to handle requests, ensuring consistent performance; during low-traffic periods, resources are scaled back down to optimize resource utilization.
vLLM
FlexAI Inference Endpoints leverage the vLLM library for efficient model serving. vLLM is designed to optimize the performance of large language models, making it easier to deploy and scale them in production environments. You can learn more about vLLM and how to take advantage of its features in the vLLM documentation.
CLI Reference
The FlexAI CLI’s inference family of commands allows you to manage your Inference Endpoints directly from the command line.
Getting Started
FlexAI Inference Endpoints can be deployed in a few steps. The Quickstart guide will walk you through the process of setting up your first endpoint. Here’s a brief overview of the steps involved:
- Deploying a FlexAI Inference Endpoint from a model hosted on the Hugging Face Hub.
- Deploying a FlexAI Inference Endpoint from a model fine-tuned with FlexAI.
- Querying a FlexAI Inference Endpoint (see the sketch below).
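Since vLLM-based endpoints typically expose an OpenAI-compatible API, querying a deployed endpoint can look roughly like the sketch below. The base URL, authentication key, and served model name are placeholders, not values from FlexAI documentation; follow the Quickstart for the exact querying instructions for your endpoint.

```python
from openai import OpenAI

# Hypothetical values -- substitute your endpoint's URL, authentication key,
# and served model name.
client = OpenAI(
    base_url="https://<your-inference-endpoint>/v1",
    api_key="<your-authentication-key>",
)

completion = client.chat.completions.create(
    model="<served-model-name>",
    messages=[{"role": "user", "content": "Summarize what FlexAI Inference Endpoints do."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```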