Deploy and manage AI models as HTTP API endpoints that you can securely query from your applications.
You can deploy public and private models hosted on the Hugging Face Model Hub or deploy models you’ve fine-tuned using FlexAI.
Fast Deployment: Get models running in minutes
Auto-scaling: Automatic scaling based on demand
Multiple Models: Support for a wide range of vLLM-supported models
Cost Optimization: Pay only for what you use
Monitoring: Built-in metrics and logging
A FlexAI Inference Endpoint consists of the following components: the model being served, an authentication key, an optional Hugging Face Access Token for gated or private models, and the endpoint configuration.
The model served by your Inference Endpoint can be a public or private model hosted on the Hugging Face Model Hub, or a model you’ve fine-tuned with FlexAI.
This flexibility allows you to deploy both open-source and custom models as scalable, production-ready endpoints.
Each Inference Endpoint is protected by an authentication key (secret key). This key must be provided with every request to the endpoint, ensuring that only authorized users can access your deployed models. You can create your own authentication key by assigning it to a Secret in the FlexAI Secret Manager.
If an authentication key is not provided, FlexAI will create one for your Inference Endpoint and associate it with a Secret named after your Inference Endpoint: <inference_endpoint_name>-api-key.
This lets you update the Secret whenever you want to refresh the authentication key.
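As a rough illustration of what an authenticated request can look like, the sketch below sends the key as a Bearer token in the Authorization header, which is how vLLM's OpenAI-compatible server expects API keys. The URL, header convention, model name, and request schema are placeholders and assumptions rather than FlexAI specifics, so check your endpoint's details before reusing them.

```python
# Hypothetical request to an Inference Endpoint; URL, model name, and schema are placeholders.
import os

import requests

ENDPOINT_URL = "https://<your-inference-endpoint-host>/v1/chat/completions"  # placeholder
API_KEY = os.environ["FLEXAI_ENDPOINT_API_KEY"]  # the endpoint's authentication key

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {API_KEY}",  # assumption: key passed as a Bearer token
        "Content-Type": "application/json",
    },
    json={
        "model": "<model-name>",  # placeholder: the model served by your endpoint
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```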
If your model is gated or private on Hugging Face, you will need to provide a Hugging Face Access Token. Store this token as a Secret in the FlexAI Secret Manager, and reference its name in your endpoint configuration. FlexAI will inject this token securely at runtime, allowing the endpoint to pull the model as needed.
Endpoint configuration defines how your Inference Endpoint behaves. These settings allow you to optimize performance, cost, and reliability for your specific use case.
FlexAI Inference Endpoints are protected by an authentication key that you can provide during setup, ensuring that only authorized users can access your models.
FlexAI Inference Endpoints can automatically scale based on demand. During periods of high traffic, additional resources are allocated to handle requests and keep performance consistent; during low-traffic periods, resources are scaled back down to keep resource utilization efficient.
FlexAI Inference Endpoints leverage the vLLM library for efficient model serving. vLLM is designed to optimize large language model serving, using techniques such as continuous batching and PagedAttention, making it easier to deploy and scale models in production environments.
You can learn more about vLLM and how to take advantage of its features in the vLLM documentation.
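Because vLLM exposes an OpenAI-compatible API, one common pattern is to query an endpoint backed by it with the official openai Python client and a custom base URL. Whether your FlexAI endpoint exposes the /v1 routes in exactly this form is an assumption here, and the base URL, environment variable, and model name below are placeholders.

```python
# Sketch of querying an OpenAI-compatible (vLLM-style) endpoint with the openai client.
# The base_url, environment variable, and model name are placeholders, not FlexAI specifics.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-inference-endpoint-host>/v1",  # placeholder endpoint URL
    api_key=os.environ["FLEXAI_ENDPOINT_API_KEY"],  # the endpoint's authentication key
)

completion = client.chat.completions.create(
    model="<model-name>",  # placeholder: the model served by your endpoint
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
)
print(completion.choices[0].message.content)
```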
The FlexAI CLI’s inference family of commands allows you to manage your inference endpoints directly from the command line.
FlexAI Inference Endpoints can be deployed in a few steps. The Quickstart guide will walk you through the process of setting up your first endpoint. Here’s a brief overview of the steps involved:
Deploying a FlexAI Inference Endpoint from a model hosted in the Hugging Face Hub.
Deploying a FlexAI Inference Endpoint from a model fine-tuned with FlexAI.
Querying a FlexAI Inference Endpoint.
The FlexAI Inference Endpoints Quickstart guide’s overview covers each of these steps in more detail.