FlexAI Inference Endpoints

Deploy and manage AI models as HTTP API endpoints that you can securely query from your applications.

You can deploy public and private models hosted on the Hugging Face Model Hub or deploy models you’ve fine-tuned using FlexAI.

Fast Deployment

Get models running in minutes

Auto-scaling

Automatic scaling based on demand

Multiple Models

Support for a wide range of vLLM-supported models

Cost Optimization

Pay only for what you use

Monitoring

Built-in metrics and logging

A FlexAI Inference Endpoint consists of the following components:

  1. Model to Serve: The model you want to serve through a FlexAI Inference Endpoint.
    • For example, a Checkpoint generated by a Training or Fine-tuning Job run with FlexAI.
  2. Authentication Key: The secret key required for querying the Inference Endpoint.
  3. Hugging Face Token’s Secret Name: The name of the Secret stored in the FlexAI Secret Manager that holds a Hugging Face Token with access to the model you want to serve.
  4. Endpoint Configuration: Settings that define how the endpoint behaves, such as scaling policies, and vLLM-specific arguments.

The model served by your Inference Endpoint can be:

  • A public or private model from the Hugging Face Model Hub
  • A Checkpoint you manually pushed to the FlexAI Checkpoint Manager
  • A Checkpoint generated by a FlexAI Training or Fine-tuning Job

This flexibility allows you to deploy both open-source and custom models as scalable, production-ready endpoints.

Each Inference Endpoint is protected by an authentication key (secret key). This key must be provided with every request to the endpoint, ensuring that only authorized users can access your deployed models. You can create your own authentication key by assigning it to a Secret in the FlexAI Secret Manager.

If an authentication key is not provided, FlexAI will create one for your Inference Endpoint and associate it with a Secret named after the endpoint: <inference_endpoint_name>-api-key.

This lets you rotate the authentication key later by updating that Secret.
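
For example, assuming the endpoint expects the key as a standard bearer token (an assumption; the exact header scheme and endpoint URL come from your endpoint's details), a request might look like:

```bash
# Placeholder URL and key value: substitute the details of your own endpoint.
# The Authorization header scheme is an assumption, not confirmed FlexAI behavior.
export FLEXAI_API_KEY="<value-of-your-authentication-key-secret>"

curl -s https://inference.example.flex.ai/my-endpoint/v1/models \
  -H "Authorization: Bearer $FLEXAI_API_KEY"
```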

If your model is gated or private on Hugging Face, you will need to provide a Hugging Face Access Token. Store this token as a Secret in the FlexAI Secret Manager, and reference its name in your endpoint configuration. FlexAI will inject this token securely at runtime, allowing the endpoint to pull the model as needed.
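
As an illustrative sketch (the command name and flags below are assumptions, not confirmed FlexAI CLI syntax), the flow could look like:

```bash
# Hypothetical commands for illustration only; check the FlexAI CLI reference
# for the actual syntax.
export HF_TOKEN="<your-hugging-face-access-token>"
flexai secret create my-hf-token --value "$HF_TOKEN"
# Then reference "my-hf-token" as the Hugging Face Token's Secret Name
# in your Inference Endpoint configuration.
```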

Endpoint configuration defines how your Inference Endpoint behaves. This includes:

  • Scaling policies (auto-scaling, min/max replicas)
  • vLLM-specific arguments (e.g., batch size, max tokens)

These settings allow you to optimize performance, cost, and reliability for your specific use case.
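
For instance, a deployment that sets scaling bounds and passes serving arguments through to vLLM might look like the sketch below. The command shape and the flags before the double dash are assumptions; the arguments after it (--max-model-len, --max-num-seqs) are standard vLLM serving options, and the model name is a placeholder.

```bash
# Hypothetical command shape for illustration; only the arguments after "--"
# (passed through to vLLM) are real vLLM flags.
flexai inference serve my-endpoint \
  --hf-token-secret my-hf-token \
  --min-replicas 1 --max-replicas 4 \
  -- --model mistralai/Mistral-7B-Instruct-v0.3 \
     --max-model-len 8192 \
     --max-num-seqs 64
```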

FlexAI Inference Endpoints are protected by an authentication key that you can provide during setup. This ensures that only authorized users can access your models.

FlexAI Inference Endpoints can automatically scale based on demand: during periods of high traffic, additional resources are allocated to keep performance consistent, and during low-traffic periods, resources are scaled back down so you don't pay for idle capacity.

FlexAI Inference Endpoints leverage the vLLM library for efficient model serving. vLLM is designed to optimize the performance of large language models, making it easier to deploy and scale them in production environments.

You can learn more about vLLM and how to take advantage of its features in the vLLM documentation.

The FlexAI CLI’s inference family of commands allows you to manage your inference endpoints directly from the command line.
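
For orientation, a typical lifecycle with these commands might look like the following sketch. The subcommand names are assumptions based on common CLI patterns, so consult the CLI reference for the exact syntax.

```bash
# Hypothetical subcommands for illustration only.
flexai inference serve my-endpoint ...   # deploy a new endpoint
flexai inference list                    # list your endpoints
flexai inference inspect my-endpoint     # show status and connection details
flexai inference delete my-endpoint      # tear the endpoint down
```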

FlexAI Inference Endpoints can be deployed in a few steps. The Quickstart guide will walk you through the process of setting up your first endpoint. Here’s a brief overview of the steps involved:

  1. Deploying a FlexAI Inference Endpoint from a model hosted in the Hugging Face Hub.

  2. Deploying a FlexAI Inference Endpoint from a model fine-tuned with FlexAI.

  3. Querying a FlexAI Inference Endpoint (sketched below).
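
Because endpoints serve models through vLLM, querying typically goes through an OpenAI-compatible HTTP API. A minimal sketch, assuming a chat-completions route and bearer-token authentication (the URL, model name, and header scheme are placeholders or assumptions):

```bash
# Placeholder URL and model name; substitute your endpoint's actual details.
curl -s https://inference.example.flex.ai/my-endpoint/v1/chat/completions \
  -H "Authorization: Bearer $FLEXAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Hello! What can you do?"}],
        "max_tokens": 128
      }'
```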

The FlexAI Inference Endpoints Quickstart guide's overview covers each of these steps in more detail.