# inference serve

> Create a new inference endpoint from a Hugging Face model

Creates an Inference Endpoint from a model hosted by Hugging Face. The target model must be part of the [vLLM supported models list](https://docs.vllm.ai/en/latest/models/supported_models.html).

## Usage

```bash theme={null}
flexai inference serve <inference_endpoint_name> [
    --accels <number_of_accelerators>
    --accel-sm-slices <number_of_slices>
    --affinity <key1=value1,key2=value2,...>
    --api-key-secret <flexai_secret_name>
    --checkpoint <checkpoint_name_or_uuid>
    --device-arch <device_architecture>
    --hf-token-secret <flexai_secret_name>
    --max-replicas <max_replicas>
    --min-replicas <min_replicas>
    --no-queuing
    --runtime <runtime_name>
  ] (-- --model=<model_name> [<VLLM_Arguments...>])
```

## Arguments

| Argument                  | Type        | Required | Description                                                                                                                                                                                                                                |
| ------------------------- | ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `inference_endpoint_name` | string      | Yes      | The name of the Inference Endpoint to create. It must be unique within the organization and must follow the [Resource Naming Conventions](/best-practices/resource-naming-conventions/).                                                   |
| `model_name`              | string      | Yes      | The name of the Hugging Face model to serve, passed as `--model=<model_name>` after the end-of-options marker (`--`). It must appear in the [vLLM supported models list](https://docs.vllm.ai/en/latest/models/supported_models.html). |
| `vllm_args`               | option-list | No       | [vLLM Engine Arguments](https://docs.vllm.ai/en/latest/configuration/engine_args.html) passed after the end-of-options marker (`--`). The `--device` argument is not supported because FlexAI handles device selection. |

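As a rough sketch of how the positional name and the vLLM arguments fit together, the snippet below passes one engine argument after the end-of-options marker. The endpoint name, the model, and the `--max-model-len` value are illustrative assumptions, and the `run` helper is hypothetical: it only prints the command unless `DRY_RUN=0` is set, so the command can be reviewed before anything is created.

```bash
# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Illustrative values (assumptions, not defaults from this page).
endpoint_name="tinyllama-demo"
model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Everything after the end-of-options marker (--) is forwarded to vLLM.
# --max-model-len is a vLLM engine argument; --device is not supported.
run flexai inference serve "$endpoint_name" \
  -- --model="$model_name" --max-model-len=2048
```

Keeping the FlexAI flags before `--` and the vLLM arguments after it makes it clear which tool consumes each option.
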
## Flags

| Flag                | Short | Type      | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ------------------- | ----- | --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--accel-sm-slices` | `-s`  | integer   | Number of slices to divide each streaming multiprocessor (SM) into. |
| `--accels`          | `-a`  | integer   | Number of accelerators/GPUs to use. Default: `1`                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| `--affinity`        |       | key-value | Pins the Inference Endpoint to a specific cluster. Format: `cluster=<cluster_name>`. Use [`flexai cluster list`](/cli/reference/cluster/list) to see clusters available to your organization. Useful when you need a specific accelerator type (for example, NVIDIA H100 vs A100) that lives on a particular cluster. The only recognized key today is `cluster`. When set, this overrides `--device-arch`: the cluster's hardware determines the accelerator architecture. Default: `[]` |
| `--api-key-secret`  | `-K`  | string    | The name of a [FlexAI Secret](/platform-services/secret-manager) containing the API key that protects the Inference Endpoint. If not provided, an API key is auto-generated and stored in a new FlexAI Secret named `<inference_endpoint_name>-api-key`; its value is shown only once, after the Inference Endpoint is created. |
| `--checkpoint`      | `-C`  | string    | A Checkpoint to serve the Inference Endpoint from. The name of a previously pushed Checkpoint (use [`flexai checkpoint list`](/cli/reference/checkpoint/list) to see available Checkpoints) or the UUID of an [*Inference Ready* Checkpoint](/platform-services/checkpoint-manager/inference-ready-checkpoints/) generated during the execution of a Training or Fine-tuning job (use [`flexai training checkpoints`](/cli/reference/training/checkpoints) to see available Checkpoints). |
| `--device-arch`     | `-d`  | string    | The architecture of the device to run the Inference Endpoint on. Default: `nvidia`                                                                                                                                                                                                                                                                                                                                                                                                        |
| `--help`            | `-h`  | boolean   | Displays this help page.                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| `--hf-token-secret` | `-T`  | string    | The name of the FlexAI Secret containing the Hugging Face token that will be used to access the model.                                                                                                                                                                                                                                                                                                                                                                                    |
| `--max-replicas`    |       | integer   | The maximum number of replicas to use for the Inference Endpoint. Visit the [FlexAI Inference Autoscaling](/core-services/inference/autoscaling) page to learn more about how autoscaling works and how to configure it.                                                                                                                                                                                                                                                                  |
| `--min-replicas`    |       | integer   | The minimum number of replicas to use for the Inference Endpoint. Visit the [FlexAI Inference Autoscaling](/core-services/inference/autoscaling) page to learn more about how autoscaling works and how to configure it.                                                                                                                                                                                                                                                                  |
| `--no-queuing`      |       | boolean   | Disable queuing for the Inference Endpoint. This means that if there are not enough resources available in the cluster, the request will be rejected immediately instead of being queued.                                                                                                                                                                                                                                                                                                 |
| `--runtime`         | `-r`  | string    | The name of the runtime to use for the Inference Endpoint. If not provided, the default runtime set for the organization will be used.                                                                                                                                                                                                                                                                                                                                                    |
| `--verbose`         | `-v`  | boolean   | Provides more detailed output when initiating an Inference Serving operation.                                                                                                                                                                                                                                                                                                                                                                                                             |

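The flags above can be combined in a single invocation. The following is a minimal sketch, not a recommended configuration: the endpoint name, the cluster name, and the replica counts are assumptions chosen for illustration, and the `run` helper is a hypothetical dry-run wrapper that prints the command unless `DRY_RUN=0` is set.

```bash
# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Illustrative values (assumptions).
endpoint_name="autoscaled-demo"
cluster_name="my-h100-cluster"  # hypothetical; pick one from `flexai cluster list`
model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Pin the endpoint to one cluster, bound the replica count, and
# reject the request immediately when resources are unavailable.
run flexai inference serve "$endpoint_name" \
  --accels 1 \
  --affinity "cluster=$cluster_name" \
  --min-replicas 1 \
  --max-replicas 3 \
  --no-queuing \
  -- --model="$model_name"
```

Because `--affinity` overrides `--device-arch`, the sketch omits `--device-arch` and lets the cluster's hardware determine the accelerator architecture.
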
## Examples

### Gated Models

Some models are "Gated": you must accept their license agreement, privacy policy, or similar terms before you can use them.

Visit the model's page on the Hugging Face Hub to check whether it is marked as "Gated". Gated models are identified by a special indicator symbol.

If the model is "Gated", its page explains how to request access.

<img src="https://mintcdn.com/flexai-51114f49/dNvzWOTcJQw0Ozyk/assets/images/other/hugging-face-model-gated--mistral-gated-before--light.jpg?fit=max&auto=format&n=dNvzWOTcJQw0Ozyk&q=85&s=771386d77514b2ef42e0151a1ef0e86c" alt="Mistral-7B-v0.1 model page on Hugging Face Hub: Before being granted access" width="804" height="428" data-path="assets/images/other/hugging-face-model-gated--mistral-gated-before--light.jpg" />

If you have already gone through the process, you will find a badge on the model's page indicating that you have access to the model.

<img src="https://mintcdn.com/flexai-51114f49/dNvzWOTcJQw0Ozyk/assets/images/other/hugging-face-model-gated--mistral-gated-after--light.jpg?fit=max&auto=format&n=dNvzWOTcJQw0Ozyk&q=85&s=4637fe30850f7b5464680cd603fe4b97" alt="Mistral-7B-v0.1 model page on Hugging Face Hub: After being granted access" width="1750" height="1188" data-path="assets/images/other/hugging-face-model-gated--mistral-gated-after--light.jpg" />

<Note>
  Make sure you go through the process using **the same Hugging Face account** that owns the Access Token stored in the Secret passed to the `--hf-token-secret` flag.
</Note>

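A gated model such as the one pictured above could then be served along these lines. This is a sketch under stated assumptions: the endpoint name is illustrative, and `hf-token` is a hypothetical FlexAI Secret assumed to already hold a Hugging Face Access Token for an account that has been granted access. The `run` helper is a hypothetical dry-run wrapper that prints the command unless `DRY_RUN=0` is set.

```bash
# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Illustrative values (assumptions).
endpoint_name="mistral-7b-demo"
hf_secret_name="hf-token"  # hypothetical FlexAI Secret holding the Hugging Face token

# --hf-token-secret names the Secret; the token itself never appears on the command line.
run flexai inference serve "$endpoint_name" \
  --hf-token-secret "$hf_secret_name" \
  -- --model=mistralai/Mistral-7B-v0.1
```
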
Learn more about deploying an Inference Endpoint from a Private or Gated model in the [Creating an Inference Endpoint: Private Model](/core-services/inference/quickstart/create-private/) quickstart guide.
