inference serve

Creates an Inference Endpoint from a model hosted on Hugging Face. The target model must be part of the vLLM supported models list 🔗.

A Secret containing a Hugging Face Access Token 🔗 is required to serve an Inference Endpoint. Its name is passed to the --hf-token-secret flag below:

flexai inference serve <inference_endpoint_name>
[ --accels <number_of_accelerators> ]
[ --api-key-secret <name_of_secret_containing_the_api_key> ]
[ --device-arch <device_architecture> ]
[ --hf-token-secret <name_of_secret_containing_the_hugging_face_token> ]
[ --max-replicas <max_replicas> ]
[ --min-replicas <min_replicas> ]
[ --no-queuing ]
( -- --model=<model_name> [<vLLM_arguments>...] )

<vLLM_arguments> refers to a list of vLLM Engine Arguments that can be passed to the command after the End-of-options marker (--). The list of supported arguments can be found in the next section.
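For example, the following command creates an endpoint named mixtral_8x7b serving Mixtral-8x7B-v0.1 on 4 accelerators. The vLLM Engine Argument after the -- marker (--max-model-len) is illustrative; check the next section to confirm it is supported:

flexai inference serve mixtral_8x7b \
  --accels 4 \
  --hf-token-secret HF_TOKEN_PROD \
  -- --model=mistralai/Mixtral-8x7B-v0.1 --max-model-len=4096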

inference_endpoint_name
<string>
Required
String

The name of the Inference Endpoint to create.

Examples
  • mixtral_8x7b
--
<string>
Required
String

The End-of-options marker.

Everything after this marker is passed to the vLLM Engine.

Examples
  • --
model_name
<string>
Required
String

The name of the model to serve, passed as --model=<model_name>. The model must be part of the vLLM supported models list.

Examples
  • --model=mistralai/Mixtral-8x7B-v0.1
-a, --accels
<integer>
Optional
Default Value: 1
Integer

Number of accelerators to use for the workload.

Examples
  • --accels 4
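As a sketch, pairing --accels with vLLM's --tensor-parallel-size Engine Argument (assuming it appears in the supported list in the next section) shards the model across all requested accelerators; the endpoint name is illustrative:

flexai inference serve my-endpoint \
  --accels 4 \
  --hf-token-secret HF_TOKEN_PROD \
  -- --model=mistralai/Mixtral-8x7B-v0.1 --tensor-parallel-size=4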
-d, --device-arch
<option_list>
Optional
Default Value: nvidia
Option list
  • nvidia
Examples
  • --device-arch nvidia
--api-key-secret
<string>
Optional
String

The name of a FlexAI Secret containing the API key you want to use to protect the Inference Endpoint.

If not provided:

  1. An API key will be auto-generated. Its value will be displayed only once, after the Inference Endpoint is created.
  2. A new FlexAI Secret named <inference_endpoint_name>-api-key containing the auto-generated API key will be created.
Examples
  • --api-key-secret ENDPOINT_ACCESS_TOKEN
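As a rough sketch of how clients then use the key, assuming the endpoint exposes vLLM's OpenAI-compatible API (the URL below is a placeholder, not a real FlexAI address):

curl https://<inference_endpoint_url>/v1/chat/completions \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mixtral-8x7B-v0.1", "messages": [{"role": "user", "content": "Hello"}]}'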
--max-replicas
<integer>
Optional
Integer
Integer

The maximum number of replicas to use for the Inference Endpoint.

Examples
  • --max-replicas 4
--min-replicas
<integer>
Optional
Integer
Integer

The minimum number of replicas to use for the Inference Endpoint.

Examples
  • --min-replicas 4
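For instance, a sketch of an autoscaling configuration that keeps at least one replica running and allows bursting to four (the endpoint name is illustrative):

flexai inference serve my-endpoint \
  --min-replicas 1 \
  --max-replicas 4 \
  --hf-token-secret HF_TOKEN_PROD \
  -- --model=mistralai/Mixtral-8x7B-v0.1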
--no-queuing
<boolean>
Optional
Flag

Disable queuing for the Inference Endpoint.

This means that if there are not enough resources available in the cluster, the request will be rejected immediately instead of being queued.

Examples
  • --no-queuing
--hf-token-secret
<string>
Required
String

The name of the FlexAI Secret containing the Hugging Face token that will be used to access the model.

You can visit the Hugging Face Hub 🔗 to create a token.

Examples
  • --hf-token-secret HF_TOKEN_PROD

Keep in mind that some models are “Gated”, meaning you must agree to their license agreement, privacy policy, or similar terms before you can use them.

You can visit the model’s page on the Hugging Face Hub to see whether it is marked as “Gated”, indicated by Hugging Face’s “Gated model” symbol. If the model is “Gated”, you will find the necessary information there on how to proceed. Example:

Mixtral-8x7B-v0.1 model page on Hugging Face Hub: before being granted access

If you have already gone through the process, you will find a badge on the model’s page indicating that you have access to the model. Example:

Mixtral-8x7B-v0.1 model page on Hugging Face Hub: after being granted access