
Command: inference

The flexai inference command manages Inference Endpoints: it lets you create, inspect, list, stop, resume, and delete them, as well as retrieve their logs.

inference delete

Deletes an Inference Endpoint. The Inference Endpoint must be stopped before it can be deleted.

flexai inference delete <inference_endpoint_name>
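
For example, removing an endpoint that is still running is a two-step operation (the endpoint name below is illustrative):

# The endpoint must be stopped before it can be deleted
flexai inference stop mixtral_8x7b

# Once stopped, delete it
flexai inference delete mixtral_8x7b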

inference inspect

Displays detailed information about an Inference Endpoint, including its name, ID, creator, owner, configuration, runtime status, creation date, and its URL.

flexai inference inspect <inference_endpoint_name>

Flags

| Flag   | Type | Optional / Required | Description                            | Example |
|--------|------|---------------------|----------------------------------------|---------|
| --json | Flag | Optional            | Output the information in JSON format  | --json  |

Returned information

| Field                    | Description                                                               | Data Type |
|--------------------------|---------------------------------------------------------------------------|-----------|
| kind                     | The type of resource                                                      | String    |
| metadata                 | Metadata information about the resource                                   | Object    |
| metadata.name            | The name of the Inference Endpoint                                        | String    |
| metadata.id              | The ID of the Inference Endpoint                                          | String    |
| metadata.creatorUserID   | The ID of the user who created the Inference Endpoint                     | String    |
| metadata.ownerOrgID      | The ID of the organization that owns the Inference Endpoint               | String    |
| config                   | Configuration information about the Inference Endpoint                    | Object    |
| config.device            | The desired architecture for the Inference Endpoint                       | String    |
| config.accelerator       | The number of accelerators per server                                     | Integer   |
| config.apiKeySecretName  | The name of the secret containing the API key for the Inference Endpoint  | String    |
| config.endpointUrl       | The URL of the Inference Endpoint                                         | String    |
| config.hfTokenSecretName | The name of the secret containing the Hugging Face token                  | String    |
| config.engineArgs        | Additional arguments specific to vLLM                                     | Object    |
| config.engineArgs.model  | The name of the model being served                                        | String    |
| runtime                  | Runtime information about the Inference Endpoint                          | Object    |
| runtime.status           | The status of the Inference Endpoint                                      | String    |
| runtime.createdAt        | The creation date of the Inference Endpoint                               | String    |
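
As an illustration, inspecting an endpoint with --json returns an object shaped like the fields above. Every value below is made up for the example (the ID is borrowed from the list output further down), and the exact formatting may differ:

flexai inference inspect mixtral_8x7b --json

{
  "kind": "...",
  "metadata": {
    "name": "mixtral_8x7b",
    "id": "e7d81fe6-fa49-4ace-bbbe-81744c465d27",
    "creatorUserID": "...",
    "ownerOrgID": "..."
  },
  "config": {
    "device": "nvidia",
    "accelerator": 1,
    "apiKeySecretName": "mixtral_8x7b-api-key",
    "endpointUrl": "https://inference-e7d81fe6-fa49-4ace-bbbe-81744c465d27-e76aaec3.platform.staging.flexsystems.ai",
    "hfTokenSecretName": "HF_TOKEN_PROD",
    "engineArgs": {
      "model": "mistralai/Mixtral-8x7B-v0.1"
    }
  },
  "runtime": {
    "status": "running",
    "createdAt": "..."
  }
}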

inference list

Lists all the Inference Endpoints that have been created. The output includes the name of the Inference Endpoint, its status, age, and URL.

flexai inference list

Example

          NAME           | STATUS  | AGE |                                     ENDPOINT
-------------------------+---------+-----+--------------------------------------------------------------------------------------------------
 inference-llm-ms-phi-1  | stopped | 4d  | https://inference-60150fa3-e8dd-45e0-9a12-ec827046e10e-1bd4b3a9.platform.staging.flexsystems.ai
 mixtral_8x7b            | running | 14m | https://inference-e7d81fe6-fa49-4ace-bbbe-81744c465d27-e76aaec3.platform.staging.flexsystems.ai
 text_facebook/opt-125m  | running | 7m  | https://inference-a25963fa-ae2d-4b4f-81ad-06a3e0d6a59c-2d06cca7.platform.staging.flexsystems.ai

inference logs

Displays a stream of logs from an Inference Endpoint. The logs include information about the deployment's status, the model being served, and the requests being processed.

flexai inference logs <inference_endpoint_name>
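
Because the logs are streamed to standard output, they can be combined with ordinary shell tools. A minimal sketch, assuming an endpoint named mixtral_8x7b:

# Follow the log stream and keep only lines mentioning errors
flexai inference logs mixtral_8x7b | grep -i error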

inference serve

Creates an Inference Endpoint from a model hosted on Hugging Face. The target model must be part of vLLM's supported models list.

A Secret containing a Hugging Face token is required to serve an Inference Endpoint; its name is passed to the --hf-token-secret flag:

flexai inference serve <inference_endpoint_name> \
--hf-token-secret <name_of_secret_containing_the_hugging_face_token> \
-- --model=<model_name> [<vLLM_arguments>...]

<vLLM_arguments> refers to a list of vLLM Engine Arguments that can be passed to the command after the end-of-options marker (--).

note

<model_name> consists of the organization name followed by the model identifier: <organization>/<model_identifier>. For instance:

  • mistralai/Mixtral-8x7B-v0.1
  • microsoft/phi-1_5

Arguments

| Argument                | Description                                                                                                                           | Example                       |
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
| inference_endpoint_name | Resource name. Must follow the FCS resource naming conventions                                                                        | mixtral_8x7b                  |
| --                      | End-of-options marker. The vLLM_arguments must be passed after this marker                                                            | --                            |
| model_name              | The name of the model to be served. Must be part of the vLLM supported models list and follow the <organization>/<model_identifier> pattern | mistralai/Mixtral-8x7B-v0.1   |
| vLLM_arguments          | Additional arguments specific to the vLLM engine. Note that the --device argument is not supported, since FlexAI handles device selection | --task, --enable-lora, --seed |

Flags

| Flag                  | Type    | Optional / Required | Description                                                                                                                                                                                                                                          | Example                                 |
|-----------------------|---------|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|
| -a, --accels          | Integer | Optional            | Number of accelerators per server. Default is 1.                                                                                                                                                                                                     | --accels 2                              |
| -K, --api-key-secret  | String  | Optional            | Name of the Secret containing the API key for the Inference Endpoint. If not provided, an API key will be automatically generated and displayed (only once) after the creation of the Inference Endpoint, and a new Secret named <inference_endpoint_name>-api-key will be created with the generated API key as its value. | --api-key-secret ENDPOINT_ACCESS_TOKEN |
| -d, --device-arch     | String  | Optional            | Desired architecture. Default is nvidia.                                                                                                                                                                                                             | --device-arch nvidia                    |
| -T, --hf-token-secret | String  | Required            | Name of the Secret containing the Hugging Face token.                                                                                                                                                                                                | --hf-token-secret HF_TOKEN_PROD         |
| --no-queuing          | Flag    | Optional            | Do not queue the request if no agent is available.                                                                                                                                                                                                   | --no-queuing                            |
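
Putting it all together, here is a sketch of a complete serve invocation built only from the flags and arguments documented above (the endpoint name, Secret names, and seed value are illustrative):

# Serve Mixtral-8x7B on 2 accelerators; vLLM arguments follow the -- marker
flexai inference serve mixtral_8x7b \
  --accels 2 \
  --hf-token-secret HF_TOKEN_PROD \
  --api-key-secret ENDPOINT_ACCESS_TOKEN \
  -- --model=mistralai/Mixtral-8x7B-v0.1 --seed 42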

Gated models

Keep in mind that some models are "Gated": you must agree to their license agreement, privacy policy, or similar terms before you can use them.

You can visit the model's page on the Hugging Face Hub to see if it is marked as "Gated"; gated models are identified by a "Gated model" indicator symbol. If the model is "Gated", the page will explain how to proceed. Example:

Mixtral-8x7B-v0.1 model page on Hugging Face Hub: before being granted access

If you have already gone through the process, you will find a badge on the model's page indicating that you have access to the model. Example:

Mixtral-8x7B-v0.1 model page on Hugging Face Hub: after being granted access

note

Make sure you go through the process using the same Hugging Face account that owns the Access Token stored in the Secret passed to the --hf-token-secret flag.

inference resume

Resumes a previously stopped Inference Endpoint.

flexai inference resume <inference_endpoint_name>

inference stop

Stops an Inference Endpoint.

flexai inference stop <inference_endpoint_name>