
Command: inference

The flexai inference command manages Inference Endpoints: it lets you create, inspect, list, stop, resume, and delete them, as well as retrieve their logs.

inference delete

Deletes an Inference Endpoint. The Inference Endpoint must be stopped before it can be deleted.

flexai inference delete <inference_endpoint_name>
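
For example, removing an endpoint that is still running is a two-step operation (the endpoint name below is illustrative):

# The endpoint must be stopped before it can be deleted
flexai inference stop mixtral_8x7b

# Once stopped, delete it
flexai inference delete mixtral_8x7b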

inference inspect

Displays detailed information about an Inference Endpoint, including its name, ID, creator, owner, configuration, runtime status, creation date, and its URL.

flexai inference inspect <inference_endpoint_name>

Flags

| Flag   | Type | Optional / Required | Description                            | Example |
|--------|------|---------------------|----------------------------------------|---------|
| --json | Flag | Optional            | Output the information in JSON format  | --json  |

Returned information

| Field                    | Description                                                               | Data Type |
|--------------------------|---------------------------------------------------------------------------|-----------|
| kind                     | The type of resource                                                      | String    |
| metadata                 | Metadata information about the resource                                   | Object    |
| metadata.name            | The name of the Inference Endpoint                                        | String    |
| metadata.id              | The ID of the Inference Endpoint                                          | String    |
| metadata.creatorUserID   | The ID of the user who created the Inference Endpoint                     | String    |
| metadata.ownerOrgID      | The ID of the organization that owns the Inference Endpoint               | String    |
| config                   | Configuration information about the Inference Endpoint                    | Object    |
| config.device            | The desired architecture for the Inference Endpoint                       | String    |
| config.accelerator       | The number of accelerators per server                                     | Integer   |
| config.apiKeySecretName  | The name of the secret containing the API key for the Inference Endpoint  | String    |
| config.endpointUrl       | The URL of the Inference Endpoint                                         | String    |
| config.hfTokenSecretName | The name of the secret containing the Hugging Face token                  | String    |
| config.engineArgs        | Additional arguments specific to vLLM                                     | Object    |
| config.engineArgs.model  | The name of the model being served                                        | String    |
| runtime                  | Runtime information about the Inference Endpoint                          | Object    |
| runtime.status           | The status of the Inference Endpoint                                      | String    |
| runtime.createdAt        | The creation date of the Inference Endpoint                               | String    |
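
As an illustration, inspecting an endpoint with --json returns an object shaped like the fields above. Every value below is made up for the example (the ID is borrowed from the list output further down), and the exact formatting may differ:

flexai inference inspect mixtral_8x7b --json

{
  "kind": "...",
  "metadata": {
    "name": "mixtral_8x7b",
    "id": "e7d81fe6-fa49-4ace-bbbe-81744c465d27",
    "creatorUserID": "...",
    "ownerOrgID": "..."
  },
  "config": {
    "device": "nvidia",
    "accelerator": 1,
    "apiKeySecretName": "mixtral_8x7b-api-key",
    "endpointUrl": "https://inference-e7d81fe6-fa49-4ace-bbbe-81744c465d27-e76aaec3.platform.staging.flexsystems.ai",
    "hfTokenSecretName": "HF_TOKEN_PROD",
    "engineArgs": {
      "model": "mistralai/Mixtral-8x7B-v0.1"
    }
  },
  "runtime": {
    "status": "running",
    "createdAt": "..."
  }
}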

inference list

Lists all the Inference Endpoints that have been created. The output includes the name of the Inference Endpoint, its status, age, and URL.

flexai inference list

Example

          NAME           | STATUS  | AGE |                                     ENDPOINT
-------------------------+---------+-----+--------------------------------------------------------------------------------------------------
 inference-llm-ms-phi-1  | stopped | 4d  | https://inference-60150fa3-e8dd-45e0-9a12-ec827046e10e-1bd4b3a9.platform.staging.flexsystems.ai
 mixtral_8x7b            | running | 14m | https://inference-e7d81fe6-fa49-4ace-bbbe-81744c465d27-e76aaec3.platform.staging.flexsystems.ai
 text_facebook/opt-125m  | running | 7m  | https://inference-a25963fa-ae2d-4b4f-81ad-06a3e0d6a59c-2d06cca7.platform.staging.flexsystems.ai

inference logs

Displays a stream of logs from an Inference Endpoint. The logs include information about the deployment's status, the model being served, and the requests being processed.

flexai inference logs <inference_endpoint_name>
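
Because the logs are streamed to standard output, they can be combined with ordinary shell tools. A minimal sketch, assuming an endpoint named mixtral_8x7b:

# Follow the log stream and keep only lines mentioning errors
flexai inference logs mixtral_8x7b | grep -i error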

inference serve

Creates an Inference Endpoint from a model hosted on Hugging Face. The target model must be part of vLLM's supported models list.

A Secret containing a Hugging Face token is required to serve an Inference Endpoint; its name is passed to the --hf-token-secret flag:

flexai inference serve <inference_endpoint_name> \
--hf-token-secret <name_of_secret_containing_the_hugging_face_token> \
-- --model=<model_name> [<vLLM_arguments>...]

<vLLM_arguments> refers to a list of vLLM Engine Arguments that can be passed to the command after the end-of-options marker (--).

note

<model_name> consists of the organization name followed by the model identifier: <organization>/<model_identifier>. For instance:

  • mistralai/Mixtral-8x7B-v0.1
  • microsoft/phi-1_5

Arguments

| Argument                | Description                                                                                                                           | Example                       |
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
| inference_endpoint_name | Resource name. Must follow the FCS resource naming conventions                                                                        | mixtral_8x7b                  |
| --                      | End-of-options marker. The vLLM_arguments must be passed after this marker                                                            | --                            |
| model_name              | The name of the model to be served. Must be part of the vLLM supported models list and follow the <organization>/<model_identifier> pattern | mistralai/Mixtral-8x7B-v0.1   |
| vLLM_arguments          | Additional arguments specific to the vLLM engine. Note that the --device argument is not supported, since FlexAI handles device selection | --task, --enable-lora, --seed |

Flags

| Flag                  | Type    | Optional / Required | Description                                                                                                                                                                                                                                          | Example                                 |
|-----------------------|---------|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|
| -a, --accels          | Integer | Optional            | Number of accelerators per server. Default is 1.                                                                                                                                                                                                     | --accels 2                              |
| -K, --api-key-secret  | String  | Optional            | Name of the Secret containing the API key for the Inference Endpoint. If not provided, an API key will be automatically generated and displayed (only once) after the creation of the Inference Endpoint, and a new Secret named <inference_endpoint_name>-api-key will be created with the generated API key as its value. | --api-key-secret ENDPOINT_ACCESS_TOKEN |
| -d, --device-arch     | String  | Optional            | Desired architecture. Default is nvidia.                                                                                                                                                                                                             | --device-arch nvidia                    |
| -T, --hf-token-secret | String  | Required            | Name of the Secret containing the Hugging Face token.                                                                                                                                                                                                | --hf-token-secret HF_TOKEN_PROD         |
| --no-queuing          | Flag    | Optional            | Do not queue the request if no agent is available.                                                                                                                                                                                                   | --no-queuing                            |
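
Putting it all together, here is a sketch of a complete serve invocation built only from the flags and arguments documented above (the endpoint name, Secret names, and seed value are illustrative):

# Serve Mixtral-8x7B on 2 accelerators; vLLM arguments follow the -- marker
flexai inference serve mixtral_8x7b \
  --accels 2 \
  --hf-token-secret HF_TOKEN_PROD \
  --api-key-secret ENDPOINT_ACCESS_TOKEN \
  -- --model=mistralai/Mixtral-8x7B-v0.1 --seed 42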

Gated models

Keep in mind that some models are "Gated": you must agree to their license agreement, privacy policy, or similar terms before you can use them.

You can visit the model's page on the Hugging Face Hub to see if it is marked as "Gated"; gated models are identified by a "Gated model" indicator symbol. If the model is "Gated", the page will explain how to proceed. Example:

Mixtral-8x7B-v0.1 model page on Hugging Face Hub: before being granted access

If you have already gone through the process, you will find a badge on the model's page indicating that you have access to the model. Example:

Mixtral-8x7B-v0.1 model page on Hugging Face Hub: after being granted access

note

Make sure you go through the process using the same Hugging Face account that owns the Access Token stored in the Secret passed to the --hf-token-secret flag.

inference resume

Resumes a previously stopped Inference Endpoint.

flexai inference resume <inference_endpoint_name>

inference stop

Stops an Inference Endpoint.

flexai inference stop <inference_endpoint_name>