FlexAI Inference - Public Model
Weβll get started by creating an Inference Endpoint for a public model hosted in the Hugging Face Hub that does not require us to provide a Hugging Face Access Token βThis will allow us to focus on the deployment process itself for now.
Weβll be using https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0 π.
Creating an Inference Endpoint
Section titled βCreating an Inference Endpointβ- Navigate to the Inference section from either the navigation bar or the card on the home page.
- Select the "+ New" button to display the "Launch Inference" panel.
- Fill out the Launch Inference form according to the instructions below.
The Launch Inference form
The Launch Inference form consists of a set of required and optional fields that you can use to customize your deployment.
Required Fields
- Name: A unique name for your inference endpoint. This will be used to identify your endpoint in the FlexAI console. It must follow the FlexAI resource naming conventions.
- Hugging Face Model: The name of the model to deploy.
- Cluster: The cluster where the Training workload will run. It can be selected from a dropdown list of available clusters in your FlexAI account.
Form Values
| Field | Value |
|---|---|
| Name | quickstart-inference-tinyLlama |
| Hugging Face Model | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| Cluster | Your organization's designated cluster |
Other fields
There are a few optional fields that you can use to customize your deployment:
- Hugging Face Token: Only required if the model you want to deploy is private or requires authentication.
- API Key: A secret key that will be used to authenticate requests to your Inference endpoint. If left empty, a random API Key will be generated and displayed to you after you initiate the deployment process. Make sure to copy it and store it in a safe place, as you will not be able to see it again.
- vLLM Parameters: A set of arguments that will be passed to vLLM. You can use this to customize the behavior of the vLLM server, such as setting the maximum number of tokens to generate, the temperature, and other parameters.
Starting the Inference Endpoint
After filling out the form, select the Submit button to start the Inference Endpoint deployment.
You should get a confirmation window displaying the details of your Inference Endpoint, including the API Key that needs to be used to authenticate requests towards the endpoint.
-
Create a new Inference endpoint.
Terminal window flexai inference serve tiny-llama \--hf-token-secret hf_token \-- --model=TinyLlama/TinyLlama-1.1B-Chat-v1.0 -
Store the API Key youβre prompted with. You will need it to authenticate your requests to the Inference Endpoint.
flexai inference serve tiny-llama... Inference endpoint "tiny-llama" started successfullyHere is your API Key. Please store it safely: 0ad394db-8dba-4395-84d9-c0db1b1be0a8You can reference it with the secret name: tiny-llama-api-key -
Run
flexai inference listto follow the status of your Inference Endpoint and get its URL.Terminal window flexai inference listflexai inference list NAME β STATUS β AGE β ENDPOINTββββββββββββΌβββββββββββββββββββΌββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββtiny-llama β starting β 32s β https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai
Checking the status of your Inference Endpoint
Section titled βChecking the status of your Inference EndpointβAfter a few minutes, your Inference Endpoint should be up and running.
Inference Endpoint Details
You can select the gear icon βοΈ (labeled as Configure) in the Actions field of the Inference Endpoint list row of your newly created Endpoint to open the details panel of the Inference Endpoint deployment.
The Details tab will be opened by default, showing you all the relevant information about your Inference Endpoint.
This tab provides you with detailed information about your Inference Endpoint, including:
The Summary tab
| Field | Description |
|---|---|
ID | The unique identifier of the Inference Endpoint. |
Name | The name you assigned to the Inference Endpoint. |
Status | The current status of the Inference Endpoint (e.g., Running, Stopped, etc.). |
URL | The base URL of the Inference Endpoint, which you can use to query the model. |
Playground URL | The URL of the Inference Playground, a user-friendly interface to interact with your deployed model. |
Dashboard URL | The URL of the Inference Endpoint dashboard, where you can monitor the performance and usage of your model. |
Configuration
| Field | Description |
|---|---|
Device Architecture | The architecture of the device where the Inference Endpoint is running (e.g., nvidia). |
Runtime Args | The vLLM runtime arguments that were used to deploy the Inference Endpoint. These can be customized when creating or updating the Inference Endpoint. |
HF Token Secret Name | The name of the FlexAI Secret that contains the Hugging Face Access Token, if applicable. This is only shown if the Inference Endpoint requires a Hugging Face Access Token to access the model. |
API Key Secret Name | The name of the FlexAI Secret that contains the API Key used to authenticate requests to the Inference Endpoint. |
The Activity tab
The Activity tab provides you with a timeline of events related to your Inference Endpoint, including deployment status changes, scaling events, and more.
The Logs tab
The Logs tab provides you with real-time logs from your Inference Endpoint, allowing you to monitor its activity and troubleshoot any issues that may arise.
You can use the Search bar input field to filter the logs by a specific keyword. This is useful to quickly find relevant information in the logs.
flexai inference list
The flexai inference list command offers a quick overview of the current status and HTTP endpoint of your Inference Endpoints.
flexai inference listOutput:
NAME β STATUS β AGE β ENDPOINTββββββββββββΌβββββββββββββββββββΌββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββtiny-llama β starting β 32s β https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.aiflexai inference inspect
The flexai inference inspect command provides detailed information about a specific Inference Endpoint, including its current status, configuration, and lifecycle.
flexai inference inspect tiny-llamaOutput:
kind: Inferencemetadata: name: tiny-llama id: e0ed242b-dbd3-423e-b525-76beac222d44 creatorUserID: 16e289cc-c81b-4a15-91d9-0e2aae00a317 ownerOrgId: 270a5476-b91a-442f-8a13-852ef7bb5b9cconfig: device: nvidia accelerator: 1 apiKeySecretName: tiny-llama-api-key endpointUrl: https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-7de0c0f4.platform.flex.ai playgroundUrl: https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-playgro-310c6738.platform.flex.ai hfTokenSecretName: "" engineArgs: model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 runtime: ""runtime: status: running createdAt: "2025-06-30T11:41:54Z" selectedAgentId: k8s-training-sesterce-001-CLIENT-PROD-client-prod queuePosition: 0 lifecycleEvents: - type: Execution status: Enqueued message: "" raisedAt: "2025-06-30T11:42:14Z" - type: Execution status: Started message: "" raisedAt: "2025-06-30T11:42:15Z"{ "kind": "Inference", "metadata": { "name": "tiny-llama", "id": "e0ed242b-dbd3-423e-b525-76beac222d44", "creatorUserID": "16e2894c-c81b-4a15-91d9-0e2aae00a317", "ownerOrgID": "108dddec-e922-49b8-a466-4d7ed5dcc746" }, "config": { "device": "nvidia", "accelerator": 1, "apiKeySecretName": "tiny-llama-api-key", "endpointUrl": "https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-7de0c0f4.platform.flex.ai", "playgroundUrl": "https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-playgro-310c6738.platform.flex.ai", "hfTokenSecretName": "", "engineArgs": { "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0" }, "runtime": "" }, "runtime": { "status": "running", "createdAt": "2025-06-30T11:41:54Z", "selectedAgentId": "k8s-training-sesterce-001-CLIENT-PROD-client-prod", "queuePosition": 0, "lifecycleEvents": [ { "type": "Execution", "status": "Enqueued", "message": "", "raisedAt": "2025-06-30T11:42:14Z" }, { "type": "Execution", "status": "Started", "message": "", "raisedAt": "2025-06-30T11:42:15Z" } ] }}flexai inference logs
flexai inference logs tiny-llamaOutput:
[...][INFO] Starting inference for tiny-llama[INFO] Inference completed successfully[...]Managing your Inference Endpoint
Section titled βManaging your Inference EndpointβThe Inference Endpoints tableβs Actions column provides a set of actions that you can use to manage your Inference Endpoint:
- Configure: To access its Details panel.
- Pause: To temporarily stop the Inference Endpoint without deleting it.
- Delete: To permanently remove the Inference Endpoint.
- Resume: To restart a paused Inference Endpoint.
The FlexAI CLI provides a set of commands that you can use to manage your Inference Endpoints:
flexai inference delete- Deletes an Inference Endpoint.flexai inference scale- Allows for the definition of scaling policies for an Inference Endpoint.flexai inference stop- Stops an Inference Endpoint.