Deploying an Inference Endpoint
Deploying an Inference Endpoint for a model on FlexAI is straightforward: it can be as simple as entering the name of the model you want to deploy.

This quickstart tutorial walks you through the steps to deploy an Inference Endpoint for the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model hosted on Hugging Face.
1. Create an Inference Endpoint for the model you want to deploy:

   ```shell
   flexai inference serve tiny-llama \
     --hf-token-secret hf-access-token \
     -- --model=TinyLlama/TinyLlama-1.1B-Chat-v1.0
   ```

2. Store the API Key you're prompted with. You will need it to authenticate your requests to the Inference Endpoint:

   ```
   flexai inference serve tiny-llama...
   Inference "tiny-llama" started successfully
   Here is your API Key. Please store it safely: 0ad394db-8dba-4395-84d9-c0db1b1be0a8. You can reference it with the name: tiny-llama-api-key
   ```
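To avoid pasting the key into every request by hand, one common pattern (not FlexAI-specific) is to keep it in an environment variable for the current shell session. The variable name below is arbitrary, and the key is the example value from this tutorial; substitute your own:

```shell
# Export the example API key once; later commands can reference the
# variable instead of hard-coding the secret inline.
export TINY_LLAMA_API_KEY="0ad394db-8dba-4395-84d9-c0db1b1be0a8"

# Build the Authorization header from the variable.
echo "Authorization: Bearer $TINY_LLAMA_API_KEY"
```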
3. Run `flexai inference list` to follow the status of your Inference Endpoint and get its URL:

   ```
   NAME       │ STATUS   │ AGE │ ENDPOINT
   ───────────┼──────────┼─────┼───────────────────────────────────────────────────────────────────────────────────
   tiny-llama │ starting │ 32s │ https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai
   ```
4. Use the Inference Endpoint URL to make requests to your model:

   ```shell
   curl -X POST https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer 0ad394db-8dba-4395-84d9-c0db1b1be0a8" \
     -d '{
       "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
       "messages": [{"role": "user", "content": "What is love?"}],
       "max_tokens": 256
     }'
   ```

   curl response (rendered as JSON by piping through jq):

   ```json
   {
     "id": "chatcmpl-e6c2d199-b57d-4904-a1cb-89d604b146d9",
     "object": "chat.completion",
     "created": 1748238665,
     "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
     "choices": [
       {
         "index": 0,
         "message": {
           "role": "assistant",
           "reasoning_content": null,
           "content": "Love is a powerful and meaningful emotion that brings people close together. It is a connection between two individuals that transcends time and space, supporting and nurturing a deep and deepening relationship. Love is the foundation that holds relationships together, reflecting in the happiness, relationships, and harmony surrounding us in every aspect of life. To understand love and its complexities, one would need to delve deep into the arena of psychology, philosophy, anthropology, literature, and social sciences. It is an elusive but natural human reaction, serving as an enabling force for the bonding of our species, and it is an integral aspect of human nature.",
           "tool_calls": []
         },
         "logprobs": null,
         "finish_reason": "stop",
         "stop_reason": null
       }
     ],
     "usage": {
       "prompt_tokens": 20,
       "total_tokens": 162,
       "completion_tokens": 142,
       "prompt_tokens_details": null
     },
     "prompt_logprobs": null,
     "kv_transfer_params": null
   }
   ```
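When calling the endpoint from application code rather than curl, the response body is plain JSON in the shape shown above. Below is a minimal sketch, using only the Python standard library, of pulling the assistant's reply out of a chat-completion response; the payload here is an abridged, hypothetical version of the example output:

```python
import json

# Abridged chat-completion response, modeled on the example output above.
raw_response = """
{
  "object": "chat.completion",
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Love is a powerful and meaningful emotion."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 20, "total_tokens": 162, "completion_tokens": 142}
}
"""

response = json.loads(raw_response)

# The generated text lives at choices[0].message.content.
answer = response["choices"][0]["message"]["content"]
print(answer)

# Token accounting for the reply is reported under "usage".
print(response["usage"]["completion_tokens"])
```

In a real client you would obtain the same dictionary from your HTTP library of choice (for example, `requests.post(url, headers=..., json=...).json()`) instead of a hard-coded string.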
5. Enjoy your AI-powered application!
Inference Playground

After a successful deployment, you can run the `flexai inference inspect` command to get the URL of your FlexAI-hosted "Playground" environment. This is a Chainlit (https://chainlit.io/) based UI that lets you interact with your deployed model in a user-friendly way.
```shell
flexai inference inspect tiny-llama
```

```yaml
metadata:
  name: tiny-llama
  id: ...
  creatorUserID: ...
  ownerOrgID: ...
config:
  device: nvidia
  accelerator: 1
  apiKeySecretName: tiny-llama-api-key
  endpointUrl: https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai
  playgroundUrl: https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-playgro-d41b1b44.platform.flex.ai
# ...
```
When you follow the `playgroundUrl`, you will land in a Chainlit UI where you can interact with your deployed model. You can use this UI to test your model with various types of inputs (images, videos, files, or just text, depending on the model).
