# Deploying an Inference Endpoint for a private model
Deploying an Inference Endpoint for a model on FlexAI is straightforward and can be as simple as entering the name of the model you want to deploy. However, to deploy a private model from Hugging Face, or a model that requires you to agree to its terms and conditions, you will need to provide a Hugging Face Access Token, which you can create via the Hugging Face website.

This quickstart tutorial guides you through generating a Hugging Face Access Token and using it to create an Inference Endpoint for the facebook/opt-125m model hosted on Hugging Face.
1. Create a Hugging Face Access Token via the Hugging Face website.

2. Store your Hugging Face Access Token using the FlexAI Secret Manager:

   ```bash
   flexai secret create hf-access-token
   ```

3. Create an Inference Endpoint for the model you want to deploy:

   ```bash
   flexai inference serve fbopt125 \
     --hf-token-secret hf-access-token \
     -- --model=facebook/opt-125m
   ```
4. Store the API Key you're prompted with. You will need it to authenticate your requests to the Inference Endpoint.

   ```
   flexai inference serve fbopt125...
   Inference "fbopt125" started successfully
   Here is your API Key. Please store it safely: 0ad394db-8dba-4395-84d9-c0db1b1be0a8
   You can reference it with the name: fbopt125-api-key
   ```
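   One simple way to keep the key out of later command lines is to export it as a shell environment variable (a general shell practice, not a FlexAI requirement; the variable name below is arbitrary):

   ```bash
   # Hold the API Key from the output above in an environment variable
   export FLEXAI_API_KEY="0ad394db-8dba-4395-84d9-c0db1b1be0a8"
   ```

   You can then authenticate requests with `-H "Authorization: Bearer $FLEXAI_API_KEY"` instead of pasting the key directly.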
5. Run `flexai inference list` to follow the status of your Inference Endpoint and get its URL:

   ```
   flexai inference list

    NAME     │ STATUS   │ AGE │ ENDPOINT
   ──────────┼──────────┼─────┼───────────────────────────────────────────────────────────────────
    fbopt125 │ starting │ 12s │ https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai
   ```
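   To re-run this check automatically until the endpoint leaves the `starting` state, plain shell tooling is enough (this assumes the common `watch` utility is installed on your machine):

   ```bash
   # Re-run the status listing every 5 seconds; press Ctrl+C to stop
   watch -n 5 flexai inference list
   ```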
6. Use the Inference Endpoint URL to make requests to your model:

   ```bash
   curl -X POST https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai/v1/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer 0ad394db-8dba-4395-84d9-c0db1b1be0a8" \
     -d '{"model": "facebook/opt-125m", "prompt": "What is love?", "max_tokens": 18, "temperature": 0}'
   ```

   Notice how you can add additional parameters to the request, such as `temperature`, `top_p`, `top_k`, etc. This allows you to customize the behavior of the model during inference.

   curl response (rendered as JSON by using jq):

   ```json
   {
     "id": "cmpl-a9bad9be-e3c3-4999-bd5c-5d64490caa87",
     "object": "text_completion",
     "created": 1748238500,
     "model": "facebook/opt-125m",
     "choices": [
       {
         "index": 0,
         "text": "\nLove is the feeling of being loved.",
         "logprobs": null,
         "finish_reason": "length",
         "stop_reason": null,
         "prompt_logprobs": null
       }
     ],
     "usage": {
       "prompt_tokens": 5,
       "total_tokens": 14,
       "completion_tokens": 9,
       "prompt_tokens_details": null
     }
   }
   ```
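Because the response is plain JSON, you can also pipe the same request through `jq` to pull out just the generated text. This is a small shell sketch reusing the endpoint URL and API Key from the steps above; it assumes `jq` is installed:

```bash
# Send the same completion request and print only the generated text
curl -s -X POST https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 0ad394db-8dba-4395-84d9-c0db1b1be0a8" \
  -d '{"model": "facebook/opt-125m", "prompt": "What is love?", "max_tokens": 18, "temperature": 0}' \
  | jq -r '.choices[0].text'
```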
These steps remain the same for any private model you want to deploy using FlexAI; you just need to replace the model name in the `--model` argument with the name of the model you want to deploy.
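For example, deploying another gated model changes only the endpoint name and the `--model` value. In the sketch below the model name is purely illustrative; use a model your Hugging Face account has been granted access to:

```bash
# Same command shape as above, pointing at a different gated model
# (mistralai/Mistral-7B-Instruct-v0.3 is an illustrative example)
flexai inference serve my-mistral \
  --hf-token-secret hf-access-token \
  -- --model=mistralai/Mistral-7B-Instruct-v0.3
```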
## Inference Playground

After a successful deployment, you can run the `flexai inference inspect` command to get the URL of your FlexAI-hosted "Playground" environment. This is a UI based on Chainlit (https://chainlit.io/) that allows you to interact with your deployed model in a user-friendly way.
```bash
flexai inference inspect fbopt125
```

```
metadata:
  name: fbopt125
  id: ...
  creatorUserID: ...
  ownerOrgID: ...
config:
  device: nvidia
  accelerator: 1
  apiKeySecretName: fbopt125-api-key
  endpointUrl: https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai
  playgroundUrl: https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-playgro-d41b1b44.platform.flex.ai
// ...
```
When you follow the `playgroundUrl`, you will get a Chainlit UI where you can interact with your deployed model. You can use this UI to test your model with various types of inputs (images, videos, files, or just text, depending on the model).
