
Deploying an Inference Endpoint for a private model

Deploying an Inference Endpoint for a model on FlexAI is straightforward: in the simplest case, all you need to provide is the name of the model you want to deploy.

However, to deploy a private model from Hugging Face, or a gated model that requires you to accept its terms and conditions, you will need to provide a Hugging Face Access Token, which you can create via the Hugging Face website.

This quickstart tutorial guides you through generating a Hugging Face Access Token and using it to create an Inference Endpoint for the facebook/opt-125m model hosted on Hugging Face 🔗.

  1. Create a Hugging Face Access Token via the Hugging Face website 🔗.

  2. Store your Hugging Face Access Token using the FlexAI Secret Manager.

    flexai secret create hf-access-token
  3. Create an Inference Endpoint for the model you want to deploy.

    flexai inference serve fbopt125 \
      --hf-token-secret hf-access-token \
      -- --model=facebook/opt-125m
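
    Everything after the standalone -- separator is forwarded to the underlying inference engine. As a sketch, assuming a vLLM-compatible engine (the response format shown later in this guide suggests one), you could also forward engine flags there; the extra flag below is illustrative and may not apply to your runtime:

    # Assumption: the engine behind the endpoint accepts vLLM-style flags.
    # --max-model-len caps the context length; treat it as illustrative only.
    flexai inference serve fbopt125 \
      --hf-token-secret hf-access-token \
      -- --model=facebook/opt-125m --max-model-len=2048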
  4. Store the API Key you’re prompted with. You will need it to authenticate your requests to the Inference Endpoint.

    flexai inference serve fbopt125...
    Inference "fbopt125" started successfully
    Here is your API Key. Please store it safely: 0ad394db-8dba-4395-84d9-c0db1b1be0a8. You can reference it with the name: fbopt125-api-key
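
    As a shell convenience (not a FlexAI requirement), you can export the key as an environment variable so you don't have to paste it into every request; the variable name below is arbitrary:

    # Keep the key out of your shell history if your environment requires it.
    export FBOPT125_API_KEY="0ad394db-8dba-4395-84d9-c0db1b1be0a8"

    You can then authenticate with -H "Authorization: Bearer $FBOPT125_API_KEY" in the curl requests later in this guide.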
  5. Run flexai inference list to follow the status of your Inference Endpoint and get its URL.

    flexai inference list
    NAME     │ STATUS   │ AGE │ ENDPOINT
    ─────────┼──────────┼─────┼───────────────────────────────────────────────────────────────────────────────────
    fbopt125 │ starting │ 12s │ https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai
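
    The endpoint takes a moment to leave the starting status. If the standard watch utility is available on your system, you can poll the status until the endpoint is up:

    # Re-run the listing every 5 seconds until the status changes.
    watch -n 5 flexai inference list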
  6. Use the Inference Endpoint URL to make requests to your model.

    curl -X POST https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai/v1/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer 0ad394db-8dba-4395-84d9-c0db1b1be0a8" \
      -d '{
        "model": "facebook/opt-125m",
        "prompt": "What is love?",
        "max_tokens": 18,
        "temperature": 0
      }'

    Notice that you can add further sampling parameters to the request, such as temperature, top_p, and top_k, to customize the behavior of the model during inference; see the illustrative example after the response below.

    curl response (rendered as JSON using jq)
    {
      "id": "cmpl-a9bad9be-e3c3-4999-bd5c-5d64490caa87",
      "object": "text_completion",
      "created": 1748238500,
      "model": "facebook/opt-125m",
      "choices": [
        {
          "index": 0,
          "text": "\nLove is the feeling of being loved.",
          "logprobs": null,
          "finish_reason": "length",
          "stop_reason": null,
          "prompt_logprobs": null
        }
      ],
      "usage": {
        "prompt_tokens": 5,
        "total_tokens": 14,
        "completion_tokens": 9,
        "prompt_tokens_details": null
      }
    }
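
    As an illustration of the point above, here is the same request with a few extra sampling parameters. Which parameters are accepted depends on the model and the inference engine serving it:

    curl -X POST https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai/v1/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer 0ad394db-8dba-4395-84d9-c0db1b1be0a8" \
      -d '{
        "model": "facebook/opt-125m",
        "prompt": "What is love?",
        "max_tokens": 18,
        "temperature": 0.7,
        "top_p": 0.9,
        "top_k": 40
      }'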

These steps remain the same for any private or gated model you want to deploy using FlexAI; just replace the value of the --model argument with the name of the model you want to deploy.
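
For example, deploying another private or gated model would look like the following; the endpoint and model names below are placeholders to replace with your own:

flexai inference serve my-endpoint \
  --hf-token-secret hf-access-token \
  -- --model=<organization>/<private-or-gated-model>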

After a successful deployment, you can run the flexai inference inspect command to get the URL of your FlexAI-hosted “Playground” environment. This is a UI based on Chainlit (https://chainlit.io/ 🔗) that allows you to interact with your deployed model in a user-friendly way.

flexai inference inspect fbopt125
metadata:
  name: fbopt125
  id: ...
  creatorUserID: ...
  ownerOrgID: ...
config:
  device: nvidia
  accelerator: 1
apiKeySecretName: fbopt125-api-key
endpointUrl: https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai
playgroundUrl: https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-playgro-d41b1b44.platform.flex.ai
# ...
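
Because the inspect output is plain text, you can extract just the URLs with standard shell tooling, for example:

# Print only the endpoint and playground URL lines from the inspect output.
flexai inference inspect fbopt125 | grep -E 'endpointUrl|playgroundUrl'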

When you follow the playgroundUrl, you will get a Chainlit UI where you can interact with your deployed model. You can use this UI to test your model with various types of inputs (images, videos, files, or just text, depending on the model).

[Screenshot: the Playground's 'Sign in with GitHub' page]