
Deploying an Inference Endpoint

Deploying an Inference Endpoint for a model on FlexAI is straightforward: at its simplest, all you need to provide is the name of the model you want to deploy.

This quickstart tutorial will walk you through the steps to deploy an Inference Endpoint for the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model hosted on Hugging Face.

  1. Create an Inference endpoint for the model you want to deploy.

    flexai inference serve tiny-llama \
    --hf-token-secret hf-access-token \
    -- --model=TinyLlama/TinyLlama-1.1B-Chat-v1.0
  2. Store the API Key you’re prompted with. You will need it to authenticate your requests to the Inference Endpoint.

    flexai inference serve tiny-llama...
    Inference "tiny-llama" started successfully
    Here is your API Key. Please store it safely: 0ad394db-8dba-4395-84d9-c0db1b1be0a8. You can reference it with the name: tiny-llama-api-key
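One convenient way to keep the key out of later commands is to store it in an environment variable. A minimal sketch; the variable name FLEXAI_API_KEY is just a convention chosen for this tutorial, not something the CLI requires:

```shell
# Store the API key from step 2 in an environment variable
# (FLEXAI_API_KEY is an arbitrary name used in this tutorial)
export FLEXAI_API_KEY="0ad394db-8dba-4395-84d9-c0db1b1be0a8"

# Later requests can then reference it instead of pasting the key inline
echo "Authorization: Bearer $FLEXAI_API_KEY"
```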
  3. Run flexai inference list to follow the status of your Inference Endpoint and get its URL.

    flexai inference list
    NAME β”‚ STATUS β”‚ AGE β”‚ ENDPOINT
    ───────────┼──────────────────┼─────┼──────────────────────────────────────────────────────────────────────────────────
    tiny-llama β”‚ starting β”‚ 32s β”‚ https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai
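If you want to script against the endpoint, you can pull the URL out of the table output. A minimal sketch, assuming the column layout shown above; here a hard-coded sample row stands in for the live `flexai inference list` output:

```shell
# Sample row standing in for a line of `flexai inference list` output
SAMPLE='tiny-llama β”‚ ready β”‚ 2m β”‚ https://inference-example.platform.flex.ai'

# The endpoint URL is the only https:// token on the row
ENDPOINT=$(printf '%s\n' "$SAMPLE" | grep -o 'https://[^ ]*')
echo "$ENDPOINT"
```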
  4. Use the Inference Endpoint URL to make requests to your model.

    curl -X POST https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer 0ad394db-8dba-4395-84d9-c0db1b1be0a8" \
      -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "What is love?"}],
        "max_tokens": 256
      }'
    curl response (formatted with jq)
    {
      "id": "chatcmpl-e6c2d199-b57d-4904-a1cb-89d604b146d9",
      "object": "chat.completion",
      "created": 1748238665,
      "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "reasoning_content": null,
            "content": "Love is a powerful and meaningful emotion that brings people close together. It is a connection between two individuals that transcends time and space, supporting and nurturing a deep and deepening relationship. Love is the foundation that holds relationships together, reflecting in the happiness, relationships, and harmony surrounding us in every aspect of life. To understand love and its complexities, one would need to delve deep into the arena of psychology, philosophy, anthropology, literature, and social sciences. It is an elusive but natural human reaction, serving as an enabling force for the bonding of our species, and it is an integral aspect of human nature.",
            "tool_calls": []
          },
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null
        }
      ],
      "usage": {
        "prompt_tokens": 20,
        "total_tokens": 162,
        "completion_tokens": 142,
        "prompt_tokens_details": null
      },
      "prompt_logprobs": null,
      "kv_transfer_params": null
    }
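Because the endpoint speaks the OpenAI chat-completions wire format, the response can be handled with any standard JSON tooling. A minimal sketch in Python, parsing an abridged copy of the response shown above to extract the assistant's reply and the token usage:

```python
import json

# Abridged copy of the chat.completion response shown above
response_body = """
{
  "object": "chat.completion",
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Love is a powerful and meaningful emotion..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 20, "completion_tokens": 142, "total_tokens": 162}
}
"""

data = json.loads(response_body)

# The assistant's reply lives at choices[0].message.content
answer = data["choices"][0]["message"]["content"]
usage = data["usage"]

print(answer)
print(f"tokens used: {usage['total_tokens']}")
```

The same two lookups work on a live response body returned by the curl command in step 4.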
  5. Enjoy your AI-powered application!

After a successful deployment, you can run the flexai inference inspect command to get the URL of your FlexAI-hosted β€œPlayground” environment. This is a Chainlit (https://chainlit.io/) based UI that lets you interact with your deployed model in a user-friendly way.

flexai inference inspect tiny-llama
metadata:
name: tiny-llama
id: ...
creatorUserID: ...
ownerOrgID: ...
config:
device: nvidia
accelerator: 1
apiKeySecretName: tiny-llama-api-key
endpointUrl: https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-d41b1b44.platform.flex.ai
playgroundUrl: https://inference-e0ed242b-dbd3-423e-b525-76beac222d44-playgro-d41b1b44.platform.flex.ai
// ...

When you follow the playgroundUrl, you will get a Chainlit UI where you can interact with your deployed model. You can use this UI to test your model with various types of inputs (images, videos, files, or just text, depending on the model).

[Screenshot of the 'Sign in with GitHub' page]