Deploying an Inference Endpoint

We’ll get started by deploying a model from the Hugging Face Hub that does not require a Hugging Face Access Token, which lets us focus on the deployment process itself for now. We’ll be using https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0.

  1. Log into https://console.flex.ai using your FlexAI account credentials.

  2. Navigate to the Inference section from either the navigation bar or the card on the home page.

  3. A drawer containing the creation form will open automatically. You can also open it at any time by selecting the New button.

The Launch Inference form consists of a set of required and optional fields that you can use to customize your deployment.

To deploy an Inference Endpoint for a model that does not require a Hugging Face Access Token, such as TinyLlama/TinyLlama-1.1B-Chat-v1.0, you need to fill in the following required fields:

  • Name: A unique name for your inference endpoint. This will be used to identify your endpoint in the FlexAI console. It must follow the FlexAI resource naming conventions.
  • Hugging Face Model: The name of the model to deploy. In this case, it will be TinyLlama/TinyLlama-1.1B-Chat-v1.0.
  • Cluster: The cluster where the Inference Endpoint will run. It can be selected from a dropdown list of the clusters available in your FlexAI account.
Field              | Value
Name               | quickstart-inference-tinyLlama
Hugging Face Model | TinyLlama/TinyLlama-1.1B-Chat-v1.0
Cluster            | Your organization’s designated cluster

There are a few optional fields that you can use to customize your deployment:

  • Hugging Face Token: This is required if the model you want to deploy is private or requires authentication. Here you can select a FlexAI Secret where you’ve stored your Hugging Face Access Token.

  • API Key: A secret key that will be used to authenticate requests to your Inference endpoint. If left empty, a random API Key will be generated and will be displayed to you after you initiate the deployment process. Make sure to copy it and store it in a safe place, as you will not be able to see it again.

  • vLLM Parameters: A set of arguments that will be passed to the vLLM server. You can use this to customize the server’s behavior, for example the maximum model context length, the data type, or GPU memory utilization, as shown in the example below.
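
As a minimal sketch, the vLLM Parameters field could contain flags such as the ones below. The accepted flags depend on your vLLM version, so check the vLLM documentation before using them:

    --max-model-len 2048 --dtype bfloat16 --gpu-memory-utilization 0.90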

After filling out the form, select the Submit button to start the Inference Endpoint deployment.

You should get a confirmation window displaying the details of your Inference Endpoint, including the API Key that needs to be used to authenticate requests to the endpoint.
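
As a rough illustration of how the API Key is used, the sketch below assumes the endpoint exposes vLLM’s OpenAI-compatible chat completions route and accepts the key as a bearer token. The URL is a hypothetical placeholder; the actual endpoint address comes from the confirmation window, and the exact request format is covered in the next step of this tutorial.

    import requests

    # Hypothetical placeholders: substitute the endpoint URL shown in the FlexAI
    # console and the API Key displayed in the confirmation window.
    ENDPOINT_URL = "https://<your-endpoint>/v1/chat/completions"
    API_KEY = "<your-api-key>"

    # Assumes the endpoint speaks vLLM's OpenAI-compatible API and expects the
    # API Key as a bearer token; adjust if your deployment differs.
    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    print(response.json())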

After a few minutes, your Inference Endpoint should be up and running. In the next step of this Quickstart Tutorial, you will learn how to use it to make requests to the deployed model.