Fine-tuning a model with FlexAI

This Quickstart tutorial will walk you through the steps needed to train a model on FlexAI.

By the end of this tutorial, you will have successfully trained a version of nanoGPT that has been optimized to make it readily usable. This enhanced version is available at: https://github.com/flexaihq/nanoGPT.

The process of Fine-tuning a model from a starting point (a Base model or a Checkpoint) consists of 4 main steps:

You should have a FlexAI account. If you don’t have one, you can sign up for a free account.

1. Creating a Dataset
A Dataset is the collection of files that you want to use to train your model. You can upload files from your local machine or sync them in from a remote location hosted by a third-party Storage Provider (e.g., S3, GCS, R2, etc.). The Dataset files can be in any format, such as text, images, or audio.
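As a sketch, creating a Dataset from local files might look like the following. The `flexai dataset push` command name, flags, and Dataset name are assumptions for illustration; check the FlexAI CLI reference for the exact syntax.

```shell
# Hypothetical: register a local directory as a Dataset named "openwebtext-sample".
# Exact command and flag names may differ in your FlexAI CLI version.
flexai dataset push openwebtext-sample --file ./data/openwebtext/
```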

2. Creating a Fine-tuning Job
A Fine-tuning Job is the name assigned to the FlexAI component that represents the process of executing training code on the FlexAI platform.

Creating a Fine-tuning Job requires the following 5 things:

  1. A Name that describes your Fine-tuning Job
  2. At least one Dataset that will be used to Fine-tune your model
  3. A link to a GitHub repository with the Fine-tuning code
  4. A Checkpoint name or ID from which to start the Fine-tuning Job
  5. The path to the entry point script (the file that contains the code that will be executed when the Fine-tuning Job begins)
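Putting the five required inputs together, a Fine-tuning Job creation command might be sketched as follows. The `flexai training run` subcommand, its flag names, and the Job/Dataset names are assumptions, not confirmed syntax; consult the FlexAI CLI reference for the actual invocation.

```shell
# Hypothetical sketch: the five required inputs mapped onto CLI arguments.
flexai training run my-nanogpt-finetune \          # 1. Job name
  --dataset openwebtext-sample \                   # 2. Dataset to fine-tune on
  --repository-url https://github.com/flexaihq/nanoGPT \  # 3. GitHub repo with the code
  --checkpoint <checkpoint-name-or-id> \           # 4. starting Checkpoint
  -- python train.py                               # 5. entry point script
```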

A number of optional flags allow for the customization of a Fine-tuning Job’s execution, such as:

  • Specifying how many accelerators (GPUs) to use and across how many nodes.
  • Setting Environment Variables and Secrets that will then be passed to the Training Runtime so they can be used by the training scripts.
  • Specifying a previous checkpoint to resume execution from.
  • Setting a specific revision (branch, tag, or commit) of the code repository to use.

Additionally, any number of Hyperparameters can be specified and passed to the entry point script. These Hyperparameters can be used to control the behavior of the training code, such as the learning rate, batch size, or number of epochs.
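The optional flags and Hyperparameters above might combine as in the following sketch. All flag names (`--accels`, `--nodes`, `--env`, `--secret`, `--repository-revision`) and the Hyperparameter names are illustrative assumptions; anything after the `--` separator is passed through to the entry point script.

```shell
# Hypothetical: optional execution flags plus pass-through Hyperparameters.
flexai training run my-nanogpt-finetune \
  --dataset openwebtext-sample \
  --repository-url https://github.com/flexaihq/nanoGPT \
  --repository-revision main \       # pin a branch, tag, or commit
  --accels 8 --nodes 1 \             # accelerator (GPU) and node count
  --env WANDB_MODE=offline \         # Environment Variable for the Training Runtime
  --secret HF_TOKEN \                # Secret made available to the training scripts
  -- python train.py --learning_rate=6e-4 --batch_size=12 --max_iters=5000
```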

3. Getting the Fine-tuning Job’s details

Once the Fine-tuning Job is running, you can monitor its progress, view logs, and evaluate its performance and resource usage through the FlexAI CLI, FlexAI’s hosted TensorBoard, and FlexAI’s Dashboard UI.
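From the CLI, monitoring might look like the sketch below. The `inspect` and `logs` subcommand names are assumptions; the Dashboard UI and hosted TensorBoard expose the same information in the browser.

```shell
# Hypothetical monitoring commands (exact subcommand names may vary).
flexai training inspect my-nanogpt-finetune   # status, configuration, resource usage
flexai training logs my-nanogpt-finetune      # stream the Job's training logs
```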

4. Fetching the Fine-tuning Job’s output

The output of a Fine-tuning Job is the result of the training process, which can include:

  • Checkpoints: Saved states of the model at different points during training.
  • Files written to the /output directory by the training scripts, such as binaries, logs, metrics, or any other files that you want to keep as a result of a successful Fine-tuning Job completion.
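Fetching those outputs from the CLI might be sketched as follows. The `checkpoints` and `fetch` subcommand names are assumptions for illustration; the key point is that both saved Checkpoints and files written to `/output` can be listed and downloaded once the Job completes.

```shell
# Hypothetical: list and download the artifacts produced by a completed Job.
flexai training checkpoints my-nanogpt-finetune   # list Checkpoints saved during training
flexai checkpoint fetch <checkpoint-id>           # download a Checkpoint locally
```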