Overview
This Quickstart tutorial walks you through the steps needed to train a model on FlexAI. By the end of this tutorial, you will have trained a version of nanoGPT that has been optimized to make it readily usable. This enhanced version is available at: https://github.com/flexaihq/nanoGPT. Fine-tuning a model from a starting point (a Base model or a Checkpoint) consists of 4 main steps, described below.
Prerequisites
You should have a FlexAI account. If you don’t have one, you can sign up for a free account.
1. Loading a Dataset
A Dataset is the collection of files that you want to use to train your model. You can upload files from your local machine or sync them from a remote location when they are hosted by a third-party Storage Provider (e.g., S3, GCS, or R2). The Dataset files can be in any format, such as text, images, or audio.
2. Running a Fine-tuning Job
A Fine-tuning Job is the FlexAI component that represents the execution of training code on the FlexAI platform. Creating a Fine-tuning Job requires the following 5 things:
- A Name that describes your Fine-tuning Job
- At least one Dataset that will be used to Fine-tune your model
- A link to a GitHub repository with the Fine-tuning code
- The name or ID of a Checkpoint to start the Fine-tuning Job from
- The path to the entry point script (the file that contains the code that will be executed when the Fine-tuning Job begins)
Optionally, you can further configure a Fine-tuning Job by:
- Specifying how many accelerators (GPUs) to use and across how many nodes
- Setting Environment Variables and Secrets that will be passed to the Training Runtime so they can be used by the training scripts
- Specifying a previous checkpoint to resume execution from
- Setting a specific revision (branch, tag, or commit) of the code repository to use
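To make the settings listed above easier to picture, here is a minimal Python sketch that models them as a plain data structure. It is not the FlexAI API or CLI: the class name, field names, defaults, and example values are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FineTuningJobSpec:
    """Hypothetical container for Fine-tuning Job settings (not a FlexAI API)."""

    # The five required settings.
    name: str
    datasets: List[str]
    repository_url: str
    checkpoint: str
    entry_point: str
    # Optional settings; these defaults are illustrative assumptions.
    accelerators: int = 1
    nodes: int = 1
    env: Dict[str, str] = field(default_factory=dict)
    secrets: Dict[str, str] = field(default_factory=dict)
    resume_from: Optional[str] = None
    revision: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of required settings that are empty."""
        required = {
            "name": self.name,
            "datasets": self.datasets,
            "repository_url": self.repository_url,
            "checkpoint": self.checkpoint,
            "entry_point": self.entry_point,
        }
        return [key for key, value in required.items() if not value]


spec = FineTuningJobSpec(
    name="quickstart-nanogpt",   # hypothetical job name
    datasets=["my-dataset"],     # hypothetical Dataset name
    repository_url="https://github.com/flexaihq/nanoGPT",
    checkpoint="my-checkpoint",  # hypothetical Checkpoint name
    entry_point="train.py",      # assumed entry point script
)
print(spec.missing_fields())  # an empty list means nothing required is missing
```

Keeping the required settings as positional fields and the optional ones as defaulted fields mirrors the split between what a Fine-tuning Job needs and what it merely allows.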
3. Getting the Fine-tuning Job’s details
Once the Fine-tuning Job is running, you can monitor its progress, view its logs, and evaluate its performance and resource usage through the FlexAI CLI, FlexAI’s hosted TensorBoard, and FlexAI’s Dashboard UI.
4. Fetching the Fine-tuning Job’s output
The output of a Fine-tuning Job is the result of the training process, which can include:
- Checkpoints: Saved states of the model at different points during training.
- Files written to the /output directory by the training scripts, such as binaries, logs, metrics, or any other files that you want to keep after the Fine-tuning Job completes successfully.
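As an illustration of how a training script might place files in the /output directory, the following Python sketch writes a small metrics file. The helper name save_metrics, the file name metrics.json, and the metric value are hypothetical; only the /output location comes from the description above. The demo at the bottom writes to a temporary directory so that it can run on any machine.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def save_metrics(metrics: Dict[str, float], output_dir: str = "/output") -> Path:
    """Write metrics as JSON into output_dir and return the path of the file."""
    target = Path(output_dir)
    # Create the directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    path = target / "metrics.json"  # hypothetical file name
    path.write_text(json.dumps(metrics, indent=2))
    return path


# Local demo: use a temporary directory instead of /output.
with tempfile.TemporaryDirectory() as tmp:
    written = save_metrics({"train_loss": 1.23}, output_dir=tmp)  # placeholder value
    print(written.name)
```

Inside a training script, calling save_metrics without the output_dir argument would target /output, so the file would be among those kept once the Fine-tuning Job completes.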