Running a Training Job

Overview

This Quickstart tutorial will walk you through the steps needed to train a model on FlexAI.

By the end of this tutorial, you will have trained a version of nanoGPT that has been modified to make it readily usable. This version is available at: https://github.com/flexaihq/nanoGPT.

Training a model consists of four main steps:

1. Loading a Dataset

A Dataset is the collection of files that you want to use to train your model. You can upload files from your local machine or sync them from a remote location hosted by a third-party Storage Provider (e.g., S3, GCS, or R2). The Dataset files can be in any format, such as text, images, or audio.

2. Running a Training Job

A Training Job is the FlexAI component that represents the execution of training code on the FlexAI platform.

Creating a Training Job requires a name, at least one Dataset, a link to a GitHub repository with the training code, and the path to the entry point script (the file whose code runs when the Training Job starts).
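To make the four required inputs concrete, here is a minimal Python sketch that collects them and checks that none is missing. It is only an illustration, not part of the FlexAI CLI or any FlexAI API: the `TrainingJobSpec` class, its field names, and the example values `quickstart-job`, `my-dataset`, and `train.py` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingJobSpec:
    """Hypothetical container for the required Training Job inputs."""

    name: str
    datasets: List[str]
    repository_url: str
    entry_point: str

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means nothing is missing."""
        problems = []
        if not self.name.strip():
            problems.append("a name is required")
        if not self.datasets:
            problems.append("at least one Dataset is required")
        if not self.repository_url.startswith("https://github.com/"):
            problems.append("the repository must be a GitHub link")
        if not self.entry_point.strip():
            problems.append("the entry point script path is required")
        return problems


# Example values are placeholders chosen for illustration only.
spec = TrainingJobSpec(
    name="quickstart-job",
    datasets=["my-dataset"],
    repository_url="https://github.com/flexaihq/nanoGPT",
    entry_point="train.py",
)
print(spec.validate())  # an empty list means every required input is present
```

Running a check like this before you submit a job can surface a missing input early, but the platform's own validation remains the authority on what is accepted.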

Several optional flags let you customize a Training Job’s execution, such as:

Specifying how many accelerators (GPUs) to use and across how many nodes.
Setting Environment Variables and Secrets that will then be passed to the Training Runtime so they can be used by the training scripts.
Specifying a previous checkpoint to resume execution from.
Setting a specific revision (branch, tag, or commit) of the code repository to use.

Additionally, any number of Hyperparameters can be specified and passed to the entry point script. These Hyperparameters can be used to control the behavior of the training code, such as the learning rate, batch size, or number of epochs.
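As an illustration of how an entry point script might receive Hyperparameters, the sketch below parses them with Python's standard `argparse` module. It assumes the Hyperparameters reach the script as command-line arguments. The flag names `--learning_rate`, `--batch_size`, and `--epochs` and their defaults are hypothetical, so use whatever names your own training code defines.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse illustrative Hyperparameters from a list of command-line arguments.

    When argv is None, argparse falls back to the arguments the script
    was started with (sys.argv[1:]).
    """
    parser = argparse.ArgumentParser(description="Illustrative training entry point")
    # Flag names and defaults are assumptions made for this example.
    parser.add_argument("--learning_rate", type=float, default=3e-4)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


# Simulate the arguments an entry point script could be started with.
args = parse_hyperparameters(["--learning_rate", "0.001", "--batch_size", "16"])
print(args.learning_rate, args.batch_size, args.epochs)
```

Giving every Hyperparameter a default keeps the script runnable when a value is omitted, which makes it easier to test locally before launching a Training Job.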

3. Getting the Training Job’s details

Once the Training Job is running, you can monitor its progress, view its logs, and evaluate its performance and resource usage through the FlexAI CLI, FlexAI’s hosted TensorBoard, and FlexAI’s Dashboard UI.

4. Fetching the Training Job’s output

The output of a Training Job is the result of the training process, which can include:

Checkpoints: Saved states of the model at different points during training.
Files written to the /output directory by the training scripts, such as binaries, logs, metrics, or any other files that you want to keep after the Training Job completes successfully.
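For example, a training script could persist its final metrics to the output directory with a small helper like the one sketched below. The function name `save_metrics`, the file name `metrics.json`, and the metric keys are hypothetical. The only detail taken from this tutorial is the /output path, which is exposed as a parameter so you can point it at a temporary directory when testing locally.

```python
import json
from pathlib import Path


def save_metrics(metrics, output_dir="/output", filename="metrics.json"):
    """Write a metrics dictionary as JSON into the output directory.

    The directory is created if it does not exist yet. Returns the path
    of the file that was written.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / filename
    path.write_text(json.dumps(metrics, indent=2))
    return path


# Inside a training script, a call could look like this (values are made up):
# save_metrics({"final_loss": 1.23, "epochs": 1})
```

Writing results through one helper keeps the output location in a single place, so switching between a local folder and /output is a one-argument change.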

Prerequisites

You should have the FlexAI CLI installed on your system. If you haven’t done so yet, please follow the installation instructions.