Quickstart Tutorial

Overview

This Quickstart tutorial will walk you through the steps needed to train a model on FlexAI Cloud Services.

By the end of this tutorial, you will have successfully trained a version of nanoGPT that has been optimized to make it readily usable. This enhanced version is available at: https://github.com/flexaihq/nanoGPT.

Training a model consists of two main steps:

1. Loading a Dataset

A Dataset is the collection of files that you want to use to train your model. You can upload files from your local machine or sync them from a remote location hosted by a third-party Storage Provider (e.g., S3, GCS, or R2). The Dataset files can be in any format, such as text, images, or audio.

2. Running a Training Job

A Training Job is the FlexAI component that represents the execution of training code on the FlexAI platform.

Creating a Training Job requires a name, at least one Dataset, a link to a GitHub repository with the training code, and the path to the entry point script (the file containing the code that runs when the Training Job starts).

A number of optional flags allow for the customization of a Training Job's execution, such as:

  • Specifying how many accelerators (GPUs) to use and across how many nodes.
  • Setting Environment Variables and Secrets that will then be passed to the Training Runtime so they can be used by the training scripts.
  • Specifying a previous checkpoint to resume execution from.
  • Setting a specific revision (branch, tag, or commit) of the code repository to use.
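As a way to keep track of the required and optional inputs listed above, the following minimal Python sketch models them as a checklist. It is not part of the FlexAI CLI or any FlexAI API: the class name `TrainingJobSpec`, its field names, and the `missing_required` helper are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrainingJobSpec:
    """Hypothetical checklist of Training Job inputs (not a FlexAI API)."""

    # Required inputs, as described in this tutorial.
    name: str
    datasets: List[str]
    repository_url: str
    entry_point: str

    # Optional customizations, as described in this tutorial.
    accelerators: Optional[int] = None
    nodes: Optional[int] = None
    env: Dict[str, str] = field(default_factory=dict)
    secrets: Dict[str, str] = field(default_factory=dict)
    checkpoint: Optional[str] = None
    revision: Optional[str] = None

    def missing_required(self) -> List[str]:
        """Return the names of required inputs that are empty."""
        required = {
            "name": self.name,
            "datasets": self.datasets,
            "repository_url": self.repository_url,
            "entry_point": self.entry_point,
        }
        return [key for key, value in required.items() if not value]


# Illustrative values only; the Dataset list is left empty on purpose.
spec = TrainingJobSpec(
    name="quickstart",
    datasets=[],
    repository_url="https://github.com/flexaihq/nanoGPT",
    entry_point="train.py",
)
print(spec.missing_required())
```

Because at least one Dataset is required, the empty `datasets` list is reported as missing, while the optional fields can stay at their defaults.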

Additionally, any number of Hyperparameters can be specified and passed to the entry point script. These Hyperparameters control the behavior of the training code, such as the learning rate, batch size, or number of epochs.
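To illustrate how an entry point script might consume Hyperparameters, here is a minimal sketch that assumes they arrive as command-line arguments. The flag names (`--learning_rate`, `--batch_size`, `--epochs`) and their default values are hypothetical; your training code may use different names or a different parsing mechanism altogether.

```python
import argparse
from typing import List, Optional


def parse_hyperparameters(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse hypothetical hyperparameter flags.

    When argv is None, argparse reads the process's command-line arguments.
    """
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    # Flag names and defaults are assumptions made for this example.
    parser.add_argument("--learning_rate", type=float, default=6e-4)
    parser.add_argument("--batch_size", type=int, default=12)
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


# Simulate the arguments the script could receive when the job starts.
args = parse_hyperparameters(["--learning_rate", "3e-4", "--batch_size", "32"])
print(args.learning_rate, args.batch_size, args.epochs)
```

Any flag that is not supplied falls back to its default, so a job can override only the values it needs to change.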

Prerequisites

You should have the FlexAI CLI installed on your system. If you haven't done so yet, please follow the installation instructions.