Creating a Training Job
With a Dataset available on your FlexAI account, you can now run a Training Job using the `flexai training run` command.
A Training Job requires at least a name, a Dataset name, a link to a GitHub repository with the training code, the revision (branch name) where the code resides, and the path to the entry point script that will initiate the Training Job.
In addition, the entry point script can be followed by any arguments you want to pass to it, such as configurations and Hyperparameters.
The Model’s repository
For this tutorial we will use the FlexAI fork of the nanoGPT repository originally created by Andrej Karpathy: https://github.com/flexaihq/nanogpt.
nanoGPT’s entry point script path is `train.py`, and it expects a path to a configuration file as well as a few Hyperparameters to run the Training Job.
Entry Point script arguments
The entry point script for this Training Job is `train.py`, and it expects the following arguments:
- `config/train_shakespeare_char.py`: A configuration file, which contains the default Training Parameters.
- `--dataset_dir`: The path within the `/input` directory of the Training Runtime where the Dataset files are located.
- `--out_dir`: The output directory, which will be mounted into the Training Runtime as `/output-checkpoint`.
- `--max_iters`: The maximum number of iterations to run the training script for (optional).
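The argument list above mixes one positional configuration path with several `--key=value` overrides. The following is a minimal sketch of how a script could separate the two, assuming the convention that any argument without an `=` is a configuration file. The function name `parse_entry_point_args` is hypothetical and this is not presented as nanoGPT's actual implementation.

```python
import ast


def parse_entry_point_args(argv):
    """Split arguments into config-file paths and --key=value overrides.

    Assumption for this sketch: an argument without "=" is a config file,
    and every override is written as --key=value.
    """
    config_files = []
    overrides = {}
    for arg in argv:
        if "=" not in arg:
            # No "=": treat it as a positional configuration file path.
            config_files.append(arg)
            continue
        if not arg.startswith("--"):
            raise ValueError(f"expected --key=value, got {arg!r}")
        key, raw_value = arg[2:].split("=", 1)
        try:
            # Turn "1500" into 1500, "True" into True, and so on.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Plain strings such as paths are kept as they are.
            value = raw_value
        overrides[key] = value
    return config_files, overrides


tutorial_args = [
    "config/train_shakespeare_char.py",
    "--dataset_dir=my_dataset",
    "--out_dir=/output-checkpoint",
    "--max_iters=1500",
]
config_files, overrides = parse_entry_point_args(tutorial_args)
print(config_files)
print(overrides)
```

With the tutorial's arguments, `max_iters` is converted to an integer while the two paths remain strings.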
Entry Point script arguments details
These include any Environment Settings and Hyperparameters the training script may require. For this tutorial:
Parameter | Type | Description |
---|---|---|
config/train_shakespeare_char.py | Environment Setting | A positional argument pointing to a configuration file used by nanoGPT’s train.py script to set the default Training Parameters |
--out_dir=/output-checkpoint | Environment Setting | The output directory where the training script will write checkpoint files. In order to take advantage of FlexAI’s Managed Checkpoints feature, this should always be /output-checkpoint |
--max_iters=1500 | Hyperparameter | The maximum number of iterations to run the training script for. This is an optional hyperparameter that can be used to tweak the Training Job execution |
FlexAI’s `training run` Command
Putting it all together, the command to run the Training Job looks like this:
flexai training run quickstart-training-job \
  --dataset nanoGPT-dataset=my_dataset \
  --repository-url https://github.com/flexaihq/nanogpt \
  -- train.py config/train_shakespeare_char.py --dataset_dir=my_dataset --out_dir=/output-checkpoint --max_iters=1500
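To make the structure of the command easier to see, here is a small sketch that assembles it as an argument list. It only uses the flags shown in this tutorial (`--dataset`, `--repository-url`, and the `--` separator). The helper name `build_training_run_command` is hypothetical, and the sketch prints the command rather than running it.

```python
import shlex


def build_training_run_command(job_name, dataset, mount_name, repo_url,
                               entry_point, script_args):
    """Assemble the tutorial's command as a list of arguments."""
    return [
        "flexai", "training", "run", job_name,
        "--dataset", f"{dataset}={mount_name}",
        "--repository-url", repo_url,
        # Everything after "--" is the entry point script and its arguments.
        "--", entry_point, *script_args,
    ]


command = build_training_run_command(
    job_name="quickstart-training-job",
    dataset="nanoGPT-dataset",
    mount_name="my_dataset",
    repo_url="https://github.com/flexaihq/nanogpt",
    entry_point="train.py",
    script_args=[
        "config/train_shakespeare_char.py",
        "--dataset_dir=my_dataset",
        "--out_dir=/output-checkpoint",
        "--max_iters=1500",
    ],
)
print(shlex.join(command))
```

Keeping the command as a list makes the boundary explicit: the items before `--` are handled by `flexai training run`, and the items after it are passed to `train.py`.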
Zooming into the `flexai training run` flags
These are the flags we’ll use to run the Quickstart Tutorial’s Training Job:
Argument / Flag | Value | Description |
---|---|---|
Training Job Name | quickstart-training-job | The name of the Training Job |
Dataset Name | nanoGPT-dataset=my_dataset | The name of the Dataset to use for the Training Job, followed by `=` and a custom name under which the Dataset files are mounted in the /input path: /input/my_dataset |
Repository URL | https://github.com/flexaihq/nanogpt | The URL of the GitHub repository containing the training code |
Entry Point Script | train.py | The path of the entry point training script within the repository |
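The Dataset Name row describes a `name=mount` value whose second half decides where the files appear under `/input`. As an illustration only, the sketch below derives that path from the value used in this tutorial. The function name `dataset_mount_path` is hypothetical, and the sketch deliberately covers only the `name=mount` form shown here.

```python
import posixpath

INPUT_ROOT = "/input"  # mount root named in this tutorial


def dataset_mount_path(dataset_flag_value):
    """Return the /input path for a value such as "nanoGPT-dataset=my_dataset"."""
    dataset_name, separator, mount_name = dataset_flag_value.partition("=")
    if not separator or not dataset_name or not mount_name:
        # Other forms of the value are outside the scope of this sketch.
        raise ValueError("expected a value of the form <dataset>=<mount name>")
    return posixpath.join(INPUT_ROOT, mount_name)


print(dataset_mount_path("nanoGPT-dataset=my_dataset"))
```

This is why the entry point script receives `--dataset_dir=my_dataset`: the value matches the mount name chosen in the `--dataset` flag.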
A Training Job’s Checkpoints
Once the Training Job is running, every time its code calls the `torch.save()` function, FlexAI’s Managed Checkpoints feature will automatically capture a Checkpoint and store it in the `/output-checkpoint` directory.
Each Checkpoint will be assigned a unique ID and its creation time will be recorded.
This means that you can go to a specific point in time and retrieve the state of the model at that moment, allowing you to resume training from that point or evaluate the model’s performance on a validation dataset.
After a Training Job completes, the last Checkpoint will be the one with the most recent creation timestamp.
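Since each Checkpoint carries an ID and a creation time, picking the last one comes down to finding the newest timestamp. The sketch below shows that selection on made-up records. The field names `id` and `created_at`, the IDs, and the timestamps are all hypothetical and do not reflect FlexAI's actual output format.

```python
from datetime import datetime


def latest_checkpoint(checkpoints):
    """Return the record with the most recent creation timestamp."""
    if not checkpoints:
        raise ValueError("no checkpoints to choose from")
    return max(
        checkpoints,
        key=lambda record: datetime.fromisoformat(record["created_at"]),
    )


# Hypothetical records, used for illustration only.
example_checkpoints = [
    {"id": "ckpt-a", "created_at": "2024-01-01T10:00:00"},
    {"id": "ckpt-c", "created_at": "2024-01-01T10:30:00"},
    {"id": "ckpt-b", "created_at": "2024-01-01T10:15:00"},
]
print(latest_checkpoint(example_checkpoints)["id"])
```

The same comparison, with a cutoff time, would select the Checkpoint closest to any earlier point in the run.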
Up next
Next you’ll learn how to get a Training Job’s details and monitor its progress.