
Creating a Training Job

With a Dataset available on your FlexAI account, you can now run a Training Job using the flexai training run command.

A Training Job requires at least a name, a Dataset name, a link to a GitHub repository with the training code, the revision (branch name) where the code resides, and the path to the entry point script that will initiate the Training Job.

In addition, the entry point script can be followed by any arguments you want to pass to it, such as configurations and Hyperparameters.

For this tutorial, we will use the FlexAI fork of the nanoGPT repository originally created by Andrej Karpathy: https://github.com/flexaihq/nanogpt.

nanoGPT’s entry point script is train.py. It expects a path to a configuration file, as well as a few Hyperparameters, passed as the following arguments:

  • config/train_shakespeare_char.py: A configuration file, which contains the default Training Parameters.
  • --dataset_dir: The path within the /input directory of the Training Runtime where the Dataset files are located.
  • --out_dir: The output directory, which will be mounted into the Training Runtime as /output-checkpoint.
  • --max_iters: The maximum number of iterations to run the training script for (optional).
Entry Point Script argument details

These include any Environment Settings and Hyperparameters the training script may require. For this tutorial:

Parameter | Type | Description
config/train_shakespeare_char.py | Environment Setting | A positional argument pointing to a configuration file used by nanoGPT’s train.py script to set the default Training Parameters
--dataset_dir=my_dataset | Environment Setting | The path, within the /input directory of the Training Runtime, where the Dataset files are located
--out_dir=/output-checkpoint | Environment Setting | The output directory where the training script will write checkpoint files. To take advantage of FlexAI’s Managed Checkpoints feature, this should always be /output-checkpoint
--max_iters=1500 | Hyperparameter | The maximum number of iterations to run the training script for. This optional Hyperparameter can be used to tweak the Training Job execution
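
As a rough illustration of how an entry point script could consume these arguments, here is a minimal Python sketch. Note that nanoGPT’s actual train.py applies --key=value overrides through its own lightweight configurator rather than argparse, so the parsing code below is illustrative only; the argument names simply mirror the ones used in this tutorial.

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("config", help="configuration file holding the default Training Parameters")
parser.add_argument("--dataset_dir", default="my_dataset", help="Dataset location, relative to the /input mount")
parser.add_argument("--out_dir", default="/output-checkpoint", help="directory where checkpoints will be written")
parser.add_argument("--max_iters", type=int, default=1500, help="optional cap on the number of training iterations")
args = parser.parse_args()

# Inside the Training Runtime the Dataset is mounted under /input,
# so the effective data path becomes /input/<dataset_dir>.
data_dir = os.path.join("/input", args.dataset_dir)
print(f"reading data from {data_dir}; writing checkpoints to {args.out_dir}")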

Putting it all together, the command to run the Training Job looks like this:

flexai training run quickstart-training-job \
--dataset nanoGPT-dataset=my_dataset \
--repository-url https://github.com/flexaihq/nanogpt \
-- train.py config/train_shakespeare_char.py --dataset_dir=my_dataset --out_dir=/output-checkpoint --max_iters=1500
Zooming into the flexai training run flags

These are the flags we’ll use to run the Quickstart Tutorial’s Training Job:

Argument / Flag | Value | Description
Training Job Name | quickstart-training-job | The name of the Training Job
Dataset Name (--dataset) | nanoGPT-dataset=my_dataset | The name of the Dataset to be used for the Training Job, followed by a custom name to use when mounting the Dataset files into the /input path: /input/my_dataset
Repository URL (--repository-url) | https://github.com/flexaihq/nanogpt | The URL of the GitHub repository containing the training code
Entry Point Script | train.py | The path of the entry point training script within the repository

Once the Training Job is running, every time its code calls the torch.save() function, FlexAI’s Managed Checkpoints feature will automatically capture a Checkpoint and store it in the /output-checkpoint directory.

Each Checkpoint will be assigned a unique ID and its creation time will be recorded.

This means that you can go to a specific point in time and retrieve the state of the model at that moment, allowing you to resume training from that point or evaluate the model’s performance on a validation dataset.
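
As a minimal sketch of what this looks like in training code (simplified, not nanoGPT’s actual training loop), every torch.save() call that writes into /output-checkpoint produces a Managed Checkpoint:

import os
import torch

out_dir = "/output-checkpoint"
os.makedirs(out_dir, exist_ok=True)

model = torch.nn.Linear(16, 1)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for iter_num in range(1, 1501):
    # ... forward pass, loss computation, backward pass, optimizer step ...
    if iter_num % 500 == 0:
        checkpoint = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "iter_num": iter_num,
        }
        # Each save is captured as a Checkpoint with its own ID and creation time.
        torch.save(checkpoint, os.path.join(out_dir, "ckpt.pt"))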

After a Training Job completes, the last Checkpoint will be the one with the most recent creation timestamp.

Next, you’ll learn how to get a Training Job’s details and monitor its progress.