
Creating a Training Job

With a Dataset available on your FlexAI account, you can now run a Training Job using the flexai training run command.

A Training Job requires at least a name, a Dataset name, a link to a GitHub repository with the training code, the revision (branch name) where the code resides, and the path to the entry point script that will initiate the Training Job.

In addition, the entry point script can be followed by any arguments you want to pass to it, such as configurations and Hyperparameters.

For this tutorial, we will use the FlexAI fork of the nanoGPT repository originally created by Andrej Karpathy: https://github.com/flexaihq/nanogpt.

nanoGPT’s entry point script is train.py. It expects a path to a configuration file, as well as a few Hyperparameters, passed as the following arguments:

  • config/train_shakespeare_char.py: A configuration file, which contains the default Training Parameters.
  • --dataset_dir: The path within the /input directory of the Training Runtime where the Dataset files are located.
  • --out_dir: The output directory, which will be mounted into the Training Runtime as /output-checkpoint.
  • --max_iters: The maximum number of iterations to run the training script for (optional).
Entry Point Script argument details

These include any Environment Settings and Hyperparameters the training script may require. For this tutorial:

Parameter | Type | Description
config/train_shakespeare_char.py | Environment Setting | A positional argument pointing to a configuration file used by nanoGPT’s train.py script to set the default Training Parameters
--dataset_dir=my_dataset | Environment Setting | The path, within the /input directory of the Training Runtime, where the Dataset files are located
--out_dir=/output-checkpoint | Environment Setting | The output directory where the training script will write checkpoint files. To take advantage of FlexAI’s Managed Checkpoints feature, this should always be /output-checkpoint
--max_iters=1500 | Hyperparameter | The maximum number of iterations to run the training script for. This optional Hyperparameter can be used to tweak the Training Job execution
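
As a rough illustration of how an entry point script could consume these arguments, here is a minimal Python sketch. Note that nanoGPT’s actual train.py applies --key=value overrides through its own lightweight configurator rather than argparse, so the parsing code below is illustrative only; the argument names simply mirror the ones used in this tutorial.

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("config", help="configuration file holding the default Training Parameters")
parser.add_argument("--dataset_dir", default="my_dataset", help="Dataset location, relative to the /input mount")
parser.add_argument("--out_dir", default="/output-checkpoint", help="directory where checkpoints will be written")
parser.add_argument("--max_iters", type=int, default=1500, help="optional cap on the number of training iterations")
args = parser.parse_args()

# Inside the Training Runtime the Dataset is mounted under /input,
# so the effective data path becomes /input/<dataset_dir>.
data_dir = os.path.join("/input", args.dataset_dir)
print(f"reading data from {data_dir}; writing checkpoints to {args.out_dir}")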

Putting it all together, the command to run the Training Job looks like this:

flexai training run quickstart-training-job \
--dataset nanoGPT-dataset=my_dataset \
--repository-url https://github.com/flexaihq/nanogpt \
-- train.py config/train_shakespeare_char.py --dataset_dir=my_dataset --out_dir=/output-checkpoint --max_iters=1500
Zooming into the flexai training run flags

These are the flags we’ll use to run the Quickstart Tutorial’s Training Job:

Argument / Flag | Value | Description
Training Job Name | quickstart-training-job | The name of the Training Job
Dataset Name (--dataset) | nanoGPT-dataset=my_dataset | The name of the Dataset to be used for the Training Job, followed by a custom name to use when mounting the Dataset files into the /input path: /input/my_dataset
Repository URL (--repository-url) | https://github.com/flexaihq/nanogpt | The URL of the GitHub repository containing the training code
Entry Point Script | train.py | The path of the entry point training script within the repository

Once the Training Job is running, every time its code calls the torch.save() function, FlexAI’s Managed Checkpoints feature will automatically capture a Checkpoint and store it in the /output-checkpoint directory.

Each Checkpoint will be assigned a unique ID and its creation time will be recorded.

This means that you can go to a specific point in time and retrieve the state of the model at that moment, allowing you to resume training from that point or evaluate the model’s performance on a validation dataset.
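
As a minimal sketch of what this looks like in training code (simplified, not nanoGPT’s actual training loop), every torch.save() call that writes into /output-checkpoint produces a Managed Checkpoint:

import os
import torch

out_dir = "/output-checkpoint"
os.makedirs(out_dir, exist_ok=True)

model = torch.nn.Linear(16, 1)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for iter_num in range(1, 1501):
    # ... forward pass, loss computation, backward pass, optimizer step ...
    if iter_num % 500 == 0:
        checkpoint = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "iter_num": iter_num,
        }
        # Each save is captured as a Checkpoint with its own ID and creation time.
        torch.save(checkpoint, os.path.join(out_dir, "ckpt.pt"))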

After a Training Job completes, the last Checkpoint will be the one with the most recent creation timestamp.

Next, you’ll learn how to get a Training Job’s details and monitor its progress.