Running a Training Job

With a Dataset available on your FlexAI Cloud Services account, you can now run a training job using the flexai training run command.

As mentioned before, a Training Job requires at least a name, a Dataset, a link to a GitHub repository with the training code, and the path to the entry point script.

For this tutorial we will use the FlexAI fork of the nanoGPT repository originally created by Andrej Karpathy: https://github.com/flexaihq/nanogpt.

Training Job Flags

These are the flags we'll use to run the Quickstart Tutorial's Training Job:

Argument / Flag       | Value                               | Description
----------------------|-------------------------------------|------------
Training Job Name     | quickstart-training-job             | The name of the Training Job
Dataset Name          | nanoGPT-dataset                     | The name of the Dataset to be used for the Training Job
Repository URL        | https://github.com/flexaihq/nanogpt | The URL of the GitHub repository containing the training code
Repository Revision   | flexai-main                         | The revision (a branch in this case) of the repository with optimizations made to the original code
Entry Point Script    | train.py                            | The path of the entry point training script as defined by the repository

Training Script Parameters

These include any Environment Settings and Hyperparameters the training script may require. For this tutorial:

Parameter                        | Type                | Description
---------------------------------|---------------------|------------
config/train_shakespeare_char.py | Environment Setting | A positional argument pointing to a configuration file used by nanoGPT's train.py script to set the default Training Parameters
--out_dir=/output                | Environment Setting | The output directory where the training script will write its output files. It should always be /output when running on FlexAI
--max_iters=1500                 | Hyperparameter      | The maximum number of iterations to run the training script for. This is an optional hyperparameter that can be used to tweak the Training Job execution
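The override pattern these parameters rely on can be sketched in plain Python: nanoGPT's train.py first applies the positional config file's defaults, then layers any --key=value flags on top. The snippet below is a simplified, hypothetical re-implementation of that pattern for illustration only; it is not FlexAI or nanoGPT code.

```python
import ast

def apply_overrides(defaults: dict, argv: list) -> dict:
    """Apply nanoGPT-style --key=value overrides on top of config defaults.

    Simplified sketch: the real configurator also executes a positional
    config file (e.g. config/train_shakespeare_char.py) before the flags.
    """
    config = dict(defaults)
    for arg in argv:
        if not arg.startswith("--"):
            continue  # positional args (the config file) are handled elsewhere
        key, _, value = arg[2:].partition("=")
        if key not in config:
            raise KeyError(f"unknown config key: {key}")
        try:
            # Parse numbers, booleans, tuples, etc.
            config[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            config[key] = value  # fall back to a raw string (e.g. paths)
    return config

# Defaults as a config file such as train_shakespeare_char.py might set them:
defaults = {"out_dir": "out", "max_iters": 5000}
config = apply_overrides(defaults, ["--out_dir=/output", "--max_iters=1500"])
print(config)  # {'out_dir': '/output', 'max_iters': 1500}
```

This is why --out_dir=/output stays a string while --max_iters=1500 becomes an integer: values are parsed as Python literals when possible, with raw strings as the fallback.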

FlexAI's training run Command

Putting it all together, the command to run the training job looks like this:

flexai training run quickstart-training-job \
--dataset nanoGPT-dataset \
--repository-url https://github.com/flexaihq/nanogpt \
--repository-revision flexai-main \
-- train.py config/train_shakespeare_char.py --out_dir=/output --max_iters=1500

Listing Training Jobs

You can use the list command to view the status of all the training runs:

flexai training list

This provides an output similar to the following:

NAME                    | DEVICE | NODE | ACCELERATOR | DATASET         | REPOSITORY                          | STATUS   | AGE
------------------------+--------+------+-------------+-----------------+-------------------------------------+----------+------
quickstart-training-job | nvidia | 1    | 1           | nanoGPT-dataset | https://github.com/flexaihq/nanogpt | building | 15s

Viewing Logs

Once the Training Job starts, logs emitted during the process can be retrieved by running the flexai training logs command:

flexai training logs quickstart-training-job

This will output a stream of logs including both the FlexAI runtime execution logs and any stdout and stderr messages emitted by the training scripts.

Infrastructure Metrics

Once your training is running, you can monitor its performance using the FlexAI Infrastructure Monitor. It provides real-time system and GPU metrics that help you optimize your training scripts to take full advantage of FCS compute resources.

Use the FlexAI Infrastructure Monitor data to plan your scaling strategies, guide batch size or data preprocessing decisions, and, in general, make informed decisions based on data from any current or past training runs within the last 30 days.

Getting detailed training run information

You can take a deeper look at the training run status using the flexai training inspect <training run name> command. This is especially useful for debugging:

flexai training inspect quickstart-training-job

Downloading the Training Job's output

Any data written to the /output directory will be compressed into a zip file and made available to you via the flexai training fetch command:

flexai training fetch quickstart-training-job

This will download an output_0.zip file to the current working directory on your host machine.

Once extracted you'll get an output directory that contains a ckpt.pt file, which is the checkpoint of the nanoGPT model you just trained!
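Before extracting, you can sanity-check the downloaded archive using Python's standard library. This is a generic sketch: the output_0.zip filename and the output/ckpt.pt path come from the steps above, and the helper name is purely illustrative.

```python
import zipfile

def archive_members(path: str) -> list:
    """Return the file names inside a fetched Training Job archive."""
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()

# After `flexai training fetch quickstart-training-job` completes, the
# listing for "output_0.zip" should include the "ckpt.pt" checkpoint.
```

A quick membership check like any(name.endswith("ckpt.pt") for name in archive_members("output_0.zip")) confirms the checkpoint was written before you unpack the archive.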

Success!

You're ready to get started!

You've learned how to upload a Dataset and then use it to run a Training Job using training code hosted on a public GitHub repository.

You now have the knowledge required to create and run your own Training Jobs on FlexAI by integrating your own public or private Code Repositories and loading your own datasets.


Next Steps

Using Private Code Repositories

You can use any public or private GitHub repository as the source of your training code when using the --repository-url flag.

However, you can also use the flexai code-registry command to connect your GitHub account to FlexAI and use any of your private repositories as well.

Uploading Datasets from Remote Sources

FlexAI makes it easy to upload Datasets from your host machine through the flexai dataset push command. But wait, there's more!

You can also push Datasets from remote sources, such as S3, GCS, MinIO or R2.

Interactive Training Jobs

With FlexAI you can run an "Interactive Training Job session" that allows you to SSH into a Training Environment where you have access to the entire system, by using the flexai training debug-ssh command.

This is useful for debugging and testing purposes, allowing you to test your training code in the environment it'll be running on, reducing iteration times.

CLI Command Reference

Explore the CLI Command Reference pages to learn about all the ways you can use the FlexAI CLI to manage your workloads.

You will find a page for each CLI Command along with each of its subcommands, example usage, recommendations, flags you can use, output messages, and more!