Running a Training Job
With a Dataset available on your FlexAI Cloud Services account, you can now run a Training Job using the `flexai training run` command.
As mentioned before, a Training Job requires at least a name, a Dataset, a link to a GitHub repository with the training code, and the path to the entry point script.
For this tutorial we will use the FlexAI fork of the nanoGPT repository originally created by Andrej Karpathy: https://github.com/flexaihq/nanogpt.
Training Job Flags
These are the flags we'll use to run the Quickstart Tutorial's Training Job:
| Argument / Flag | Value | Description |
| --- | --- | --- |
| Training Job Name | `quickstart-training-job` | The name of the Training Job |
| Dataset Name | `nanoGPT-dataset` | The name of the Dataset to be used for the Training Job |
| Repository URL | https://github.com/flexaihq/nanogpt | The URL of the GitHub repository containing the training code |
| Repository Revision | `flexai-main` | The revision (in this case, a branch) of the repository with optimizations made to the original code |
| Entry Point Script | `train.py` | The path of the entry point training script, as defined by the repository |
Training Script Parameters
These include any Environment Settings and Hyperparameters the training script may require. For this tutorial:
| Parameter | Type | Description |
| --- | --- | --- |
| `config/train_shakespeare_char.py` | Environment Setting | A positional argument pointing to a configuration file used by nanoGPT's `train.py` script to set the default Training Parameters |
| `--out_dir=/output` | Environment Setting | The output directory where the training script will write its output files. It should always be `/output` when running on FlexAI |
| `--max_iters=1500` | Hyperparameter | The maximum number of iterations to run the training script for. This optional Hyperparameter can be used to tweak the Training Job execution |
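The split between the positional configuration file and `--key=value` overrides can be sketched with a small shell function. This is an illustrative, hypothetical helper (`classify_params` is not part of FlexAI or nanoGPT); it only shows how nanoGPT-style entry points distinguish the two kinds of Training Script Parameters:

```shell
# Illustrative sketch (hypothetical helper, not FlexAI or nanoGPT code):
# a bare argument is treated as the positional config file, while
# --key=value arguments are overrides applied on top of its defaults.
classify_params() {
  for arg in "$@"; do
    case "$arg" in
      --*=*) echo "override:    ${arg#--}" ;;   # e.g. out_dir=/output
      *)     echo "config file: $arg" ;;        # positional configuration file
    esac
  done
}

classify_params config/train_shakespeare_char.py --out_dir=/output --max_iters=1500
# config file: config/train_shakespeare_char.py
# override:    out_dir=/output
# override:    max_iters=1500
```

In other words, the config file sets the defaults, and anything passed as `--key=value` wins over them for that run.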
FlexAI's `training run` Command
Putting it all together, the command to run the Training Job looks like this:
```shell
flexai training run quickstart-training-job \
  --dataset nanoGPT-dataset \
  --repository-url https://github.com/flexaihq/nanogpt \
  --repository-revision flexai-main \
  train.py config/train_shakespeare_char.py --out_dir=/output --max_iters=1500
```
Listing Training Jobs
You can use the `flexai training list` command to view the status of all your training runs:

```shell
flexai training list
```
This provides an output similar to the following:
```
NAME                    | DEVICE | NODE | ACCELERATOR | DATASET         | REPOSITORY                          | STATUS   | AGE
------------------------+--------+------+-------------+-----------------+-------------------------------------+----------+------
quickstart-training-job | nvidia | 1    | 1           | nanoGPT-dataset | https://github.com/flexaihq/nanogpt | building | 15s
```
Viewing Logs
Once the Training Job starts, logs emitted during the process can be retrieved by running the `flexai training logs` command:

```shell
flexai training logs quickstart-training-job
```
This will output a stream of logs, including both the FlexAI runtime execution logs and any stdout and stderr messages emitted by the training script.
Infrastructure Metrics
Once your training is running, you can monitor its performance using the FlexAI Infrastructure Monitor. It provides real-time system and GPU metrics that help you optimize your training scripts and take full advantage of FCS compute resources.
Use the FlexAI Infrastructure Monitor data to plan your scaling strategies, guide batch size or data preprocessing decisions, and, in general, make informed decisions based on data from any current or past training run over the last 30 days.
Getting Detailed Training Run Information
You can take a deeper look at a training run's status using the `flexai training inspect <training run name>` command, which is especially useful for debugging:

```shell
flexai training inspect quickstart-training-job
```
Downloading the Training Job's Output
Any data written to the `/output` directory will be compressed into a zip file and made available to you via the `flexai training fetch` command:

```shell
flexai training fetch quickstart-training-job
```
This will download an `output_0.zip` file to the current working directory on your host machine.
Once extracted, you'll get an `output` directory containing a `ckpt.pt` file: the checkpoint of the nanoGPT model you just trained!
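The local post-fetch steps can be sketched as follows. To keep the example self-contained and runnable without a completed Training Job, the first three lines fabricate a stand-in `output_0.zip`; in practice that file is produced by `flexai training fetch` itself, and only the extraction step applies:

```shell
# Fabricate a stand-in archive (in practice, `flexai training fetch`
# downloads the real output_0.zip for you -- skip these three lines):
mkdir -p output && : > output/ckpt.pt
python3 -m zipfile -c output_0.zip output
rm -rf output

# The actual extraction step:
python3 -m zipfile -e output_0.zip .   # or simply: unzip output_0.zip
ls output/ckpt.pt                      # the trained nanoGPT checkpoint
```

Any standard zip tool works here; `python3 -m zipfile` is used only because it is available wherever Python is installed.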
Success!
You're ready to get started!
You've learned how to upload a Dataset and then use it to run a Training Job using training code hosted on a public GitHub repository.
You now have the knowledge required to create and run your own Training Jobs on FlexAI by integrating your own public or private Code Repositories and loading your Datasets.
Next Steps
Using Private Code Repositories
You can use any public GitHub repository as the source of your training code by passing its URL to the `--repository-url` flag.
However, you can also use the `flexai code-registry` command to connect your GitHub account to FlexAI and use any of your private repositories as well.
Uploading Datasets from Remote Sources
FlexAI makes it easy to upload Datasets from your host machine through the `flexai dataset push` command. But wait, there's more!
You can also push Datasets from remote sources, such as S3, GCS, MinIO, or R2.
Interactive Training Jobs
With FlexAI you can run an "Interactive Training Job session" that allows you to SSH into a Training Environment with access to the entire system, using the `flexai training debug-ssh` command.
This is useful for debugging and testing purposes, allowing you to test your training code in the environment it'll be running on, reducing iteration times.
CLI Command Reference
Explore the CLI Command Reference pages to learn about all the ways you can use the FlexAI CLI to manage your workloads.
You will find a page for each CLI Command along with each of its subcommands, example usage, recommendations, flags you can use, output messages, and more!