Running a Training Job
With a Dataset available on your FlexAI Cloud Services account, you can now run a Training Job using the `flexai training run` command.
As mentioned before, a Training Job requires at least a name, a Dataset, a link to a GitHub repository with the training code, and the path to the entry point script.
For this tutorial we will use the FlexAI fork of the nanoGPT repository originally created by Andrej Karpathy: https://github.com/flexaihq/nanogpt.
Training Job Flags
These are the flags we'll use to run the Quickstart Tutorial's Training Job:
| Argument / Flag | Value | Description |
| --- | --- | --- |
| Training Job Name | `quickstart-training-job` | The name of the Training Job |
| Dataset Name | `nanoGPT-dataset` | The name of the Dataset to be used for the Training Job |
| Repository URL | https://github.com/flexaihq/nanogpt | The URL of the GitHub repository containing the training code |
| Repository Revision | `flexai-main` | The revision (in this case, a branch) of the repository with optimizations made to the original code |
| Entry Point Script | `train.py` | The path of the entry point training script, as defined by the repository |
Training Script Parameters
These include any Environment Settings and Hyperparameters the training script may require. For this tutorial:
| Parameter | Type | Description |
| --- | --- | --- |
| `config/train_shakespeare_char.py` | Environment Setting | A positional argument pointing to a configuration file used by nanoGPT's `train.py` script to set the default Training Parameters |
| `--out_dir=/output` | Environment Setting | The output directory where the training script will write its output files. It should always be `/output` when running on FlexAI |
| `--max_iters=1500` | Hyperparameter | The maximum number of iterations to run the training script for. This optional Hyperparameter can be used to tweak the Training Job execution |
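The split between the positional configuration file and `--key=value` overrides can be sketched with a small shell function. This is an illustrative, hypothetical helper (`classify_params` is not part of FlexAI or nanoGPT); it only shows how nanoGPT-style entry points distinguish the two kinds of Training Script Parameters:

```shell
# Illustrative sketch (hypothetical helper, not FlexAI or nanoGPT code):
# a bare argument is treated as the positional config file, while
# --key=value arguments are overrides applied on top of its defaults.
classify_params() {
  for arg in "$@"; do
    case "$arg" in
      --*=*) echo "override:    ${arg#--}" ;;   # e.g. out_dir=/output
      *)     echo "config file: $arg" ;;        # positional configuration file
    esac
  done
}

classify_params config/train_shakespeare_char.py --out_dir=/output --max_iters=1500
# config file: config/train_shakespeare_char.py
# override:    out_dir=/output
# override:    max_iters=1500
```

In other words, the config file sets the defaults, and anything passed as `--key=value` wins over them for that run.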
FlexAI's `training run` Command
Putting it all together, the command to run the Training Job looks like this:
```shell
flexai training run quickstart-training-job \
  --dataset nanoGPT-dataset \
  --repository-url https://github.com/flexaihq/nanogpt \
  --repository-revision flexai-main \
  train.py config/train_shakespeare_char.py --out_dir=/output --max_iters=1500
```
Listing Training Jobs
You can use the `flexai training list` command to view the status of all your training runs:

```shell
flexai training list
```
This provides an output similar to the following:
```
NAME                    | DEVICE | NODE | ACCELERATOR | DATASET         | REPOSITORY                          | STATUS   | AGE
------------------------+--------+------+-------------+-----------------+-------------------------------------+----------+------
quickstart-training-job | nvidia | 1    | 1           | nanoGPT-dataset | https://github.com/flexaihq/nanogpt | building | 15s
```
Viewing Logs
Once the Training Job starts, logs emitted during the process can be retrieved by running the `flexai training logs` command:

```shell
flexai training logs quickstart-training-job
```
This will output a stream of logs, including both the FlexAI runtime execution logs and any stdout and stderr messages emitted by the training script.
Infrastructure Metrics
Once your training is running, you can monitor its performance using the FlexAI Infrastructure Monitor. It provides real-time system and GPU metrics that help you optimize your training scripts and take full advantage of FCS compute resources.
Use the FlexAI Infrastructure Monitor data to plan your scaling strategies, guide batch size or data preprocessing decisions, and, in general, make informed decisions based on data from any current or past training run over the last 30 days.
Getting Detailed Training Run Information
You can take a deeper look at a training run's status using the `flexai training inspect <training run name>` command, which is especially useful for debugging:

```shell
flexai training inspect quickstart-training-job
```
Downloading the Training Job's Output
Any data written to the `/output` directory will be compressed into a zip file and made available to you via the `flexai training fetch` command:

```shell
flexai training fetch quickstart-training-job
```
This will download an `output_0.zip` file to the current working directory on your host machine.
Once extracted, you'll get an `output` directory containing a `ckpt.pt` file: the checkpoint of the nanoGPT model you just trained!
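The local post-fetch steps can be sketched as follows. To keep the example self-contained and runnable without a completed Training Job, the first three lines fabricate a stand-in `output_0.zip`; in practice that file is produced by `flexai training fetch` itself, and only the extraction step applies:

```shell
# Fabricate a stand-in archive (in practice, `flexai training fetch`
# downloads the real output_0.zip for you -- skip these three lines):
mkdir -p output && : > output/ckpt.pt
python3 -m zipfile -c output_0.zip output
rm -rf output

# The actual extraction step:
python3 -m zipfile -e output_0.zip .   # or simply: unzip output_0.zip
ls output/ckpt.pt                      # the trained nanoGPT checkpoint
```

Any standard zip tool works here; `python3 -m zipfile` is used only because it is available wherever Python is installed.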
Success!
You're ready to get started!
You've learned how to upload a Dataset and then use it to run a Training Job using training code hosted on a public GitHub repository.
You now have the knowledge required to create and run your own Training Jobs on FlexAI by integrating your own public or private Code Repositories and loading your Datasets.
Next Steps
Using Private Code Repositories
You can use any public GitHub repository as the source of your training code by passing its URL to the `--repository-url` flag.
However, you can also use the `flexai code-registry` command to connect your GitHub account to FlexAI and use any of your private repositories as well.
Uploading Datasets from Remote Sources
FlexAI makes it easy to upload Datasets from your host machine through the `flexai dataset push` command. But wait, there's more!
You can also push Datasets from remote sources, such as S3, GCS, MinIO, or R2.
Interactive Training Jobs
With FlexAI you can run an "Interactive Training Job session" that allows you to SSH into a Training Environment with access to the entire system, using the `flexai training debug-ssh` command.
This is useful for debugging and testing purposes, allowing you to test your training code in the environment it'll be running on, reducing iteration times.
CLI Command Reference
Explore the CLI Command Reference pages to learn about all the ways you can use the FlexAI CLI to manage your workloads.
You will find a page for each CLI Command along with each of its subcommands, example usage, recommendations, flags you can use, output messages, and more!