Command: training
The flexai training command manages training operations: from starting a new Training Job, retrieving its logs, and inspecting its execution, to downloading its output artifacts.
training checkpoints
Lists the checkpoints generated by a Training Job.
Training scripts should use the torch.save() function to write checkpoint files to the Training runtime environment's /output directory.
flexai training checkpoints <training_name>
Example
flexai training checkpoints gpt2training-1
ID | TIMESTAMP
---------------------------------------+---------------------------------
a1b18a7f-9b85-4c74-91a9-6aca526e8ce4 | 2024-12-18 13:48:04.078 +0000 WET
90c8f215-e131-4f9c-936c-12fe1fe9a6f1 | 2024-12-18 13:48:15.061 +0000 WET
304dd2cd-9067-4146-8258-782d70e2be5e | 2024-12-18 13:48:26.057 +0000 WET
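On the training-script side, the pattern that produces these checkpoints can be sketched as follows. Only the use of torch.save() and the /output directory come from this documentation; the model, optimizer, and file naming are illustrative, not part of any FlexAI API.

```python
import os

import torch


def save_checkpoint(model, optimizer, step, out_dir="/output"):
    """Serialize training state with torch.save() into the runtime's
    output directory, where FlexAI collects checkpoints.

    The checkpoint file name and the dict layout are illustrative.
    """
    path = os.path.join(out_dir, f"ckpt_step_{step}.pt")
    torch.save(
        {
            "step": step,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        path,
    )
    return path
```

Checkpoints written this way can later be listed with flexai training checkpoints and resumed from with the -C / --checkpoint flag.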
training debug-ssh
Starts an Interactive Training Job that allows connecting through SSH or VSCode to the Training Runtime, useful for fast test iterations.
The --vscode flag is optional, but highly recommended to leverage the full potential of the Interactive Training Runtime.
flexai training debug-ssh --repository-url <repository_url> [--vscode]
Visit the Interactive Training guide for more details on getting started, recommendations, and troubleshooting options.
Flags
Flag | Type | Optional / Required | Definition | Example |
---|---|---|---|---|
--authorized-keys | String | Optional | Path to an SSH public key. Note that if a host ssh-agent is not running, this flag is required | ~/.ssh/id_ed25519.pub |
-b / --repository-revision | String | Optional | A Commit, Branch, or Tag of the repository specified with -u / --repository-url . Default: main | 53f6b645fc5d039152aef884def64288e3eeb56b |
-C / --checkpoint | String | Optional | Either the ID of a checkpoint generated during a Training Job's execution (see flexai training checkpoints ), or the name of a user-provided checkpoint (see flexai checkpoint ) | a1b18a7f-9b85-4c74-91a9-6aca526e8ce4 , mistral-500-checkpoint |
-D / --dataset | String or Key/Value mapping | Optional | Key/Value pair representing the Dataset to mount to the Training Runtime and its destination mount path. Syntax: <dataset_name>=<dataset_mount_path> (=<dataset_mount_path> is optional). Multiple Datasets can be passed to a Training Job, and each will be mounted to either /input/<dataset_name> or /input/<dataset_mount_path> depending on how the value is passed | open-web-text |
-d / --device-arch | String | Optional | Hardware target | nvidia |
--dotfiles | String | Optional | GitHub repository URL to a dotfiles repository that will be installed in the Interactive Training Runtime with yadm | https://github.com/OrganizationName/dotfiles.git |
-E / --env | Key/Value mapping | Optional | Environment variable in the format -E key=value where key is the name of the environment variable on the Training Job's environment and value is its value. The key should follow the defined naming conventions | WANDB_PROJECT=my-wandb-project |
--git-author-email | String | Optional | The pre-configured git config user.email in the Interactive Training Runtime, defaults to git config user.email in the host environment | george@vandelay-industri.es |
--git-author-name | String | Optional | The pre-configured git config user.name in the Interactive Training Runtime, defaults to git config user.name in the host environment | "George Costanza" |
-S / --secret | Key/Value mapping | Optional | Secret to be installed inside the Interactive Training Runtime in the format -S key=value where key is the name of the environment variable on the Training Job's Runtime where the Secret associated with value will be placed. Secret values are not exposed to commands such as flexai training inspect . Use flexai secret create to add Secrets. The key should follow the defined naming conventions | WANDB_API_KEY=wandb |
--session-timeout | Integer | Optional | The duration in seconds that the Interactive Training Job will remain available without any active SSH session. Default: 600 | 800 |
-u / --repository-url | String | Required | The URL of the repository with the model's source code to use | https://github.com/flexaihq/nanoGPT/ |
--vscode | Boolean | Optional | Immediately open a new instance of VSCode and attach to the Interactive Training Runtime when it is ready. Requires the Remote SSH VSCode extension |
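For illustration, a hypothetical invocation combining several of these flags might look like the following; the repository URL and Secret name are placeholders.

```shell
# Hypothetical example: start an interactive session on NVIDIA hardware,
# expose a previously created Secret as WANDB_API_KEY, and open VSCode
# as soon as the runtime is ready.
flexai training debug-ssh \
  --repository-url https://github.com/flexaihq/nanoGPT \
  --device-arch nvidia \
  --secret WANDB_API_KEY=wandb \
  --vscode
```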
More information about Interactive Training, how to better leverage it and how to establish a connection is available in the dedicated Interactive Training guide.
training delete
Deletes a Training Job. Only Training Jobs that are in a "terminal" status can be deleted. Visit the Training Lifecycle Statuses table for more information on "terminal" statuses.
flexai training delete <training_name>
training fetch
Retrieves the output artifacts of the specified Training Job as a compressed file that is downloaded to the host machine.
flexai training fetch <training_name> [--destination <output_directory>]
Flags
Flag | Type | Optional / Required | Definition | Example |
---|---|---|---|---|
-d / --destination | String | Optional | Existing local directory where the output artifacts will be downloaded | -d training_artifacts |
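A hypothetical invocation, using a placeholder Training Job name:

```shell
# Download the output artifacts of a finished Training Job
# into the existing local directory ./training_artifacts
flexai training fetch gpt2-training-1 --destination training_artifacts
```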
training inspect
Returns detailed information about a Training Job from its inception time. It outputs the information in YAML format by default:
flexai training inspect <training_name>
Flags
Flag | Type | Optional / Required | Definition | Example |
---|---|---|---|---|
--json | Flag | Optional | Output the information in JSON format | --json |
Details on the returned information
Returned information
Field | Description | Data Type |
---|---|---|
metadata | Metadata about the Training Job | Object |
metadata.id | Unique identifier for the Training Job | UUID |
metadata.name | Name of the Training Job | String |
metadata.creatorUserID | User ID of the creator | UUID |
metadata.ownerOrgID | Organization ID of the owner | UUID |
config | Training Job settings | Object |
config.device | Type of device used for training | String |
config.nodes | Number of nodes used | Integer |
config.accelerator | Number of accelerators used | Integer |
config.entrypoint | List of arguments passed after the End-of-Options marker (-- ) | String List |
config.datasetsNames | Names of the Datasets being used | String List |
config.checkpointName | Name of the Checkpoint being used, if any | String |
config.repositoryURL | URL of the code repository being used | String |
config.repositoryRevision | Git revision being used (Branch, Tag, Hash) | String |
config.secrets | List of Secrets that have been set, if any | String List |
config.environment | List of Environment variables that have been set, if any | String List |
runtime | Status of the Training Job run | Object |
runtime.status | Current status of the Training Job | String |
runtime.queuePosition | Position in the queue | Integer |
runtime.selectedAgentId | ID of the selected agent | String |
runtime.repositoryRevisionSha | SHA hash of the repository's revision being used | String |
runtime.createdAt | Timestamp when the Training Job was created | ISO 8601 Timestamp |
runtime.lastUpdate | Timestamp of the last update | ISO 8601 Timestamp |
runtime.lifecycleEvents | List of lifecycle events | Object List |
See below for lifecycle events statuses.
Example
flexai training inspect nanogpt-training-job
metadata:
  id: 0409fb6a-0925-4644-b54c-baeb3e0401e5
  name: nanogpt-training-job
  creatorUserID: bd67af19-2599-4a57-832e-a1ac042f48be
  ownerOrgID: 270a5476-b91a-442f-8a13-852ef7bb5b94
config:
  device: nvidia
  nodes: 2
  accelerator: 8
  entrypoint:
    - train.py
    - config/train_shakespeare_char.py
    - --out_dir=/output
  datasetsNames:
    - nanoGPT-dataset
  checkpointName: ""
  repositoryURL: https://github.com/flexaihq/nanoGPT.git
  repositoryRevision: flexai-main
  secrets: []
  environment: []
runtime:
  status: succeeded
  queuePosition: 0
  selectedAgentId: k8s-training-oci-001-client-prod
  repositoryRevisionSha: f93e83faeae47f02891ca1818aeff6ae4d42eb39
  createdAt: "2024-10-02T13:21:10Z"
  lastUpdate: "2024-10-02T13:28:36Z"
  lifecycleEvents:
    - ...
training list
Lists all Training Jobs that have been initiated.
flexai training list
Example
A list of Training Jobs that have been initiated, presented in a tabular format with the following data:
flexai training list
NAME | DEVICE | NODE | ACCELERATOR | DATASET | REPOSITORY | STATUS | AGE
-------------------------------+--------+------+-------------+------------------------+-------------------------------------------------+-----------+------
training-job--will-succeed | nvidia | 1 | 1 | dataset-nanogpt | https://github.com/flexaihq/nanoGPT.git | succeeded | 2d
training-job-doomed-to-fail | nvidia | 1 | 1 | my-other-dataset | https://github.com/my-org-name/my-repo-123.git | failed | 1d
training logs
Streams the logs emitted by the training script during its execution, printing messages written to the standard output (stdout) and standard error (stderr) streams.
flexai training logs <training_name>
training run
Initiates the process of allocating the required resources to create an environment where a training workload will be executed. The minimum requirements for a Training Job are:
- A name
- A source code repository URL
- At least one Dataset
- An entry point script with any number of arguments it may require
flexai training run <training_name> \
  --repository-url <repository_url> --dataset <dataset_name> \
  -- <entry_point_script_path>
Arguments
Argument | Definition | Example |
---|---|---|
training_name | Resource name. Must follow the FCS resource naming conventions | gpt2-training-1 |
-- | End-of-options marker. The first positional argument will be your entry point script, which is generally the training script | -- |
entry_point_script_path | The path to the script in the source code repository that will initiate the training process | main.py , src/train.py |
Flags
Flag | Type | Optional / Required | Definition | Example |
---|---|---|---|---|
-a / --accels | Integer | Optional | The number of accelerators to use | 1 |
-b / --repository-revision | String | Optional | A Commit, Branch, or Tag of the repository specified with -u / --repository-url . Default: main | 53f6b645fc5d039152aef884def64288e3eeb56b |
-C / --checkpoint | String | Optional | Either the ID of a checkpoint generated during a Training Job's execution (see flexai training checkpoints ), or the name of a user-provided checkpoint (see flexai checkpoint ) | a1b18a7f-9b85-4c74-91a9-6aca526e8ce4 , mistral-500-checkpoint |
-D / --dataset | String or Key/Value mapping | Optional | Key/Value pair representing the Dataset to mount to the Training Runtime and its destination mount path. Syntax: <dataset_name>=<dataset_mount_path> (=<dataset_mount_path> is optional). Multiple Datasets can be passed to a Training Job, and each will be mounted to either /input/<dataset_name> or /input/<dataset_mount_path> depending on how the value is passed | open-web-text |
-d / --device-arch | String | Optional | Hardware target | nvidia |
-E / --env | Key/Value mapping | Optional | Environment variable in the format -E key=value where key is the name of the environment variable on the Training Job's environment and value is its value. The key should follow the defined naming conventions | WANDB_PROJECT=my-wandb-project |
-n / --nodes | Integer | Optional | The number of parallel nodes running the Training Job | 1 |
-S / --secret | Key/Value mapping | Optional | Secret to be passed to the Training Job in the format -S key=value where key is the name of the environment variable on the Training Job's environment where the Secret associated with value will be placed. Secret values are not exposed to commands such as flexai training inspect . Use flexai secret create to add Secrets. The key should follow the defined naming conventions | WANDB_API_KEY=wandb |
-u / --repository-url | String | Required | The URL of the repository with the model's source code to use | https://github.com/flexaihq/nanoGPT |
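Putting these pieces together, a hypothetical invocation might look like the following; the repository, Dataset, and script names are placeholders. Note that passing open-web-text=owt mounts the Dataset at /input/owt rather than /input/open-web-text.

```shell
# Hypothetical example: train on 2 accelerators; everything after the
# end-of-options marker (--) is the entry point script and its arguments.
flexai training run gpt2-training-1 \
  --repository-url https://github.com/flexaihq/nanoGPT \
  --dataset open-web-text=owt \
  --accels 2 \
  -- train.py --out_dir=/output
```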
If you use 1 node (--nodes) and 1 accelerator (--accels), FlexAI Cloud Services will run your script using Python. If you use more than 1 accelerator or node, your script will be executed using torchrun and you will have access to its standard environment variables.
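Because torchrun sets its standard environment variables (RANK, LOCAL_RANK, WORLD_SIZE), a training script can read them as sketched below; the defaults cover the single-process case, and the helper name is illustrative.

```python
import os


def distributed_context():
    """Read the standard torchrun environment variables, if present.

    When the job runs with a single node and accelerator (plain Python,
    no torchrun), the variables are absent and the defaults apply.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }
```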
Visit the "Running a training" section of the Quickstart tutorial for a practical example using the nanoGPT model.
training stop
Stops a Training Job.
flexai training stop <training_name>
A Training Job can be stopped when it is in any of the following statuses:
scheduling
building
in progress
enqueued
Training Lifecycle Statuses
A Training Job will go through a subset of the following statuses during its lifecycle:
Status | Description | Is Terminal |
---|---|---|
pending | The initial status of a training. It means the Training Job was stored in FCS | false |
scheduling | FCS is looking for a suitable Node to handle the training given its hardware requirements | false |
rejected | A Node that meets the requirements specified for the Training Job could not be found or is unavailable at the moment | true |
building | A Node suitable for the Training Job was found. The building process has started. FCS is gathering all the components required for the training, in particular, cloning the revision specified for the source repository and installing the required dependencies | false |
in progress | The building process completed successfully. The required FCS compute resources are being allocated | false |
enqueued | The required compute resources specified for the Training Job were not available at the time. The Training Job is temporarily put on hold and will be scheduled to start once the required resources are freed up | false
succeeded | The Training Job completed successfully. The entrypoint training script terminated with exit code 0. Output artifacts can be downloaded using the training fetch command | true |
failed | The Training Job did not complete successfully. A Training Job can fail for a number of reasons | true |
stop in progress | The user requested that the Training Job be stopped, and the stopping process is being performed | false |
stopped | The Training Job was successfully stopped. If the Training Job was stopped while its status was scheduling , building , or enqueued , then no GPU resources were allocated. If it was stopped after its status changed to in progress , then the hardware resources that had been allocated were released after the Training Job was successfully stopped | true |
stop failed | The process of stopping the Training Job failed | true |
A Training Job in a "terminal" status can be deleted using the flexai training delete command.