
Command: training

The flexai training command manages training operations: from starting a new Training Job, retrieving its logs, and inspecting its execution, to downloading its output artifacts.

training checkpoints

Lists the checkpoints generated by a Training Job.

Training scripts should use the torch.save() function to write checkpoint files to the Training runtime environment's /output directory.
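For instance, a minimal PyTorch sketch of checkpoint saving inside a training loop (the model, optimizer, and file name below are illustrative placeholders, not a FlexAI requirement):

import os

import torch

def save_checkpoint(model, optimizer, step):
    # FlexAI collects checkpoint files written to the /output directory
    path = os.path.join("/output", f"ckpt_step_{step}.pt")
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "step": step,
        },
        path,
    )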

flexai training checkpoints <training_name>

Example

flexai training checkpoints gpt2training-1

ID | TIMESTAMP
---------------------------------------+---------------------------------
a1b18a7f-9b85-4c74-91a9-6aca526e8ce4 | 2024-12-18 13:48:04.078 +0000 WET
90c8f215-e131-4f9c-936c-12fe1fe9a6f1 | 2024-12-18 13:48:15.061 +0000 WET
304dd2cd-9067-4146-8258-782d70e2be5e | 2024-12-18 13:48:26.057 +0000 WET
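
A checkpoint ID from this list can then be passed to the -C / --checkpoint flag of flexai training run, for example to resume training from the latest checkpoint (the new job name below is illustrative):

flexai training run gpt2training-2 \
  --repository-url https://github.com/flexaihq/nanoGPT \
  --dataset open-web-text \
  --checkpoint 304dd2cd-9067-4146-8258-782d70e2be5e \
  -- train.py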

training debug-ssh

Starts an Interactive Training Job that allows connecting through SSH or VSCode to the Training Runtime, useful for fast test iterations.

The --vscode flag is optional, but highly recommended to leverage the full potential of the Interactive Training Runtime.

flexai training debug-ssh --repository-url <repository_url> [--vscode]

Visit the Interactive Training guide for details on getting started, recommendations, and troubleshooting options.

Flags

Flag | Type | Optional / Required | Definition | Example
---- | ---- | ------------------- | ---------- | -------
--authorized-keys | String | Optional | Path to an SSH public key. Note that if a host ssh-agent is not running, this flag is required | ~/.ssh/id_ed25519.pub
-b / --repository-revision | String | Optional | Either of: Commit, Branch, or Tag of the repository specified by -u / --repository-url. Default: main | 53f6b645fc5d039152aef884def64288e3eeb56b
-C / --checkpoint | String | Optional | Either the ID of a checkpoint generated during a Training Job's execution (see flexai training checkpoints) or the name of a user-provided checkpoint (see flexai checkpoint) | a1b18a7f-9b85-4c74-91a9-6aca526e8ce4, mistral-500-checkpoint
-D / --dataset | String or Key/Value mapping | Optional | Key/Value pair representing the Dataset to mount to the Training Runtime and its destination mount path. Syntax: <dataset_name>=<dataset_mount_path> (=<dataset_mount_path> is optional). Multiple Datasets can be passed to a Training Job; each is mounted to either /input/<dataset_name> or /input/<dataset_mount_path>, depending on how the value is passed | open-web-text
-d / --device-arch | String | Optional | Hardware target | nvidia
--dotfiles | String | Optional | GitHub repository URL to a dotfiles repository that will be installed in the Interactive Training Runtime with yadm | https://github.com/OrganizationName/dotfiles.git
-E / --env | Key/Value mapping | Optional | Environment variable in the format -E key=value, where key is the name of the environment variable in the Training Job's environment and value is its value. The key should follow the defined naming conventions | WANDB_PROJECT=my-wandb-project
--git-author-email | String | Optional | The pre-configured git config user.email in the Interactive Training Runtime. Defaults to git config user.email in the host environment | george@vandelay-industri.es
--git-author-name | String | Optional | The pre-configured git config user.name in the Interactive Training Runtime. Defaults to git config user.name in the host environment | "George Costanza"
-S / --secret | Key/Value mapping | Optional | Secret to be installed inside the Interactive Training Runtime in the format -S key=value, where key is the name of the environment variable in the Training Job's Runtime where the Secret associated with value will be placed. Secret values are not exposed to commands such as flexai training inspect. Use flexai secret create to add Secrets. The key should follow the defined naming conventions | WANDB_API_KEY=wandb
--session-timeout | Integer | Optional | The duration in seconds that the Interactive Training Job will remain available without any active SSH session. Default: 600 | 800
-u / --repository-url | String | Required | The URL of the repository with the model's source code to use | https://github.com/flexaihq/nanoGPT/
--vscode | Boolean | Optional | Immediately open a new instance of VSCode and attach it to the Interactive Training Runtime when it is ready. Requires the Remote SSH VSCode extension |
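
Example (assuming no ssh-agent is running on the host, so --authorized-keys is passed explicitly):

flexai training debug-ssh \
  --repository-url https://github.com/flexaihq/nanoGPT \
  --authorized-keys ~/.ssh/id_ed25519.pub \
  --vscode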

More information about Interactive Training, how to better leverage it and how to establish a connection is available in the dedicated Interactive Training guide.

training delete

Deletes a Training Job. Only Training Jobs that are in a "terminal" status can be deleted. Visit the Training Lifecycle Statuses table for more information on "terminal" statuses.

flexai training delete <training_name>

training fetch

Retrieves the output artifacts of the specified Training Job as a compressed file that is downloaded to the host machine.

flexai training fetch <training_name> [--destination <output_directory>]

Flags

Flag | Type | Optional / Required | Definition | Example
---- | ---- | ------------------- | ---------- | -------
-d / --destination | String | Optional | Existing local directory where the output artifacts will be downloaded | -d training_artifacts
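
Example (the Training Job name is taken from the checkpoints example above; the destination directory must already exist):

mkdir -p training_artifacts
flexai training fetch gpt2training-1 --destination training_artifacts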

training inspect

Returns detailed information about a Training Job, covering its entire lifetime since creation. The information is output in YAML format by default:

flexai training inspect <training_name>

Flags

Flag | Type | Optional / Required | Definition | Example
---- | ---- | ------------------- | ---------- | -------
--json | Flag | Optional | Output the information in JSON format | --json

Returned information

Field | Description | Data Type
----- | ----------- | ---------
metadata | Metadata about the Training Job | Object
metadata.id | Unique identifier for the Training Job | UUID
metadata.name | Name of the Training Job | String
metadata.creatorUserID | User ID of the creator | UUID
metadata.ownerOrgID | Organization ID of the owner | UUID
config | Training Job settings | Object
config.device | Type of device used for training | String
config.nodes | Number of nodes used | Integer
config.accelerator | Number of accelerators used | Integer
config.entrypoint | List of arguments passed after the End-of-Options marker (--) | String List
config.datasetsNames | Names of the Datasets being used | String List
config.checkpointName | Name of the Checkpoint being used, if any | String
config.repositoryURL | URL of the code repository being used | String
config.repositoryRevision | Git revision being used (Branch, Tag, Hash) | String
config.secrets | List of Secrets that have been set, if any | String List
config.environment | List of Environment variables that have been set, if any | String List
runtime | Status of the Training Job run | Object
runtime.status | Current status of the Training Job | String
runtime.queuePosition | Position in the queue | Integer
runtime.selectedAgentId | ID of the selected agent | String
runtime.repositoryRevisionSha | SHA hash of the repository's revision being used | String
runtime.createdAt | Timestamp when the Training Job was created | ISO 8601 Timestamp
runtime.lastUpdate | Timestamp of the last update | ISO 8601 Timestamp
runtime.lifecycleEvents | List of lifecycle events | Object List

See below for lifecycle events statuses.

Example

flexai training inspect nanogpt-training-job

metadata:
  id: 0409fb6a-0925-4644-b54c-baeb3e0401e5
  name: nanogpt-training-job
  creatorUserID: bd67af19-2599-4a57-832e-a1ac042f48be
  ownerOrgID: 270a5476-b91a-442f-8a13-852ef7bb5b94
config:
  device: nvidia
  nodes: 2
  accelerator: 8
  entrypoint:
    - train.py
    - config/train_shakespeare_char.py
    - --out_dir=/output
  datasetsNames:
    - nanoGPT-dataset
  checkpointName: ""
  repositoryURL: https://github.com/flexaihq/nanoGPT.git
  repositoryRevision: flexai-main
  secrets: []
  environment: []
runtime:
  status: succeeded
  queuePosition: 0
  selectedAgentId: k8s-training-oci-001-client-prod
  repositoryRevisionSha: f93e83faeae47f02891ca1818aeff6ae4d42eb39
  createdAt: "2024-10-02T13:21:10Z"
  lastUpdate: "2024-10-02T13:28:36Z"
  lifecycleEvents:
    - ...
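
When scripting, the --json flag pairs well with a JSON processor such as jq. For example, assuming the JSON output mirrors the field layout shown above, the current status could be extracted with:

flexai training inspect nanogpt-training-job --json | jq -r '.runtime.status'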

training list

Lists all Training Jobs that have been initiated.

flexai training list

Example

A list of Training Jobs that have been initiated, presented in a tabular format with the following data:

flexai training list

NAME | DEVICE | NODE | ACCELERATOR | DATASET | REPOSITORY | STATUS | AGE
-------------------------------+--------+------+-------------+------------------------+-------------------------------------------------+-----------+------
training-job--will-succeed | nvidia | 1 | 1 | dataset-nanogpt | https://github.com/flexaihq/nanoGPT.git | succeeded | 2d
training-job-doomed-to-fail | nvidia | 1 | 1 | my-other-dataset | https://github.com/my-org-name/my-repo-123.git | failed | 1d

training logs

Streams the logs emitted by the training script during its execution, printing the messages it writes to the standard output (stdout) and standard error (stderr) streams.

flexai training logs <training_name>

training run

Initiates the process of allocating the required resources to create an environment where a training workload will be executed. The minimum requirements for a Training Job are:

  • A name
  • A source code repository URL
  • At least one Dataset
  • An entry point script, along with any arguments it may require
flexai training run <training_name> \
--repository-url <repository_url> --dataset <dataset_name>... -- <entry_point_script_path>

Arguments

Argument | Definition | Example
-------- | ---------- | -------
training_name | Resource name. Must follow the FCS resource naming conventions | gpt2-training-1
-- | End-of-Options marker. The first positional argument after it will be your entry point script, which is generally the training script | --
entry_point_script_path | The path to the script in the source code repository that will initiate the training process | main.py, src/train.py

Flags

Flag | Type | Optional / Required | Definition | Example
---- | ---- | ------------------- | ---------- | -------
-a / --accels | Integer | Optional | The number of accelerators to use | 1
-b / --repository-revision | String | Optional | Either of: Commit, Branch, or Tag of the repository specified by -u / --repository-url. Default: main | 53f6b645fc5d039152aef884def64288e3eeb56b
-C / --checkpoint | String | Optional | Either the ID of a checkpoint generated during a Training Job's execution (see flexai training checkpoints) or the name of a user-provided checkpoint (see flexai checkpoint) | a1b18a7f-9b85-4c74-91a9-6aca526e8ce4, mistral-500-checkpoint
-D / --dataset | String or Key/Value mapping | Optional | Key/Value pair representing the Dataset to mount to the Training Runtime and its destination mount path. Syntax: <dataset_name>=<dataset_mount_path> (=<dataset_mount_path> is optional). Multiple Datasets can be passed to a Training Job; each is mounted to either /input/<dataset_name> or /input/<dataset_mount_path>, depending on how the value is passed | open-web-text
-d / --device-arch | String | Optional | Hardware target | nvidia
-E / --env | Key/Value mapping | Optional | Environment variable in the format -E key=value, where key is the name of the environment variable in the Training Job's environment and value is its value. The key should follow the defined naming conventions | WANDB_PROJECT=my-wandb-project
-n / --nodes | Integer | Optional | The number of parallel nodes running the Training Job | 1
-S / --secret | Key/Value mapping | Optional | Secret to be passed to the Training Job in the format -S key=value, where key is the name of the environment variable in the Training Job's environment where the Secret associated with value will be placed. Secret values are not exposed to commands such as flexai training inspect. Use flexai secret create to add Secrets. The key should follow the defined naming conventions | WANDB_API_KEY=wandb
-u / --repository-url | String | Required | The URL of the repository with the model's source code to use | https://github.com/flexaihq/nanoGPT
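
As a worked example, the Training Job shown in the training inspect section above could plausibly have been started with a command along these lines:

flexai training run nanogpt-training-job \
  --repository-url https://github.com/flexaihq/nanoGPT \
  --repository-revision flexai-main \
  --device-arch nvidia \
  --nodes 2 --accels 8 \
  --dataset nanoGPT-dataset \
  -- train.py config/train_shakespeare_char.py --out_dir=/output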
note

If you use 1 node (--nodes) and 1 accelerator (--accels), FlexAI Cloud Services will run your script using Python.

If you use more than 1 accelerator or node, your script will be executed using torchrun, and you will have access to its standard environment variables.
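
For instance, a training script can read the standard torchrun environment variables to coordinate multi-process work; a minimal sketch:

import os

# Standard torchrun environment variables (set when more than one
# node or accelerator is used; the defaults cover the single-process case).
rank = int(os.environ.get("RANK", "0"))              # global process rank
local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # rank within this node
world_size = int(os.environ.get("WORLD_SIZE", "1"))  # total number of processes

if rank == 0:
    print(f"Coordinating {world_size} processes")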

Visit the "Running a training" section of the Quickstart tutorial for a practical example using the nanoGPT model.

training stop

Stops a Training Job.

flexai training stop <training_name>

A Training Job can be stopped when it is in any of the following statuses:

  • scheduling
  • building
  • in progress
  • enqueued

Training Lifecycle Statuses

A Training Job will go through a subset of the following statuses during its lifecycle:

Status | Description | Is Terminal
------ | ----------- | -----------
pending | The initial status of a training. It means the Training Job was stored in FCS | false
scheduling | FCS is looking for a suitable Node to handle the training given its hardware requirements | false
rejected | A Node that meets the requirements specified for the Training Job could not be found or is unavailable at the moment | true
building | A Node suitable for the Training Job was found and the building process has started. FCS is gathering all the components required for the training: in particular, cloning the specified revision of the source repository and installing the required dependencies | false
in progress | The building process completed successfully. The required FCS compute resources are being allocated | false
enqueued | The compute resources specified for the Training Job were not available at the time. The Training Job is temporarily put on hold and will be scheduled to start once the required resources are freed up | false
succeeded | The Training Job completed successfully: the entry point training script terminated with exit code 0. Output artifacts can be downloaded using the training fetch command | true
failed | The Training Job failed for one of the following reasons: the --repository-revision could not be found; the requirements.txt file could not be found in the root directory of the repository; the entry point training script terminated with a non-zero exit code; or the Training Job's duration exceeded the time limit (BackoffLimitExceeded) | true
stop in progress | The user requested that the Training Job be stopped, and the stopping process is being performed | false
stopped | The Training Job was successfully stopped. If it was stopped while its status was scheduling, building, or enqueued, no GPU resources were allocated; if it was stopped after the status changed to in progress, the hardware resources were allocated and eventually released once the Training Job was successfully stopped | true
stop failed | The process of stopping the Training Job failed | true

A Training Job in a "terminal" status can be deleted using the flexai training delete command.