Command: training
The flexai training command manages Training Jobs: starting a new Training Job, retrieving its logs, inspecting its execution, and downloading its output artifacts, among other tasks.
Available subcommands
flexai training checkpoints
- Lists the checkpoints that were generated by a Training Job.

flexai training debug-ssh
- Establishes an SSH connection to a running Training Job.

flexai training delete
- Deletes a Training Job.

flexai training fetch
- Fetches artifacts from a Training Job.

flexai training inspect
- Displays detailed information about a Training Job.

flexai training list
- Lists all the Training Jobs.

flexai training logs
- Displays the logs from a Training Job.

flexai training run
- Starts a new Training Job.

flexai training stop
- Stops a Training Job.
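
For scripting around the CLI, the subcommand names listed above can be kept in one place and checked before a command line is assembled. The following is a minimal sketch, not part of the FlexAI tooling: the `SUBCOMMANDS` table and the `build_command` helper are hypothetical names, and any extra arguments are passed through unchanged because this page does not document the arguments or flags each subcommand accepts.

```python
# Subcommand names and descriptions transcribed from the list above.
SUBCOMMANDS = {
    "checkpoints": "Lists the checkpoints that were generated by a Training Job.",
    "debug-ssh": "Establishes an SSH connection to a running Training Job.",
    "delete": "Deletes a Training Job.",
    "fetch": "Fetches artifacts from a Training Job.",
    "inspect": "Displays detailed information about a Training Job.",
    "list": "Lists all the Training Jobs.",
    "logs": "Displays the logs from a Training Job.",
    "run": "Starts a new Training Job.",
    "stop": "Stops a Training Job.",
}


def build_command(subcommand, *args):
    """Return an argv list for `flexai training <subcommand>`.

    Hypothetical helper: it only validates the subcommand name. Extra
    arguments are appended as given, since their form is an assumption
    left to the caller.
    """
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown training subcommand: {subcommand!r}")
    return ["flexai", "training", subcommand, *args]
```

The returned list can be handed to `subprocess.run` as is, which avoids shell quoting issues when arguments contain spaces.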
Training Lifecycle Statuses
A Training Job will go through a subset of the following statuses during its lifecycle:
Status | Description | Is Terminal |
---|---|---|
pending | The initial status of a training. It means the Training Job was stored in FCS | false |
scheduling | FCS is looking for a suitable Node to handle the training given its hardware requirements | false |
rejected | A Node that meets the requirements specified for the Training Job could not be found or is unavailable at the moment | true |
building | A Node suitable for the Training Job was found. The building process has started. FCS is gathering all the components required for the training, in particular, cloning the revision specified for the source repository and installing the required dependencies | false |
in progress | The building process completed successfully. The required FCS compute resources are being allocated | false |
enqueued | The required compute resources specified for the Training Job were not available at the time. The Training Job is temporarily put on hold and will be scheduled to start once the required resources are freed up | false |
succeeded | The Training Job completed successfully. The entry point training script terminated with exit code 0. Output artifacts can be downloaded using the training fetch command | true |
failed | A Training Job can fail because of one of the following reasons: | true |
stop in progress | The user requested that the Training Job be stopped, and the stopping process is being performed | false |
stopped | The Training Job was successfully stopped. If the Training Job was stopped while its status was scheduling, building, or enqueued, then no GPU resources were allocated. If it was stopped after the training status changed to building, then the hardware resources were allocated and eventually released after the Training Job was successfully stopped | true |
stop failed | The process of stopping the Training Job failed | true |
A Training Job in a “terminal” status can be deleted using the flexai training delete command.
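
The "Is Terminal" column and the deletion rule above can be expressed as a small lookup. This is a minimal sketch under stated assumptions: the status strings are spelled as they appear in the table (the exact strings printed by the CLI may differ), and `STATUS_IS_TERMINAL`, `is_terminal`, and `can_delete` are hypothetical names introduced for illustration.

```python
# Terminal flag for each status, transcribed from the table above.
STATUS_IS_TERMINAL = {
    "pending": False,
    "scheduling": False,
    "rejected": True,
    "building": False,
    "in progress": False,
    "enqueued": False,
    "succeeded": True,
    "failed": True,
    "stop in progress": False,
    "stopped": True,
    "stop failed": True,
}


def is_terminal(status):
    """Return True if the given status is terminal; raise on unknown input."""
    key = status.strip().lower()
    if key not in STATUS_IS_TERMINAL:
        raise ValueError(f"unknown Training Job status: {status!r}")
    return STATUS_IS_TERMINAL[key]


def can_delete(status):
    """A Training Job can be deleted only once it is in a terminal status."""
    return is_terminal(status)
```

A script that waits for a job to finish could, for example, keep polling until `is_terminal` returns True and only then attempt a delete.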