Command: training

The flexai training command manages Training Jobs: starting a new Training Job, retrieving its logs, inspecting its execution, and downloading its output artifacts, among other tasks.

A Training Job will go through a subset of the following statuses during its lifecycle:

| Status | Description | Is Terminal |
| --- | --- | --- |
| pending | The initial status of a Training Job. It means the Training Job was stored in FCS. | false |
| scheduling | FCS is looking for a suitable Node to handle the training given its hardware requirements. | false |
| rejected | A Node that meets the requirements specified for the Training Job could not be found or is unavailable at the moment. | true |
| building | A Node suitable for the Training Job was found and the building process has started. FCS is gathering all the components required for the training, in particular cloning the revision specified for the source repository and installing the required dependencies. | false |
| in progress | The building process completed successfully. The required FCS compute resources are being allocated. | false |
| enqueued | The compute resources specified for the Training Job were not available at the time. The Training Job is temporarily put on hold and will be scheduled to start once the required resources are freed up. | false |
| succeeded | The Training Job completed successfully: the entry point training script terminated with exit code 0. Output artifacts can be downloaded using the training fetch command. | true |
| failed | A Training Job can fail for one of the following reasons: (1) the --repository-revision could not be found; (2) the requirements.txt file could not be found in the root directory of the repository; (3) the Training Job itself failed, meaning the entry point training script terminated with an exit code above 0; (4) the Training Job’s duration exceeded the time limit (BackoffLimitExceeded). | true |
| stop in progress | The user requested that the Training Job be stopped, and the stopping process is being performed. | false |
| stopped | The Training Job was successfully stopped. If it was stopped while its status was scheduling, building, or enqueued, then no GPU resources were allocated. If it was stopped after the training status changed to building, then the hardware resources were allocated and eventually released after the Training Job was successfully stopped. | true |
| stop failed | The process of stopping the Training Job failed. | true |

A Training Job in a “terminal” status can be deleted using the flexai training delete command.
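
The terminal/non-terminal distinction above can be captured in a small lookup, which may be handy when scripting around the CLI (for example, to decide whether a job is eligible for flexai training delete). This is a minimal sketch: the status names and terminal flags are transcribed from the table, while TRAINING_STATUSES, is_terminal, and can_delete are hypothetical helper names and not part of the flexai CLI or any FlexAI library.

```python
# Status names and terminal flags transcribed from the status table.
# True marks a terminal status. The names below are illustrative only.
TRAINING_STATUSES = {
    "pending": False,
    "scheduling": False,
    "rejected": True,
    "building": False,
    "in progress": False,
    "enqueued": False,
    "succeeded": True,
    "failed": True,
    "stop in progress": False,
    "stopped": True,
    "stop failed": True,
}


def is_terminal(status: str) -> bool:
    """Return True if the given Training Job status is terminal."""
    key = status.strip().lower()
    if key not in TRAINING_STATUSES:
        raise ValueError(f"unknown Training Job status: {status!r}")
    return TRAINING_STATUSES[key]


def can_delete(status: str) -> bool:
    """A Training Job can be deleted only once it is in a terminal status."""
    return is_terminal(status)


if __name__ == "__main__":
    for name in ("building", "succeeded", "stop failed"):
        print(f"{name}: deletable={can_delete(name)}")
```

Raising on an unknown status, rather than returning False, is a deliberate choice in this sketch: it surfaces typos or statuses introduced after this table was written instead of silently treating them as non-terminal.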