training checkpoints

Lists the Checkpoints that have been generated for a Training Job.

Checkpoints are generated by the FlexAI runtime when a Training script’s code calls the torch.save() function.

Terminal window
flexai training checkpoints <training_job_name>
<training_job_name>
Required

The name of the Training Job to list checkpoints for.

Examples
  • gpt2training-1
--json
<boolean>
Optional
Flag

Output the information in JSON format.

Examples
  • --json
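
JSON is usually easier for scripts to consume than the table, so it can be handy to capture the listing in a file. The following is a minimal shell sketch and not part of the FlexAI CLI: the function name save_checkpoints_json and the default file name checkpoints.json are hypothetical, and it assumes the --json flag is accepted after the Training Job name.

```shell
#!/bin/sh
# Write the JSON listing of a Training Job's checkpoints to a file.
# Usage: save_checkpoints_json <training_job_name> [output_file]
# Hypothetical helper: neither the function nor the default file name
# comes from the FlexAI CLI.
save_checkpoints_json() {
  job="$1"
  out="${2:-checkpoints.json}"
  if [ -z "$job" ]; then
    echo "usage: save_checkpoints_json <training_job_name> [output_file]" >&2
    return 2
  fi
  # Assumption: --json may follow the positional argument.
  flexai training checkpoints "$job" --json > "$out"
}
```

For example, `save_checkpoints_json gpt2training-1 gpt2-checkpoints.json` would write the listing to gpt2-checkpoints.json.
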
Terminal window
flexai training checkpoints gpt2training-1

This command outputs:

 ID                                   │ NAME          │ NODE │ STEP │ TRAIN LOSS │ EVAL LOSS │ MODEL           │ VERSION │ INFERENCE READY │ TIMESTAMP
──────────────────────────────────────┼───────────────┼──────┼──────┼────────────┼───────────┼─────────────────┼─────────┼─────────────────┼───────────────────────────────────
 ce4d8def-b6bf-4cc6-8067-de9d312a82c5 │ checkpoint-50 │ 0    │ 50   │ 3.3707     │ 3.1356    │ GPT2LMHeadModel │ 4.53.2  │ true            │ 2025-07-15 09:30:10.549 +0000 UTC
 cabf2516-59f5-4ce7-8d14-220a8ca57ba7 │ checkpoint-99 │ 0    │ 90   │ 3.2857     │ 3.1356    │ GPT2LMHeadModel │ 4.53.2  │ true            │ 2025-07-15 09:30:51.46 +0000 UTC
 c26d1452-6d41-4b23-805e-30eeeadca729 │               │ 0    │ 99   │ 3.2857     │ 3.1356    │ GPT2LMHeadModel │ 4.53.2  │ true            │ 2025-07-15 09:30:51.459 +0000 UTC
 Column          │ Description
─────────────────┼────────────────────────────────────────────────────────────────────
 ID              │ The unique identifier of the checkpoint.
 NAME            │ The human-readable name of the checkpoint.
 NODE            │ The Node where the checkpoint was created.
 STEP            │ The training step at which the checkpoint was created.
 TRAIN LOSS      │ The training loss at the time the checkpoint was created.
 EVAL LOSS       │ The evaluation loss at the time the checkpoint was created.
 MODEL           │ The name of the base model used in the checkpoint.
 VERSION         │ The version of the model used in the checkpoint, such as 4.53.2.
 INFERENCE READY │ Indicates whether the checkpoint is ready for inference, meaning it includes all the files and metadata required for inference tasks.
 TIMESTAMP       │ The ISO 8601 formatted timestamp of when the checkpoint was created.
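
The columns above can also be read by a script. As a rough sketch, assuming the table is printed with the "│" box-drawing character between columns and in the column order listed above, the hypothetical helper below prints the ID of the inference-ready checkpoint with the highest STEP. The function name is illustrative only. For anything long-lived, the --json output is likely a sturdier input than the table.

```shell
#!/bin/sh
# Print the ID of the inference-ready checkpoint with the highest STEP.
# Reads the checkpoints table on stdin. Hypothetical helper, not part of
# the FlexAI CLI. Assumes 10 columns separated by the box-drawing bar.
latest_ready_checkpoint_id() {
  awk -F'│' '
    BEGIN { max = -1 }
    # Lines with fewer than 10 fields (such as the rule line) are skipped.
    NF >= 10 {
      id = $1; step = $4; ready = $9
      # Trim the padding around each cell.
      gsub(/^[ \t]+|[ \t]+$/, "", id)
      gsub(/^[ \t]+|[ \t]+$/, "", step)
      gsub(/^[ \t]+|[ \t]+$/, "", ready)
      # The header row is skipped because its STEP field is not numeric.
      if (step ~ /^[0-9]+$/ && ready == "true" && step + 0 > max) {
        max = step + 0
        best = id
      }
    }
    END { if (best != "") print best }
  '
}
```

It would be used as `flexai training checkpoints gpt2training-1 | latest_ready_checkpoint_id`, and prints nothing when no row qualifies.
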