Getting a Fine-tuning Job's Output
FlexAI Managed Checkpoints
FlexAI’s Managed Checkpoints feature lets you retrieve the final result of your Training Job after it completes, as well as any intermediate checkpoints generated by your Training script.
All you need to do is make sure your Training script calls the torch.save() function and writes its output to the path specified by the FLEXAI_OUTPUT_CHECKPOINT_DIR environment variable. FlexAI’s Managed Checkpoints will handle the rest.
Once the Training Job is running, every time its code calls the torch.save() function, FlexAI’s Managed Checkpoints feature will automatically capture a Checkpoint and store it in the /output-checkpoint directory.
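As a minimal sketch of what this can look like inside a Training script, the helper below resolves the checkpoint directory from FLEXAI_OUTPUT_CHECKPOINT_DIR and saves a state dictionary there. The helper name save_checkpoint, the file naming pattern, the save_fn parameter, and the fallback to /output-checkpoint when the variable is unset are assumptions made for illustration, not part of FlexAI's API.

```python
import os
from pathlib import Path
from typing import Any, Callable, Optional


def save_checkpoint(
    state: dict,
    step: int,
    save_fn: Optional[Callable[[Any, Path], None]] = None,
) -> Path:
    """Save `state` into the managed checkpoint directory and return the file path."""
    # Assumption: fall back to /output-checkpoint if the variable is not set.
    ckpt_dir = Path(os.environ.get("FLEXAI_OUTPUT_CHECKPOINT_DIR", "/output-checkpoint"))
    ckpt_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical file naming scheme; any file name inside the directory would do.
    path = ckpt_dir / f"checkpoint-step-{step}.pt"

    if save_fn is None:
        # Imported lazily so the helper can be exercised without PyTorch installed.
        import torch
        save_fn = torch.save

    save_fn(state, path)
    return path
```

In a training loop this could be called as save_checkpoint({"step": step, "model": model.state_dict()}, step), where model is your own torch.nn.Module.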
Each Checkpoint will be assigned a unique ID and its creation time will be recorded.
This means that you can go to a specific point in time and retrieve the state of the model at that moment, allowing you to resume training from that point or evaluate the model’s performance on a validation dataset.
After a Training Job completes, the last Checkpoint will be the one with the most recent creation timestamp.
You can retrieve any Checkpoint generated by FlexAI’s Managed Checkpoints at any time, whether to resume training from it, to test your model, or to use it for inference.
Listing Checkpoints
You can list all available checkpoints for a specific Training Job by running the flexai training checkpoints command:
flexai training checkpoints quickstart-training-job
This will return a table with a list of Checkpoint IDs and their corresponding creation timestamps, similar to the following:
 ID                                   │ TIMESTAMP
──────────────────────────────────────┼────────────────────────────────────
 50e5ec69-32b6-e483-9c49-38a73cc34294 │ 2025-06-30 12:42:55.214 +0100 WEST
 82d21263-8ba8-dd73-9c61-732d3b7b0adc │ 2025-06-30 12:43:01.77 +0100 WEST
 32d07a60-61cc-4598-b4f6-2073a4f8d0af │ 2025-06-30 12:43:14.734 +0100 WEST
Getting Checkpoints
You can select the gear icon ⚙️ (labeled as Configure) in the Actions field of the Training Jobs list page. This will open a “Details” panel. The Details tab will be selected by default, showing all the relevant information about your Training Job.
Navigate to the Checkpoints tab to view the list of checkpoints created during the Training Job. Each checkpoint entry includes details such as:
- Actions:
  - Download
  - Deploy: If the Checkpoint is an “Inference-ready Checkpoint”, you can deploy it directly as an Inference Endpoint
- Created:
  - Creation date and age
- Training Loss: The reported training loss at the time the checkpoint was created
- Evaluation Loss: The reported evaluation loss at the time the checkpoint was created
- Status: The status of the checkpoint (available or processing)
Once you have the desired Checkpoint ID, you can download the Checkpoint to your host machine using the flexai checkpoint fetch command:
flexai checkpoint fetch 32d07a60-61cc-4598-b4f6-2073a4f8d0af
Writing in: /home/diego/ckpt.pt
Progress: 0.4% (1.31 MB / 343.79 MB)
// ...
Progress: 100% (343.79 MB / 343.79 MB)
You can use this checkpoint file to resume training from the exact point it was saved, or to evaluate the model’s performance on a validation dataset.
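If you want to script the selection of the last Checkpoint, one possible approach is to parse the table printed by flexai training checkpoints and pick the row with the most recent timestamp. The sketch below assumes the table is printed to standard output in the format shown earlier (columns separated by │, timestamps followed by a zone abbreviation such as WEST). The function names latest_checkpoint_id and fetch_latest_checkpoint are hypothetical helpers, not part of the FlexAI CLI.

```python
import subprocess
from datetime import datetime


def latest_checkpoint_id(table_text: str) -> str:
    """Return the Checkpoint ID whose row has the most recent timestamp."""
    rows = []
    for line in table_text.splitlines():
        if "│" not in line:
            continue  # skips the ─┼─ separator and any non-table lines
        checkpoint_id, stamp = (part.strip() for part in line.split("│", 1))
        if checkpoint_id == "ID":
            continue  # header row
        # Keep date, time, and UTC offset; drop the zone abbreviation (e.g. WEST).
        date_part, time_part, offset = stamp.split()[:3]
        fmt = "%Y-%m-%d %H:%M:%S.%f %z" if "." in time_part else "%Y-%m-%d %H:%M:%S %z"
        created = datetime.strptime(f"{date_part} {time_part} {offset}", fmt)
        rows.append((created, checkpoint_id))
    if not rows:
        raise ValueError("no checkpoint rows found in the listing")
    return max(rows)[1]


def fetch_latest_checkpoint(job_name: str) -> str:
    """List a job's checkpoints, then fetch the most recent one with the CLI."""
    # Assumption: the listing table is written to stdout.
    listing = subprocess.run(
        ["flexai", "training", "checkpoints", job_name],
        capture_output=True, text=True, check=True,
    ).stdout
    checkpoint_id = latest_checkpoint_id(listing)
    subprocess.run(["flexai", "checkpoint", "fetch", checkpoint_id], check=True)
    return checkpoint_id
```

Calling fetch_latest_checkpoint("quickstart-training-job") would then run the same two commands shown in this section, with the ID chosen by timestamp rather than copied by hand.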
Workload Raw Outputs
Currently, the FlexAI Console does not support this feature. Please refer to the "Using the FlexAI CLI" instructions instead.
Any data written to the /output directory will be compressed into a zip file and made available to you via the flexai training fetch command:
flexai training fetch quickstart-training-job
This will download a .zip file to the current working directory on your host machine.
Once extracted, you’ll get a local directory named output containing any files written to the /output directory by the training scripts.
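The extraction step can also be done from Python with the standard zipfile module. This is a small sketch, assuming the archive contains a top-level output directory as described above. The archive file name is not specified here, so zip_path is a placeholder you supply, and extract_outputs is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_outputs(zip_path: str, dest_dir: str = ".") -> list[Path]:
    """Extract the downloaded archive and return the files under output/."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)  # creates dest and subdirectories as needed
    # Assumption: the archive unpacks into a directory named "output".
    output_dir = dest / "output"
    return sorted(p for p in output_dir.rglob("*") if p.is_file())
```

For example, extract_outputs("path/to/downloaded.zip") would unpack the archive into the current directory and list every file your training scripts wrote to /output.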