Getting a Fine-tuning Job's Output

FlexAI Managed Checkpoints


FlexAI’s Managed Checkpoints feature lets you retrieve the final result of your Training Job after it completes, as well as any intermediate checkpoints generated by your Training script. All you need to do is make sure your Training script calls the torch.save() function and writes its output to the path specified by the FLEXAI_OUTPUT_CHECKPOINT_DIR environment variable. FlexAI’s Managed Checkpoints will handle the rest.
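As a minimal sketch of the requirement above, the following Python helper resolves the checkpoint directory from FLEXAI_OUTPUT_CHECKPOINT_DIR and saves a checkpoint there with torch.save(). The function names, the file-name pattern, the dictionary layout, and the local "checkpoints" fallback are illustrative assumptions, not part of FlexAI's API.

```python
import os
from pathlib import Path


def checkpoint_path(step: int) -> Path:
    """Return the file path for the checkpoint written at `step`."""
    # FLEXAI_OUTPUT_CHECKPOINT_DIR is the variable named in the text above.
    # The "checkpoints" fallback is an assumption for local test runs only.
    out_dir = Path(os.environ.get("FLEXAI_OUTPUT_CHECKPOINT_DIR", "checkpoints"))
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme; zero-padding keeps files sorted by step.
    return out_dir / f"ckpt_step_{step:08d}.pt"


def save_checkpoint(model, optimizer, step: int) -> Path:
    """Save model and optimizer state so training can be resumed later."""
    import torch  # imported here so checkpoint_path() works without PyTorch

    path = checkpoint_path(step)
    torch.save(
        {
            "model": model.state_dict(),          # assumed dictionary layout
            "optimizer": optimizer.state_dict(),
            "step": step,
        },
        path,
    )
    return path
```

Calling save_checkpoint(model, optimizer, step) at regular intervals in the training loop would then produce one file per call in the managed directory.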

Once the Training Job is running, every time its code calls the torch.save() function, FlexAI’s Managed Checkpoints feature will automatically capture a Checkpoint and store it in the /output-checkpoint directory. Each Checkpoint is assigned a unique ID, and its creation time is recorded. This means you can pick a specific point in time and retrieve the state of the model at that moment, allowing you to resume training from that point or evaluate the model’s performance on a validation dataset. After a Training Job completes, the last Checkpoint is the one with the most recent creation timestamp.

You can retrieve checkpoints generated by FlexAI’s Managed Checkpoints at any time, which lets you return to an earlier state of the model to resume training, test it, or use it for inference.

Listing Checkpoints

You can list all available checkpoints for a specific Training Job by running the flexai training checkpoints command:

flexai training checkpoints quickstart-training-job

This will return a table with a list of Checkpoint IDs and their corresponding creation timestamps, similar to the following:

 ID                                   │ TIMESTAMP
──────────────────────────────────────┼────────────────────────────────────
 50e5ec69-32b6-e483-9c49-38a73cc34294 │ 2025-06-30 12:42:55.214 +0100 WEST
 82d21263-8ba8-dd73-9c61-732d3b7b0adc │ 2025-06-30 12:43:01.77 +0100 WEST
 32d07a60-61cc-4598-b4f6-2073a4f8d0af │ 2025-06-30 12:43:14.734 +0100 WEST
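If you want to pick the most recent Checkpoint programmatically, a table like the one above can be parsed with a few lines of Python. This is a sketch that assumes the CLI output keeps the two-column layout and timestamp format shown here; the function names are hypothetical.

```python
import uuid
from datetime import datetime


def parse_timestamp(text: str) -> datetime:
    """Parse a timestamp such as '2025-06-30 12:42:55.214 +0100 WEST'."""
    # Keep date, time and UTC offset; the trailing zone name is dropped.
    date, time, offset = text.split()[:3]
    if "." not in time:
        time += ".0"  # assumption: whole-second values may omit the fraction
    return datetime.strptime(f"{date} {time} {offset}", "%Y-%m-%d %H:%M:%S.%f %z")


def parse_checkpoint_table(output: str) -> list[tuple[str, datetime]]:
    """Return (checkpoint_id, created_at) pairs from the table text."""
    rows = []
    for line in output.splitlines():
        if "│" not in line:
            continue  # skips the horizontal rule and blank lines
        checkpoint_id, timestamp = (cell.strip() for cell in line.split("│", 1))
        try:
            uuid.UUID(checkpoint_id)
        except ValueError:
            continue  # skips the header row
        rows.append((checkpoint_id, parse_timestamp(timestamp)))
    return rows


def latest_checkpoint_id(output: str) -> str:
    """Return the ID of the checkpoint with the most recent timestamp."""
    rows = parse_checkpoint_table(output)
    if not rows:
        raise ValueError("no checkpoints found in the given output")
    return max(rows, key=lambda row: row[1])[0]
```

The table text could come from running flexai training checkpoints through Python's subprocess module and reading its standard output; whether the CLI prints the same layout when its output is captured is an assumption you should verify.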

Getting Checkpoints

Using the FlexAI Console

Select the gear icon ⚙️ (labeled Configure) in the Actions field of the Training Jobs list page. This opens a “Details” panel with the Details tab selected by default, showing all the relevant information about your Training Job. Navigate to the Checkpoints tab to view the list of checkpoints created during the Training Job. Each checkpoint entry includes details such as:

- Actions:
  - Download
  - Deploy: If the Checkpoint is an “Inference-ready Checkpoint”, you can deploy it directly as an Inference Endpoint
- Created: Creation date and age
- Training Loss: The reported training loss at the time the checkpoint was created
- Evaluation Loss: The reported evaluation loss at the time the checkpoint was created
- Status: The status of the checkpoint (available or processing)

Using the FlexAI CLI

Once you have the desired Checkpoint ID, you can download it to your host machine using the flexai checkpoint fetch command:

flexai checkpoint fetch 32d07a60-61cc-4598-b4f6-2073a4f8d0af

Writing in:  /home/diego/ckpt.pt
Progress: 0.4% (1.31 MB / 343.79 MB)
// ...
Progress: 100% (343.79 MB / 343.79 MB)

You can use this checkpoint file to resume training from the exact point it was saved, or to evaluate the model’s performance on a validation dataset.
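Resuming from a downloaded file could look roughly like the sketch below. It assumes the checkpoint was saved as a dictionary with "model", "optimizer", and "step" keys; that layout and the function names are hypothetical, so adapt them to whatever your training script actually passes to torch.save().

```python
from pathlib import Path


def restore_training_state(checkpoint: dict, model, optimizer=None) -> int:
    """Apply a loaded checkpoint dict and return the step to resume from."""
    # Assumed keys: "model", "optimizer", "step" (not defined by FlexAI).
    model.load_state_dict(checkpoint["model"])
    if optimizer is not None and "optimizer" in checkpoint:
        optimizer.load_state_dict(checkpoint["optimizer"])
    return int(checkpoint.get("step", 0))


def resume_from_file(path, model, optimizer=None) -> int:
    """Load a checkpoint file from disk and restore the training state."""
    import torch  # imported here so the helper above works without PyTorch

    # map_location="cpu" lets the file load on machines without a GPU.
    checkpoint = torch.load(Path(path), map_location="cpu")
    return restore_training_state(checkpoint, model, optimizer)
```

For evaluation only, you can call resume_from_file(path, model) without an optimizer and then switch the model to evaluation mode with model.eval().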

Use the flexai checkpoint export command to export a Checkpoint to a remote location through a previously registered Remote Storage Connection, such as S3, GCS, MinIO, or R2. This lets you keep your checkpoints in a more permanent location for later use.

Workload Raw Outputs

Using the FlexAI Console

This feature is not available on the FlexAI Console.

Using the FlexAI CLI

Any data written to the /output directory will be compressed into a zip file and made available to you via the flexai training fetch command:

flexai training fetch quickstart-training-job

This will download a .zip file to the current working directory on your host machine. Once extracted, you’ll get a local directory named output, which contains any files written to the /output directory by the training scripts.
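The extraction step can also be scripted. The sketch below uses Python's standard zipfile module; the function name is hypothetical, and the archive path is passed in because the name of the downloaded file is not specified here.

```python
import zipfile
from pathlib import Path


def extract_outputs(zip_path, dest_dir=".") -> list[Path]:
    """Extract the fetched archive and return the extracted file paths."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)  # creates missing directories as needed
        names = archive.namelist()
    # Directory entries end with "/"; keep regular files only.
    return sorted(dest / name for name in names if not name.endswith("/"))
```

Passing the path of the downloaded archive returns every extracted file, which makes it easy to check that the expected artifacts are present.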

For this quickstart example we configured the training code to write to the /output-checkpoint directory when we set the value of --out_dir, so the /output directory won’t be used. However, you can run another Training Job that writes to the /output directory instead. In that case, only the last checkpoint will be saved in the /output directory as a result of the training process. Note that your code can use both locations (/output and /output-checkpoint) simultaneously if needed.
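To illustrate using both locations, a training script could keep saving checkpoints to the managed checkpoint directory while writing other artifacts, such as a metrics summary, to /output. The helper below is a sketch; the function name and the metrics.json file name are hypothetical.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict, output_dir: str = "/output") -> Path:
    """Write a metrics summary to the raw-output directory."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "metrics.json"  # hypothetical file name
    # sort_keys gives a stable file layout across runs.
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return path
```

Files written this way would be included in the zip archive returned by flexai training fetch, while checkpoints remain available through the Managed Checkpoints commands.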
