Environment Variables
Environment variables are exposed in the Runtime environments to help you identify where to save and load checkpoints from.

| Variable | Path | Description |
|---|---|---|
| `FLEXAI_INPUT_CHECKPOINT_DIR` | `/input-checkpoint/` | Directory where the selected Checkpoint is mounted |
| `FLEXAI_OUTPUT_CHECKPOINT_DIR` | `/output-checkpoint/` | Directory where your code should write Checkpoints |
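For example, your code can read both paths from the environment:

```python
import os

input_dir = os.environ["FLEXAI_INPUT_CHECKPOINT_DIR"]    # /input-checkpoint/
output_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]  # /output-checkpoint/
```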
Checkpoint Creation & Management
The FlexAI Checkpoint Manager expects your code to write checkpoints during execution to the output directory the `FLEXAI_OUTPUT_CHECKPOINT_DIR` environment variable points to. Once there, the FlexAI Checkpoint Manager takes care of them:

Each sub-directory (or top-level file) written to `/output-checkpoint/` is treated as a complete and versioned checkpoint.
This folder-based approach ensures compatibility with a wide range of libraries, especially in the Hugging Face ecosystem. It makes it easy to resume training or launch inference jobs.
Supported Libraries
FlexAI automatically detects and tracks checkpoints emitted through:

- Hugging Face `safetensors`
- Hugging Face `accelerate`
- Hugging Face `transformers` (e.g., `Trainer.save_model`)
Using Hugging Face Transformers
If you’re saving checkpoints at different steps, you should create a new sub-directory for each one:

`save_checkpoint_at_different_steps.py`
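A minimal sketch, assuming a Hugging Face model (`gpt2` is a stand-in) and a hypothetical save interval; the actual training step is elided:

```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in model for illustration
SAVE_EVERY = 500     # hypothetical save interval
TOTAL_STEPS = 2000   # hypothetical training length

output_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

for step in range(1, TOTAL_STEPS + 1):
    # ... training step elided ...
    if step % SAVE_EVERY == 0:
        # One sub-directory per step: each is tracked as its own
        # complete, versioned checkpoint.
        checkpoint_dir = os.path.join(output_dir, f"checkpoint-{step}")
        model.save_pretrained(checkpoint_dir)
        tokenizer.save_pretrained(checkpoint_dir)
```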
The `Trainer` class is designed so that you only need to set the `TrainingArguments` `output_dir` argument to `/output-checkpoint/`. This way, the main checkpoint will be saved to the root of `/output-checkpoint/`, and subsequent checkpoints will be saved as individual sub-directories.
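A minimal sketch, assuming `model` and `train_dataset` are defined elsewhere; the save interval is hypothetical:

```python
import os

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    # Point output_dir at the directory the Checkpoint Manager watches
    output_dir=os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"],
    save_strategy="steps",
    save_steps=500,  # hypothetical interval
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()

# Intermediate checkpoints land in checkpoint-<step> sub-directories during
# training; save_model() writes the final model to the root of output_dir.
trainer.save_model()
```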
File Structure
Note that it is not uncommon for a Transformers-based framework to write a special Checkpoint to the root of the destination path, typically the one it considers the “latest” or “best” Checkpoint.
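For illustration, after a `Trainer` run that saved at steps 500 and 1000, the output directory might look like this (directory and file names are examples):

```
/output-checkpoint/
├── config.json              # "latest"/"best" checkpoint written to the root
├── model.safetensors
├── checkpoint-500/
│   ├── config.json
│   └── model.safetensors
└── checkpoint-1000/
    ├── config.json
    └── model.safetensors
```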
Using PyTorch’s `torch.save` function
The FlexAI Checkpoint Manager also supports two common patterns when using PyTorch’s `torch.save` function to save model checkpoints: the legacy flat-file mode and the recommended directory-based mode.
Directory-based mode (recommended)
Create a sub-directory inside `/output-checkpoint/` and save your files there. Everything inside the sub-directory is grouped and tracked as a single checkpoint:
`torch_save__directory_based.py`
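A minimal sketch, using a stand-in model and hypothetical file names:

```python
import os

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in model for illustration

output_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]

# Group related files under one sub-directory; everything inside it is
# tracked together as a single checkpoint.
checkpoint_dir = os.path.join(output_dir, "checkpoint-1000")
os.makedirs(checkpoint_dir, exist_ok=True)

torch.save(model.state_dict(), os.path.join(checkpoint_dir, "model.pt"))
torch.save({"step": 1000}, os.path.join(checkpoint_dir, "trainer_state.pt"))
```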
Flat file mode (legacy)
You can also write files directly into `/output-checkpoint/`. In this case, each file is treated as its own checkpoint. This works, but may lead to ambiguity if other folder-based checkpoints are present, so it is discouraged.
`torch_save__flat_file.py`
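A minimal sketch of the same save written as a flat file (file name hypothetical):

```python
import os

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in model for illustration

output_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]

# Each file written to the root of /output-checkpoint/ is tracked as
# its own checkpoint.
torch.save(model.state_dict(), os.path.join(output_dir, "model_step_1000.pt"))
```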
Checkpoint Loading
Input Checkpoint
When building the runtime environment of a Training or Fine-tuning workload, FlexAI mounts the Checkpoint selected during the resource creation process under the path specified by the `FLEXAI_INPUT_CHECKPOINT_DIR` environment variable:
Adding a Checkpoint to a Fine-tuning workload
You can attach a Checkpoint using either the FlexAI Console or the FlexAI CLI. In the Start a new Fine-tuning Job form of the FlexAI Console, you can select a Checkpoint from the Checkpoint menu.
For example, if the selected Checkpoint is named `mistral7b_01_base`, the directory structure would look like this (file contents are illustrative):
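```
/input-checkpoint/
└── mistral7b_01_base/
    ├── config.json
    └── model.safetensors
```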
Loading a Checkpoint
Your code should load the Checkpoint directly from `FLEXAI_INPUT_CHECKPOINT_DIR/<checkpoint_name>`:
`load_checkpoint.py`
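A minimal sketch, assuming the Checkpoint from the example above (`mistral7b_01_base`) is in Hugging Face format:

```python
import os

from transformers import AutoModelForCausalLM

# <checkpoint_name> is the name chosen when the Checkpoint was created;
# "mistral7b_01_base" reuses the example above.
checkpoint_path = os.path.join(
    os.environ["FLEXAI_INPUT_CHECKPOINT_DIR"], "mistral7b_01_base"
)

model = AutoModelForCausalLM.from_pretrained(checkpoint_path)
```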