
FlexAI Checkpoints: In Practice

FlexAI uses a directory-based approach to managing Checkpoints, which simplifies saving and loading checkpoint files across a wide range of libraries.

Environment variables are exposed in the Runtime environments to tell your code where to load checkpoints from and where to save them:

Variable | Path | Description
FLEXAI_INPUT_CHECKPOINT_DIR | /input-checkpoint/ | Directory where the selected Checkpoint is mounted
FLEXAI_OUTPUT_CHECKPOINT_DIR | /output-checkpoint/ | Directory where your code should write Checkpoints
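
For example, a job can resolve both locations once at startup (a minimal sketch; the fallback values only mirror the default mount points and are purely illustrative):

resolve_checkpoint_dirs.py
import os
from pathlib import Path

# Resolve the checkpoint directories exposed by the FlexAI Runtime environment.
# The fallbacks mirror the default mount points and are only illustrative.
input_dir = Path(os.environ.get("FLEXAI_INPUT_CHECKPOINT_DIR", "/input-checkpoint"))
output_dir = Path(os.environ.get("FLEXAI_OUTPUT_CHECKPOINT_DIR", "/output-checkpoint"))

print(f"Loading checkpoints from: {input_dir}")
print(f"Writing checkpoints to:   {output_dir}")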

The FlexAI Checkpoint Manager expects your code to write checkpoints during execution to the output directory that the FLEXAI_OUTPUT_CHECKPOINT_DIR environment variable points to. Once they are there, the FlexAI Checkpoint Manager takes care of them:

  • /output-checkpoint/
    • <checkpoint_1_name>/
    • <checkpoint_2_name>/

Each sub-directory inside /output-checkpoint/ is treated as a complete and versioned checkpoint.

This folder-based approach ensures compatibility with a wide range of libraries, especially in the Hugging Face ecosystem. It makes it easy to resume training or launch inference jobs.
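
For instance, saving a Hugging Face model and tokenizer with save_pretrained into a named sub-directory yields exactly this kind of self-contained, versioned checkpoint (a minimal sketch; model, tokenizer, and the step-500 name are placeholders):

save_pretrained_sub_directory.py
import os
from pathlib import Path

output_dir = Path(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"])

# Each named sub-directory becomes one complete, versioned checkpoint.
ckpt_path = output_dir / "step-500"

# `model` and `tokenizer` are assumed to be existing Hugging Face objects.
model.save_pretrained(ckpt_path)
tokenizer.save_pretrained(ckpt_path)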

FlexAI automatically detects and tracks checkpoints emitted through common workflows such as the Hugging Face Transformers Trainer and PyTorch's torch.save, both covered below.

If you’re saving checkpoints at different steps, you should create a new sub-directory for each one:

save_checkpoint_at_different_steps.py
import os
from transformers import Trainer, TrainingArguments

# Write all checkpoints to the FlexAI output checkpoint directory.
checkpoint_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]

training_args = TrainingArguments(
    output_dir=checkpoint_dir,
)

# `custom_model` is assumed to be a model you have already built or loaded.
trainer = Trainer(
    model=custom_model,
    args=training_args,
)

trainer.train()
trainer.save_model()

The Transformers library’s Trainer class is designed so you only need to set the TrainingArguments output_dir argument to /output-checkpoint/. This way, the main checkpoint will be saved to the root of /output-checkpoint/, and subsequent checkpoints will be saved as individual sub-directories.

  • /output-checkpoint/
    • config.json
    • pytorch_model.bin
    • tokenizer.json
    • step-2000/
      • config.json
      • pytorch_model.bin
      • tokenizer.json

Note that it is not uncommon for a Transformers-based framework to write a special Checkpoint to the root of the destination path: typically whatever it considers the “latest” or “best” Checkpoint.
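
With the Trainer, this layout typically comes from periodic checkpointing plus a final save_model() call that writes to the root of output_dir (a sketch; the save frequency is illustrative, and the Trainer names its intermediate sub-directories checkpoint-<step> by default):

periodic_checkpoints.py
import os
from transformers import Trainer, TrainingArguments

checkpoint_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]

training_args = TrainingArguments(
    output_dir=checkpoint_dir,
    save_strategy="steps",  # write an intermediate checkpoint every `save_steps`
    save_steps=2000,
)

# `custom_model` is assumed to be defined, as in the earlier example.
trainer = Trainer(model=custom_model, args=training_args)
trainer.train()

# Writes the final model (and the tokenizer, if one was passed to the Trainer)
# to the root of the output directory.
trainer.save_model()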


The FlexAI Checkpoint Manager also supports two common patterns when using PyTorch's torch.save function to save model checkpoints: the recommended directory-based mode and the legacy flat-file mode.

Create a sub-directory inside /output-checkpoint and save your files there. Everything inside the sub-directory is grouped and tracked as a single checkpoint:

torch_save__directory_based.py
import os
from pathlib import Path
import torch

output_dir = Path(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"])

# Everything written into this sub-directory is tracked as a single checkpoint.
ckpt_path = output_dir / "step-1000"
ckpt_path.mkdir(parents=True, exist_ok=True)

# `model` is assumed to be an existing torch.nn.Module.
torch.save(model.state_dict(), ckpt_path / "model.pt")

You can also write files directly into /output-checkpoint. In this case, each file is treated as its own checkpoint. This works, but may lead to ambiguity if other folder-based checkpoints are present, so it is discouraged.

torch_save__flat_file.py
import os
from pathlib import Path
import torch

output_dir = Path(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"])

# A single file written to the root of the output directory is tracked as its own checkpoint.
ckpt_path = output_dir / "step-1000.pt"
torch.save(model.state_dict(), ckpt_path)

When building the runtime environment for a Training or Fine-tuning workload, FlexAI mounts the Checkpoint selected during resource creation under the path indicated by the FLEXAI_INPUT_CHECKPOINT_DIR environment variable.


Adding a Checkpoint to a Fine-tuning workload

In the Start a new Fine-tuning Job form of the FlexAI Console, you can select a Checkpoint from the Checkpoint menu.


Assuming you picked a Checkpoint named mistral7b_01_base, the directory structure would look like this:

  • /input-checkpoint/
    • mistral7b_01_base/
      • config.json
      • pytorch_model.bin
      • tokenizer.json

Your code should load the Checkpoint directly from FLEXAI_INPUT_CHECKPOINT_DIR/<checkpoint_name>:

load_checkpoint.py
import os
from transformers import AutoModelForCausalLM

# The selected Checkpoint is mounted under FLEXAI_INPUT_CHECKPOINT_DIR/<checkpoint_name>.
ckpt_input = os.environ["FLEXAI_INPUT_CHECKPOINT_DIR"]
ckpt_input_dir = os.path.join(ckpt_input, "mistral7b_01_base")
model = AutoModelForCausalLM.from_pretrained(ckpt_input_dir)

Now you can continue your Training Job from the loaded Checkpoint or Fine-tune it further.
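
For example, the loaded model can be handed straight to a Trainer that writes its new checkpoints back to the output directory (a minimal sketch; train_dataset and the training options are placeholders):

continue_finetuning.py
import os
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Load the mounted input Checkpoint.
ckpt_input = os.environ["FLEXAI_INPUT_CHECKPOINT_DIR"]
model = AutoModelForCausalLM.from_pretrained(os.path.join(ckpt_input, "mistral7b_01_base"))

# Write the checkpoints produced by this run to the output directory.
training_args = TrainingArguments(
    output_dir=os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"],
)

# `train_dataset` is assumed to be a dataset you have already prepared.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()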