
FlexAI Checkpoints: In Practice

FlexAI uses a directory-based approach to managing Checkpoints, which simplifies saving and loading checkpoint files across a wide range of libraries.

Environment variables are exposed in the Runtime environments to tell your code where to load checkpoints from and where to save them:

Variable | Path | Description
FLEXAI_INPUT_CHECKPOINT_DIR | /input-checkpoint/ | Directory where the selected Checkpoint is mounted
FLEXAI_OUTPUT_CHECKPOINT_DIR | /output-checkpoint/ | Directory where your code should write Checkpoints
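
For example, a job can resolve both locations once at startup (a minimal sketch; the fallback values only mirror the default mount points and are purely illustrative):

resolve_checkpoint_dirs.py
import os
from pathlib import Path

# Resolve the checkpoint directories exposed by the FlexAI Runtime environment.
# The fallbacks mirror the default mount points and are only illustrative.
input_dir = Path(os.environ.get("FLEXAI_INPUT_CHECKPOINT_DIR", "/input-checkpoint"))
output_dir = Path(os.environ.get("FLEXAI_OUTPUT_CHECKPOINT_DIR", "/output-checkpoint"))

print(f"Loading checkpoints from: {input_dir}")
print(f"Writing checkpoints to:   {output_dir}")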

The FlexAI Checkpoint Manager expects your code to write checkpoints during execution to the output directory that the FLEXAI_OUTPUT_CHECKPOINT_DIR environment variable points to. Once they are there, the FlexAI Checkpoint Manager takes care of them:

  • /output-checkpoint/
    • <checkpoint_1_name>/
    • <checkpoint_2_name>/

Each sub-directory inside /output-checkpoint/ is treated as a complete and versioned checkpoint.

This folder-based approach ensures compatibility with a wide range of libraries, especially in the Hugging Face ecosystem. It makes it easy to resume training or launch inference jobs.
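
For instance, saving a Hugging Face model and tokenizer with save_pretrained into a named sub-directory yields exactly this kind of self-contained, versioned checkpoint (a minimal sketch; model, tokenizer, and the step-500 name are placeholders):

save_pretrained_sub_directory.py
import os
from pathlib import Path

output_dir = Path(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"])

# Each named sub-directory becomes one complete, versioned checkpoint.
ckpt_path = output_dir / "step-500"

# `model` and `tokenizer` are assumed to be existing Hugging Face objects.
model.save_pretrained(ckpt_path)
tokenizer.save_pretrained(ckpt_path)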

FlexAI automatically detects and tracks checkpoints emitted through common workflows such as the Hugging Face Transformers Trainer and PyTorch's torch.save, both covered below.

If you’re saving checkpoints at different steps, you should create a new sub-directory for each one:

save_checkpoint_at_different_steps.py
import os
from transformers import Trainer, TrainingArguments

# Write all checkpoints to the FlexAI output checkpoint directory.
checkpoint_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]

training_args = TrainingArguments(
    output_dir=checkpoint_dir,
)

# `custom_model` is assumed to be a model you have already built or loaded.
trainer = Trainer(
    model=custom_model,
    args=training_args,
)

trainer.train()
trainer.save_model()

The Transformers library’s Trainer class is designed so you only need to set the TrainingArguments output_dir argument to /output-checkpoint/. This way, the main checkpoint will be saved to the root of /output-checkpoint/, and subsequent checkpoints will be saved as individual sub-directories.

  • /output-checkpoint/
    • config.json
    • pytorch_model.bin
    • tokenizer.json
    • step-2000/
      • config.json
      • pytorch_model.bin
      • tokenizer.json

Note that it is not uncommon for a Transformers-based framework to write a special Checkpoint to the root of the destination path: typically whatever it considers the “latest” or “best” Checkpoint.
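
With the Trainer, this layout typically comes from periodic checkpointing plus a final save_model() call that writes to the root of output_dir (a sketch; the save frequency is illustrative, and the Trainer names its intermediate sub-directories checkpoint-<step> by default):

periodic_checkpoints.py
import os
from transformers import Trainer, TrainingArguments

checkpoint_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]

training_args = TrainingArguments(
    output_dir=checkpoint_dir,
    save_strategy="steps",  # write an intermediate checkpoint every `save_steps`
    save_steps=2000,
)

# `custom_model` is assumed to be defined, as in the earlier example.
trainer = Trainer(model=custom_model, args=training_args)
trainer.train()

# Writes the final model (and the tokenizer, if one was passed to the Trainer)
# to the root of the output directory.
trainer.save_model()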


The FlexAI Checkpoint Manager also supports two common patterns when using PyTorch's torch.save function to save model checkpoints: the recommended directory-based mode and the legacy flat-file mode.

Create a sub-directory inside /output-checkpoint and save your files there. Everything inside the sub-directory is grouped and tracked as a single checkpoint:

torch_save__directory_based.py
import os
from pathlib import Path
import torch

output_dir = Path(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"])

# Everything written into this sub-directory is tracked as a single checkpoint.
ckpt_path = output_dir / "step-1000"
ckpt_path.mkdir(parents=True, exist_ok=True)

# `model` is assumed to be an existing torch.nn.Module.
torch.save(model.state_dict(), ckpt_path / "model.pt")

You can also write files directly into /output-checkpoint. In this case, each file is treated as its own checkpoint. This works, but may lead to ambiguity if other folder-based checkpoints are present, so it is discouraged.

torch_save__flat_file.py
import os
from pathlib import Path
import torch

output_dir = Path(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"])

# A single file written to the root of the output directory is tracked as its own checkpoint.
ckpt_path = output_dir / "step-1000.pt"
torch.save(model.state_dict(), ckpt_path)

When building the runtime environment for a Training or Fine-tuning workload, FlexAI mounts the Checkpoint selected during resource creation under the path indicated by the FLEXAI_INPUT_CHECKPOINT_DIR environment variable.


Adding a Checkpoint to a Fine-tuning workload

In the Start a new Fine-tuning Job form of the FlexAI Console, you can select a Checkpoint from the Checkpoint menu.


Assuming you picked a Checkpoint named mistral7b_01_base, the directory structure would look like this:

  • /input-checkpoint/
    • mistral7b_01_base/
      • config.json
      • pytorch_model.bin
      • tokenizer.json

Your code should load the Checkpoint directly from FLEXAI_INPUT_CHECKPOINT_DIR/<checkpoint_name>:

load_checkpoint.py
import os
from transformers import AutoModelForCausalLM

# The selected Checkpoint is mounted under FLEXAI_INPUT_CHECKPOINT_DIR/<checkpoint_name>.
ckpt_input = os.environ["FLEXAI_INPUT_CHECKPOINT_DIR"]
ckpt_input_dir = os.path.join(ckpt_input, "mistral7b_01_base")
model = AutoModelForCausalLM.from_pretrained(ckpt_input_dir)

Now you can continue your Training Job from the loaded Checkpoint or Fine-tune it further.
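
For example, the loaded model can be handed straight to a Trainer that writes its new checkpoints back to the output directory (a minimal sketch; train_dataset and the training options are placeholders):

continue_finetuning.py
import os
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Load the mounted input Checkpoint.
ckpt_input = os.environ["FLEXAI_INPUT_CHECKPOINT_DIR"]
model = AutoModelForCausalLM.from_pretrained(os.path.join(ckpt_input, "mistral7b_01_base"))

# Write the checkpoints produced by this run to the output directory.
training_args = TrainingArguments(
    output_dir=os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"],
)

# `train_dataset` is assumed to be a dataset you have already prepared.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()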