FlexAI Checkpoints: In Practice
FlexAI uses a directory-based approach for managing Checkpoints, which simplifies the process of saving and loading checkpoint files across a large set of libraries.
Environment Variables
Environment variables are exposed in the Runtime environments to help you identify where to save checkpoints to and where to load them from.
| Variable | Path | Description |
|---|---|---|
| FLEXAI_INPUT_CHECKPOINT_DIR | /input-checkpoint/ | Directory where the selected Checkpoint is mounted |
| FLEXAI_OUTPUT_CHECKPOINT_DIR | /output-checkpoint/ | Directory where your code should write Checkpoints |
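For reference, here is a minimal sketch of reading these variables from Python. The defensive read of FLEXAI_INPUT_CHECKPOINT_DIR is an assumption for illustration, in case no Checkpoint was attached to the workload:

```python
import os

# Minimal sketch: read the FlexAI checkpoint directories from the environment.
# FLEXAI_OUTPUT_CHECKPOINT_DIR is where your code should write checkpoints;
# FLEXAI_INPUT_CHECKPOINT_DIR is read with a fallback (assumption for
# illustration) in case no Checkpoint was attached to the workload.
output_ckpt_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]
input_ckpt_dir = os.environ.get("FLEXAI_INPUT_CHECKPOINT_DIR")

print(f"Writing checkpoints to: {output_ckpt_dir}")
print(f"Reading checkpoints from: {input_ckpt_dir}")
```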
Checkpoint Creation & Management
The FlexAI Checkpoint Manager expects your code to write checkpoints during execution to the output directory that the FLEXAI_OUTPUT_CHECKPOINT_DIR environment variable points to. Once they are there, the FlexAI Checkpoint Manager takes care of them:
- /output-checkpoint/
  - <checkpoint_1_name>/
    - …
  - <checkpoint_2_name>/
    - …
Each sub-directory inside /output-checkpoint/ is treated as a complete and versioned checkpoint.
This folder-based approach ensures compatibility with a wide range of libraries, especially in the Hugging Face ecosystem. It makes it easy to resume training or launch inference jobs.
Supported Libraries
FlexAI automatically detects and tracks checkpoints emitted through:
- Hugging Face safetensors
- Hugging Face accelerate
- Hugging Face transformers (e.g., Trainer.save_model)
- Any other library built on top of the Hugging Face transformers ecosystem
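For example, here is a minimal sketch of writing a folder-based checkpoint with accelerate. The tiny model, optimizer, and the step-1000 sub-directory name are illustrative assumptions, not part of the FlexAI API:

```python
import os
from pathlib import Path

import torch
from accelerate import Accelerator

# Illustrative stand-ins; replace with your real model and optimizer.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

accelerator = Accelerator()
model, optimizer = accelerator.prepare(model, optimizer)

# ... training steps ...

# Write the full training state (model, optimizer, RNG, ...) into one
# sub-directory of the output directory so it is tracked as a single,
# versioned checkpoint.
ckpt_dir = Path(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]) / "step-1000"
accelerator.save_state(output_dir=str(ckpt_dir))
```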
Using Hugging Face Transformers
If you’re saving checkpoints at different steps, you should create a new sub-directory for each one:
```python
import os
from transformers import Trainer, TrainingArguments

checkpoint_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]

training_args = TrainingArguments(
    output_dir=checkpoint_dir,
)

trainer = Trainer(
    model=custom_model,
    args=training_args,
)
trainer.train()
trainer.save_model()
```

The Transformers library’s Trainer class is designed so that you only need to set the TrainingArguments output_dir argument to /output-checkpoint/. This way, the main checkpoint is saved to the root of /output-checkpoint/, and subsequent checkpoints are saved as individual sub-directories.
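You can control how often Trainer writes these sub-directories through TrainingArguments. A sketch, with illustrative values (Trainer chooses the sub-directory names itself):

```python
import os

from transformers import TrainingArguments

# Sketch: write a checkpoint sub-directory every 2000 optimizer steps and
# keep only the two most recent ones.
training_args = TrainingArguments(
    output_dir=os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"],
    save_strategy="steps",
    save_steps=2000,
    save_total_limit=2,
)
```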
File Structure
- /output-checkpoint/
  - config.json
  - pytorch_model.bin
  - tokenizer.json
  - step-2000/
    - config.json
    - pytorch_model.bin
    - tokenizer.json
Note that it is not uncommon for a Transformers-based framework to write a special Checkpoint at the root of the destination path, typically what it considers the “latest” or “best” Checkpoint.
Using PyTorch’s torch.save function
The FlexAI Checkpoint Manager also supports two common patterns when using PyTorch’s torch.save function to save model checkpoints: the recommended directory-based mode and the legacy flat file mode.
Directory-based mode (recommended)
Create a sub-directory inside /output-checkpoint and save your files there. Everything inside the sub-directory is grouped and tracked as a single checkpoint:
```python
import os
from pathlib import Path

import torch

output_dir = Path(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"])
ckpt_path = output_dir / "step-1000"
ckpt_path.mkdir(parents=True, exist_ok=True)

torch.save(model.state_dict(), ckpt_path / "model.pt")
```
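If you plan to resume training later, you can save additional state into the same sub-directory so it is versioned together with the weights. A sketch continuing the example above; the optimizer object and the file names are assumptions:

```python
# Continuing the example above: bundle optimizer state and bookkeeping with
# the model weights so the whole sub-directory is one resumable checkpoint.
torch.save(optimizer.state_dict(), ckpt_path / "optimizer.pt")
torch.save({"step": 1000}, ckpt_path / "training_state.pt")
```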
Flat file mode (legacy)
You can also write files directly into /output-checkpoint. In this case, each file is treated as its own checkpoint. This works, but may lead to ambiguity if other folder-based checkpoints are present, so it is discouraged.
```python
import os
from pathlib import Path

import torch

output_dir = Path(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"])
ckpt_path = output_dir / "step-1000"

torch.save(model.state_dict(), ckpt_path)
```
Checkpoint Loading
Input Checkpoint
When building a Training or Fine-tuning workload’s runtime environment, FlexAI mounts the Checkpoint selected during resource creation under the path given by the FLEXAI_INPUT_CHECKPOINT_DIR environment variable:
Adding a Checkpoint to a Fine-tuning workload
In the Start a new Fine-tuning Job form of the FlexAI Console, you can select a Checkpoint from the Checkpoint menu.
When creating a Fine-tuning workload using the FlexAI CLI, you can specify a Checkpoint by passing its UUID or name to the -C/--checkpoint flag:
```bash
flexai training run <fine_tuning_job_name> \
  --checkpoint <checkpoint_name_or_uuid> \
  ...
```

Example:

```bash
flexai training run ft_mistral7b_01 \
  --dataset FreedomIntelligence__medical-o1-reasoning-SFT=medical \
  --repository-url https://github.com/funnierinspanish/Mistral-7b_01-med-ft \
  --repository-revision new_med_settings \
  --checkpoint=mistral7b_01_base \
  --secret HF_TOKEN=hf_token \
  --accels 4 \
  -- fine-tune.py mistral_config --dataset=medical
```

Assuming you picked a Checkpoint named mistral7b_01_base, the directory structure would look like this:
- /input-checkpoint/
  - mistral7b_01_base/
    - config.json
    - pytorch_model.bin
    - tokenizer.json
Loading a Checkpoint
Your code should load the Checkpoint directly from FLEXAI_INPUT_CHECKPOINT_DIR/<checkpoint_name>:
```python
import os

from transformers import AutoModelForCausalLM

ckpt_input = os.environ["FLEXAI_INPUT_CHECKPOINT_DIR"]
ckpt_input_dir = os.path.join(ckpt_input, "mistral7b_01_base")

model = AutoModelForCausalLM.from_pretrained(ckpt_input_dir)
```

Now you can continue your Training Job from the loaded Checkpoint or Fine-tune it further.
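For instance, here is a minimal sketch of fine-tuning from the mounted Checkpoint while writing new checkpoints back to the output directory. The train_dataset object and the use of a tokenizer are assumptions for illustration:

```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Load the mounted Checkpoint...
ckpt_input_dir = os.path.join(os.environ["FLEXAI_INPUT_CHECKPOINT_DIR"], "mistral7b_01_base")
model = AutoModelForCausalLM.from_pretrained(ckpt_input_dir)
tokenizer = AutoTokenizer.from_pretrained(ckpt_input_dir)

# ...and write the resulting checkpoints where the Checkpoint Manager expects them.
training_args = TrainingArguments(output_dir=os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"])

# train_dataset: your tokenized dataset (assumed to exist).
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
trainer.save_model()
```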