# FlexAI Checkpoints: In Practice

> Practical guide to managing checkpoints in training and fine-tuning workflows

FlexAI uses a **directory-based** approach for managing Checkpoints, which simplifies the process of saving and loading checkpoint files across a large set of libraries.

## Environment Variables

Runtime environments expose the following environment variables so your code knows where to load Checkpoints from and where to save them.

| Variable                       | Path                  | Description                                        |
| ------------------------------ | --------------------- | -------------------------------------------------- |
| `FLEXAI_INPUT_CHECKPOINT_DIR`  | `/input-checkpoint/`  | Directory where the selected Checkpoint is mounted |
| `FLEXAI_OUTPUT_CHECKPOINT_DIR` | `/output-checkpoint/` | Directory where your code should write Checkpoints |
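
A minimal sketch of reading both variables from Python. The helper name `checkpoint_dirs` is hypothetical, and falling back to the paths listed in the table when a variable is unset (for example, when running a script locally) is an assumption, not documented FlexAI behavior.

```python
import os
from pathlib import Path

# Assumption: fall back to the paths documented in the table above
# when a variable is not set, e.g. outside a FlexAI runtime.
DEFAULT_INPUT_DIR = "/input-checkpoint/"
DEFAULT_OUTPUT_DIR = "/output-checkpoint/"


def checkpoint_dirs(env=None):
    """Return (input_dir, output_dir) as Path objects."""
    env = os.environ if env is None else env
    input_dir = Path(env.get("FLEXAI_INPUT_CHECKPOINT_DIR", DEFAULT_INPUT_DIR))
    output_dir = Path(env.get("FLEXAI_OUTPUT_CHECKPOINT_DIR", DEFAULT_OUTPUT_DIR))
    return input_dir, output_dir


input_dir, output_dir = checkpoint_dirs()
print(f"load from: {input_dir}")
print(f"write to:  {output_dir}")
```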

## Checkpoint Creation & Management

The FlexAI Checkpoint Manager expects your code to write checkpoints during execution to the directory that the `FLEXAI_OUTPUT_CHECKPOINT_DIR` environment variable points to. Once they are there, the FlexAI Checkpoint Manager takes care of them:

```text theme={null}
/output-checkpoint/
├── <checkpoint_1_name>/
│   └── ...
└── <checkpoint_2_name>/
    └── ...
```

Each sub-directory inside `/output-checkpoint/` is treated as a **complete** and **versioned checkpoint**.

This folder-based approach ensures compatibility with a wide range of libraries, especially in the Hugging Face ecosystem. It makes it easy to resume training or launch inference jobs.
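
To preview which checkpoints a run has written so far, you can list the sub-directories yourself. This is a minimal sketch: `list_checkpoints` is a hypothetical helper, not part of FlexAI, and it only mirrors the one-sub-directory-per-checkpoint rule described above rather than the Checkpoint Manager's actual detection logic.

```python
import os
from pathlib import Path


def list_checkpoints(output_dir):
    """Return the names of the sub-directories of output_dir, sorted by name."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    # Plain files in the root are skipped; only directories are listed.
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Assumption: default to the documented path when the variable is unset.
root = os.environ.get("FLEXAI_OUTPUT_CHECKPOINT_DIR", "/output-checkpoint/")
for name in list_checkpoints(root):
    print(name)
```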

### Supported Libraries

FlexAI automatically detects and tracks checkpoints emitted through:

* Hugging Face [`safetensors`](https://huggingface.co/docs/safetensors/en/index)
* Hugging Face [`accelerate`](https://huggingface.co/docs/accelerate/en/usage_guides/checkpoint)
* Hugging Face [`transformers`](https://huggingface.co/docs/transformers/en/index) (e.g., [`Trainer.save_model`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.save_model))
  * Any other libraries built on top of the Hugging Face `transformers` ecosystem, such as:
    * [`PEFT`](https://huggingface.co/docs/peft/en/developer_guides/checkpoint)
    * [`TRL`](https://huggingface.co/docs/trl/main/en/index)
    * [`Diffusers`](https://huggingface.co/docs/diffusers/en/index)

<Tip>
  As long as your code writes to a sub-directory of `/output-checkpoint/`, the FlexAI Checkpoint Manager will handle the rest.
</Tip>

#### Using Hugging Face Transformers

If you're saving checkpoints at different steps, each one should go in its own sub-directory. With the `Trainer` class, pointing `output_dir` at the output directory is enough:

```python title="save_checkpoint_at_different_steps.py" theme={null}
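import os
from transformers import Trainer, TrainingArguments

# Write everything under the directory FlexAI tracks for output Checkpoints.
checkpoint_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]

training_args = TrainingArguments(
    output_dir=checkpoint_dir,  # highlight-line
)
trainer = Trainer(
    model=custom_model,  # your model, defined elsewhere
    args=training_args,
    train_dataset=train_dataset,  # your dataset, defined elsewhere
)
trainer.train()
trainer.save_model()  # saves the final model to the root of output_dir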
```

The Transformers library's `Trainer` class is designed so you only need to set the `output_dir` argument of `TrainingArguments` to `/output-checkpoint/`. This way, the main checkpoint is saved to the root of `/output-checkpoint/`, and the intermediate checkpoints written during training are saved as individual sub-directories.

##### File Structure

```text theme={null}
/output-checkpoint/
├── config.json
├── pytorch_model.bin
├── tokenizer.json
└── checkpoint-2000/
    ├── config.json
    ├── pytorch_model.bin
    └── tokenizer.json
```

> Note that Transformers-based frameworks often write a *special* Checkpoint to the root of the destination path: the one they consider the "latest" or "best" Checkpoint.
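
Because the root of the output path can hold a "latest" or "best" copy next to the per-step sub-directories, code that needs a specific step has to tell the two apart. A minimal sketch, assuming step sub-directory names end in `-<step>` as in the tree above; `latest_step_checkpoint` is a hypothetical helper, not a FlexAI or Transformers function.

```python
import re
from pathlib import Path

# Assumption: step checkpoints are sub-directories whose names end in
# "-<step>", such as "step-2000" or "checkpoint-2000".
STEP_SUFFIX = re.compile(r"-(\d+)$")


def latest_step_checkpoint(root):
    """Return the sub-directory of root with the highest step, or None."""
    best_step, best_path = -1, None
    for path in Path(root).iterdir():
        match = STEP_SUFFIX.search(path.name)
        # Files in the root (e.g. config.json) are ignored.
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

Steps are compared as integers, so `step-2000` ranks above `step-500` even though it sorts earlier alphabetically.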

***

#### Using PyTorch's `torch.save` function

The FlexAI Checkpoint Manager also supports two common patterns for saving model checkpoints with PyTorch's `torch.save` function: the **recommended** directory-based mode and the **legacy** flat file mode.

##### Directory-based mode (recommended)

Create a sub-directory inside `/output-checkpoint` and save your files there. Everything inside the sub-directory is grouped and tracked as a single checkpoint:

```python title="torch_save__directory_based.py" theme={null}
import os
from pathlib import Path
import torch

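output_dir = Path(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"])
# One sub-directory per checkpoint; everything inside it is tracked together.
ckpt_path = output_dir / "step-1000"
ckpt_path.mkdir(parents=True, exist_ok=True)

# `model` is your torch.nn.Module, defined elsewhere.
torch.save(model.state_dict(), ckpt_path / "model.pt")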
```
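
Since everything inside the sub-directory is grouped as a single checkpoint, you can keep related files next to the weights. A minimal sketch under stated assumptions: `save_checkpoint`, the `step-<step>` naming, and the `metadata.json` file are illustrative choices, not a FlexAI convention. The weight-saving call is passed in as a callable so the helper itself does not depend on PyTorch.

```python
import json
from pathlib import Path


def save_checkpoint(output_root, step, save_weights, metadata):
    """Create <output_root>/step-<step>/ and write weights plus metadata into it.

    save_weights is any callable that takes a destination path, for example
    lambda path: torch.save(model.state_dict(), path).
    """
    ckpt_dir = Path(output_root) / f"step-{step}"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    save_weights(ckpt_dir / "model.pt")
    # Hypothetical side file: any extra files land in the same checkpoint.
    (ckpt_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return ckpt_dir
```

In a training loop this could be called as `save_checkpoint(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"], step, lambda p: torch.save(model.state_dict(), p), {"step": step})`.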

##### Flat file mode (legacy)

You can also write files directly into `/output-checkpoint`. In this case, each file is treated as its own checkpoint. This works, but may lead to ambiguity if other folder-based checkpoints are present, so it is discouraged.

```python title="torch_save__flat_file.py" theme={null}
import os
from pathlib import Path
import torch

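output_dir = Path(os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"])
# No sub-directory is created: the file itself is tracked as the checkpoint.
ckpt_path = output_dir / "step-1000"

# `model` is your torch.nn.Module, defined elsewhere.
torch.save(model.state_dict(), ckpt_path)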
```

***

## Checkpoint Loading

### Input Checkpoint

When it builds the runtime environment for a Training or Fine-tuning workload, FlexAI mounts the **Checkpoint** specified during the resource creation process under the path given by the `FLEXAI_INPUT_CHECKPOINT_DIR` environment variable.

***

<Accordion title="Adding a Checkpoint to a Fine-tuning workload">
  <Tabs>
    <Tab title="Using the FlexAI Console">
      In the [Start a new Fine-tuning Job](https://console.flex.ai/s/fine-tunings/new-fine-tuning) form of the FlexAI Console, you can select a Checkpoint from the **Checkpoint** menu.
    </Tab>

    <Tab title="Using the FlexAI CLI">
      When creating a Fine-tuning workload using the FlexAI CLI, you can specify a Checkpoint by passing its UUID or name to the `-C/--checkpoint` flag:

      ```bash theme={null}
      flexai training run <fine_tuning_job_name> \
        --checkpoint <checkpoint_name_or_uuid> \
        ...
      ```

      Example:

      ```bash title="Creating a Fine-tuning Job with a Checkpoint" theme={null}
      flexai training run ft_mistral7b_01 \
        --dataset FreedomIntelligence__medical-o1-reasoning-SFT=medical \
        --repository-url https://github.com/funnierinspanish/Mistral-7b_01-med-ft \
        --repository-revision new_med_settings \
        --checkpoint=mistral7b_01_base \
        --secret HF_TOKEN=hf_token \
        --accels 4 \
        -- fine-tune.py mistral_config --dataset=medical
      ```
    </Tab>
  </Tabs>
</Accordion>

Assuming you picked a Checkpoint named `mistral7b_01_base`, the directory structure would look like this:

```text theme={null}
/input-checkpoint/
└── mistral7b_01_base/
    ├── config.json
    ├── pytorch_model.bin
    └── tokenizer.json
```

### Loading a Checkpoint

Your code should load the Checkpoint directly from `$FLEXAI_INPUT_CHECKPOINT_DIR/<checkpoint_name>`:

```python title="load_checkpoint.py" theme={null}
import os
from transformers import AutoModelForCausalLM

ckpt_input = os.environ["FLEXAI_INPUT_CHECKPOINT_DIR"]
ckpt_input_dir = os.path.join(ckpt_input, "mistral7b_01_base")
model = AutoModelForCausalLM.from_pretrained(ckpt_input_dir)
```
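
If you would rather not hard-code the Checkpoint name, the mounted directory can be looked up at runtime. A minimal sketch, assuming exactly one Checkpoint is mounted under the input directory, as in the tree above; `resolve_input_checkpoint` is a hypothetical helper, not part of FlexAI.

```python
from pathlib import Path


def resolve_input_checkpoint(input_root, name=None):
    """Return the mounted Checkpoint directory under input_root.

    If name is None, the single sub-directory of input_root is used.
    """
    root = Path(input_root)
    if name is not None:
        path = root / name
        if not path.is_dir():
            raise FileNotFoundError(f"no Checkpoint directory at {path}")
        return path
    # Assumption: exactly one Checkpoint is mounted per workload.
    subdirs = sorted(p for p in root.iterdir() if p.is_dir())
    if len(subdirs) != 1:
        raise RuntimeError(
            f"expected one Checkpoint under {root}, found {len(subdirs)}"
        )
    return subdirs[0]
```

The returned path can be passed straight to `AutoModelForCausalLM.from_pretrained`, for example `from_pretrained(resolve_input_checkpoint(os.environ["FLEXAI_INPUT_CHECKPOINT_DIR"]))`.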

Now you can continue your Training Job from the loaded Checkpoint or Fine-tune it further.