Command: checkpoint
A Checkpoint is an FCS entity that represents a snapshot of a model's state at a given point in time.
Checkpoints are captured at various stages of training and include model weights, optimizer state, and other relevant training data. This lets you resume training from a specific point, which prevents data loss, enables experimentation with different training paths, and helps you avoid repeating training iterations unnecessarily.
Checkpoints can be pushed to FCS directly from the host machine running the FlexAI CLI or from a Remote Storage Provider connection, such as Amazon S3, Cloudflare R2, or GCP Cloud Storage, among others. They can be individual files or entire directories.
You will find more information about Managed Checkpoints and the benefits they bring to your AI training workflows in the Managed Checkpoints guide.
You can manage Checkpoints using the flexai checkpoint set of subcommands.
checkpoint delete
Deletes a checkpoint.
flexai checkpoint delete <checkpoint_name>
Arguments
Argument | Description | Example |
---|---|---|
checkpoint_name | The name of the Checkpoint resource | mistral-500-checkpoint |
Example
flexai checkpoint delete mistral-500-checkpoint
checkpoint export
Uploads a Checkpoint generated by a Training Job to a Remote Storage Provider connection, such as Amazon S3, Cloudflare R2, or GCP Cloud Storage.
flexai checkpoint export <checkpoint_UUID> --storage-provider <storage_provider_name> --destination-path <destination_path>
Arguments
Argument | Description | Example |
---|---|---|
checkpoint_UUID | UUID of a checkpoint generated by a Training Job. Training Job Checkpoints can be listed with the flexai training checkpoints command | 90c8f215-e131-4f9c-936c-12fe1fe9a6f1 |
Flags
Flag | Type | Optional / Required | Description | Example |
---|---|---|---|---|
--destination-path | String | Required | The destination path on the storage provider's bucket to upload the checkpoint files to | my-bucket/checkpoints/mistral-train/ |
--storage-provider | String | Required | The name of the Remote Storage Provider connection to be used to upload the files | aws-storage-conn-eu |
Example
flexai checkpoint export 90c8f215-e131-4f9c-936c-12fe1fe9a6f1 --storage-provider aws-storage-conn-eu --destination-path my-bucket/checkpoints/mistral-train/
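When a Training Job produces several Checkpoints, the export command above can be repeated once per UUID. The following is a minimal sketch, not an official helper: `export_checkpoints` and the `FLEXAI_BIN` override are hypothetical names introduced here, and placing each Checkpoint under its own `<destination_path>/<checkpoint_UUID>/` prefix is an assumption made for illustration.

```shell
#!/bin/sh
# FLEXAI_BIN is a hypothetical override used only by this sketch
# (for example FLEXAI_BIN=echo for a dry run); it is not a FlexAI CLI setting.
FLEXAI_BIN="${FLEXAI_BIN:-flexai}"

# export_checkpoints <storage_provider_name> <destination_path> <checkpoint_UUID>...
# Exports each Training Job checkpoint to its own sub-path (an assumed layout).
export_checkpoints() {
    provider="$1"
    dest="${2%/}"          # drop one trailing slash, if any
    shift 2
    for uuid in "$@"; do
        "$FLEXAI_BIN" checkpoint export "$uuid" \
            --storage-provider "$provider" \
            --destination-path "$dest/$uuid/" || return 1
    done
}

# Usage (UUIDs come from `flexai training checkpoints`):
#   export_checkpoints aws-storage-conn-eu my-bucket/checkpoints/mistral-train/ \
#       90c8f215-e131-4f9c-936c-12fe1fe9a6f1
```

The loop stops at the first failed export so that a misconfigured Remote Storage Provider connection is noticed early.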
checkpoint fetch
Downloads a Checkpoint from FCS to the host machine running the FlexAI CLI.
flexai checkpoint fetch ( <checkpoint_name> | <checkpoint_UUID> ) [ --destination <destination_path> ]
Arguments
Argument | Description | Example |
---|---|---|
checkpoint_name | The name of a Checkpoint created using flexai checkpoint push. Pushed Checkpoints can be listed using the flexai checkpoint list command | mistral-500-checkpoint |
checkpoint_UUID | UUID of a checkpoint generated by a Training Job. Training Job Checkpoints can be listed with the flexai training checkpoints command | 90c8f215-e131-4f9c-936c-12fe1fe9a6f1 |
Flags
Flag | Type | Optional / Required | Description | Example |
---|---|---|---|---|
-d, --destination | String | Optional | Destination path to save the checkpoint. The current working directory will be used by default | /saved-checkpoints/mistral/ |
Example
flexai checkpoint fetch mistral-500-checkpoint
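The example above downloads into the current working directory. As a hedged sketch of the optional --destination flag, the helper below fetches a Checkpoint into a dedicated directory. `fetch_checkpoint` and `FLEXAI_BIN` are hypothetical names used only here, and creating the directory beforehand is a precaution assumed by this sketch rather than a documented requirement.

```shell
#!/bin/sh
# FLEXAI_BIN is a hypothetical override used only by this sketch
# (for example FLEXAI_BIN=echo for a dry run); it is not a FlexAI CLI setting.
FLEXAI_BIN="${FLEXAI_BIN:-flexai}"

# fetch_checkpoint <checkpoint_name | checkpoint_UUID> <destination_path>
fetch_checkpoint() {
    name="$1"
    dest="$2"
    # Assumption: make sure the destination exists before downloading.
    mkdir -p "$dest" || return 1
    "$FLEXAI_BIN" checkpoint fetch "$name" --destination "$dest"
}

# Usage:
#   fetch_checkpoint mistral-500-checkpoint /saved-checkpoints/mistral/
```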
checkpoint inspect
Returns detailed information about a Checkpoint, including its metadata, status, creation time, and file list. The output is in YAML format by default.
flexai checkpoint inspect <checkpoint_name> [--json]
Arguments
Argument | Description | Example |
---|---|---|
checkpoint_name | The name of the Checkpoint resource. Must follow the FCS resource naming conventions | mistral-500-checkpoint |
Flags
Flag | Type | Optional / Required | Description | Example |
---|---|---|---|---|
--json | Flag | Optional | Output the information in JSON format | --json |
Example
flexai checkpoint inspect mistral-500-checkpoint
Returned information
Field | Description | Data Type |
---|---|---|
kind | The type of resource | String |
metadata | Metadata about the checkpoint | Object |
metadata.name | The name of the checkpoint | String |
metadata.id | The unique identifier of the checkpoint | String (UUID) |
metadata.creatorUserID | The user ID of the checkpoint creator | String (UUID) |
metadata.ownerOrgID | The organization ID that owns the checkpoint | String (UUID) |
spec | Checkpoint contents details | Object |
spec.fromLocalFiles | A list with the paths of the files used to create this checkpoint | String List |
spec.storageProvider | The name of the Remote Storage Provider connection, if any | String |
spec.sourcePath | The path to the bucket and file or directory on the Remote Storage Provider connection, if any | String |
status | Status information of the checkpoint | Object |
status.status | The current status of the checkpoint | String |
status.storageProviderID | The ID of the Remote Storage Provider connection used to upload the checkpoint, if any | String (UUID) |
status.size | The total size of the checkpoint | String (File Size) |
status.files | A list of files with their paths and sizes | Object List |
status.files.path | The path of the file within the checkpoint | String (File Path) |
status.files.size | The size of the file | String (File Size) |
status.createdAt | The timestamp when the checkpoint was created | String (ISO 8601) |
status.updatedAt | The timestamp when the checkpoint was last updated | String (ISO 8601) |
Example
kind: Checkpoint
metadata:
  name: test-nanogpt-run-1
  id: 431a0ecb-cd8f-4508-8ae3-0990833a4f16
  creatorUserID: bd67af19-2599-4a57-832e-a1ac042f48be
  ownerOrgID: 270a5476-b91a-442f-8a13-852ef7bb5b94
spec:
  fromLocalFiles:
    - output/checkpoint_20250203_130043/ckpt.pt
  storageProvider: ""
  sourcePath: ""
status:
  status: available
  files:
    - path: ckpt.pt
      size: 343.79 MB
  storageProviderID: 00000000-0000-0000-0000-000000000000
  size: 343.79 MB
  createdAt: "2025-02-03T13:14:37.730471Z"
  updatedAt: "2025-02-03T13:14:54.591148Z"
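Scripts often need only the status.status field from this output. The sketch below reads the YAML from standard input and prints that field. It assumes the nested fields are indented under their top-level keys, as is usual for YAML; `checkpoint_status` is a hypothetical helper name, not part of the FlexAI CLI.

```shell
#!/bin/sh
# checkpoint_status: reads `flexai checkpoint inspect` YAML on stdin and
# prints the value of status.status.
checkpoint_status() {
    awk '
        # Top-level keys start in column 1; remember whether we are inside "status:".
        /^[^ \t]/ { in_status = ($1 == "status:"); next }
        # The indented "status:" key inside the top-level status object.
        in_status && $1 == "status:" { print $2; exit }
    '
}

# Usage:
#   flexai checkpoint inspect mistral-500-checkpoint | checkpoint_status
```

If jq is installed, piping the --json output through `jq -r '.status.status'` may be a simpler alternative, assuming the JSON output mirrors the YAML field names.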
checkpoint list
Lists all the available Checkpoints.
flexai checkpoint list
Example
flexai checkpoint list
NAME | FILES COUNT | TOTAL SIZE | STATUS | CREATED AT
--------------------------------+-------------+------------+-----------+-------------
base-llama2-pretrained-custom | 1 | 17 GB | available | 16d
mistral-ft-owt | 2 | 122 GB | available | 2h
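The table output can be filtered with standard tools. As a sketch that assumes the column layout shown above (pipe-separated columns, with STATUS in the fourth column), the hypothetical helper `available_checkpoints` prints the names of the Checkpoints whose status is available.

```shell
#!/bin/sh
# available_checkpoints: reads `flexai checkpoint list` output on stdin and
# prints the NAME of every row whose STATUS column is "available".
available_checkpoints() {
    awk -F'|' '
        {
            # Trim the padding around every column.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
        }
        # Rows with fewer than 5 columns (such as the separator) and the header are skipped.
        NF >= 5 && $1 != "NAME" && $4 == "available" { print $1 }
    '
}

# Usage:
#   flexai checkpoint list | available_checkpoints
```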
checkpoint push
Pushes a new Checkpoint to FCS from either the host machine running the FlexAI CLI or from a Remote Storage Provider connection, such as Amazon S3, Cloudflare R2, or GCP Cloud Storage.
flexai checkpoint push <checkpoint_name> [(--file <path_on_filesystem>=<checkpoint_path> ...) | (--storage-provider <storage_provider_name> --source-path <source_path>)]
Arguments
Argument | Description | Example |
---|---|---|
checkpoint_name | Resource name. Must follow the FCS resource naming conventions | mistral-500-checkpoint |
Flags
Flag | Type | Optional / Required | Description | Example |
---|---|---|---|---|
-f, --file | Key/Value mapping | Optional | The local source path and the destination path on FCS Storage for the checkpoint files, in the format <source_path>=<fcs_checkpoint_path> | -f output/ckpt.pt=sd-xl-2500.ckpt, -f output/ckpt.pt |
--source-path | String | Optional | The path to the checkpoint files on the storage provider's bucket. It can be a single file or a directory | my-bucket/checkpoints/mistral-train/, my-bucket/checkpoints/mistral-train/ckpt_15600.pt |
--storage-provider | String | Optional | The name of the Remote Storage Provider connection to be used to get the checkpoint file(s) | aws-storage-conn-eu |
Example
Pushing a single checkpoint file from the host machine:
flexai checkpoint push mistral-500-checkpoint-local --file output/ckpt.pt=sd-xl-2500.ckpt
Pushing multiple checkpoint files from a Remote Storage Provider connection:
flexai checkpoint push mistral-500-checkpoint-s3 --storage-provider aws-storage-conn-eu --source-path my-bucket/checkpoints/mistral-train/
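The command synopsis indicates that --file can be repeated. As a rough sketch under that reading, the helper below pushes several local files as one Checkpoint, passing one --file flag per file. `push_local_files` and `FLEXAI_BIN` are hypothetical names, and storing each file under its base name inside the Checkpoint is an assumption chosen for illustration.

```shell
#!/bin/sh
# FLEXAI_BIN is a hypothetical override used only by this sketch
# (for example FLEXAI_BIN=echo for a dry run); it is not a FlexAI CLI setting.
FLEXAI_BIN="${FLEXAI_BIN:-flexai}"

# push_local_files <checkpoint_name> <local_file>...
push_local_files() {
    name="$1"
    shift
    # Fail early if any of the given paths is not a regular file.
    for f in "$@"; do
        [ -f "$f" ] || { echo "not a file: $f" >&2; return 1; }
    done
    # Rewrite the positional parameters as repeated
    # --file <source_path>=<fcs_checkpoint_path> pairs.
    n=$#
    while [ "$n" -gt 0 ]; do
        f="$1"
        shift
        set -- "$@" --file "$f=$(basename "$f")"
        n=$((n - 1))
    done
    "$FLEXAI_BIN" checkpoint push "$name" "$@"
}

# Usage:
#   push_local_files mistral-500-checkpoint-local output/ckpt.pt output/optimizer.pt
```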