Command: checkpoint

A Checkpoint is an FCS entity that represents a snapshot of a model's state at a given point in time.

Checkpoints capture a model's state at various stages of training. These snapshots include model weights, optimizer state, and other relevant training data. This lets you resume training from a specific point, which guards against lost progress, avoids repeating training iterations unnecessarily, and makes it possible to experiment with different training paths.

Checkpoints can be pushed to FCS directly from the host machine running the FlexAI CLI or from a Remote Storage Provider connection, such as Amazon S3, Cloudflare R2, or GCP Cloud Storage, among others. They can be individual files or entire directories.

For more information about Managed Checkpoints and the benefits they bring to your AI training workflows, see the Managed Checkpoints guide.

You can manage Checkpoints using the flexai checkpoint set of subcommands.

checkpoint delete

Deletes a checkpoint.

flexai checkpoint delete <checkpoint_name>

Arguments

Argument | Description | Example
checkpoint_name | The name of the Checkpoint resource | mistral-500-checkpoint

Example

flexai checkpoint delete mistral-500-checkpoint

checkpoint export

Uploads a Checkpoint generated by a Training Job to a Remote Storage Provider connection, such as Amazon S3, Cloudflare R2, or GCP Cloud Storage.

flexai checkpoint export <checkpoint_UUID> --storage-provider <storage_provider_name> --destination-path <destination_path>

Arguments

Argument | Description | Example
checkpoint_UUID | UUID of a checkpoint generated by a Training Job. Training Job Checkpoints can be listed with the flexai training checkpoints command | 90c8f215-e131-4f9c-936c-12fe1fe9a6f1

Flags

Flag | Type | Optional / Required | Description | Example
--destination-path | String | Required | The destination path on the storage provider's bucket to upload the checkpoint files to | my-bucket/checkpoints/mistral-train/
--storage-provider | String | Required | The name of the Remote Storage Provider connection to be used to upload the files | aws-storage-conn-eu

Example

flexai checkpoint export 90c8f215-e131-4f9c-936c-12fe1fe9a6f1 --storage-provider aws-storage-conn-eu --destination-path my-bucket/checkpoints/mistral-train/
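When a Training Job produces several checkpoints, the export command can be wrapped in a loop. The following is a minimal sketch, not part of the FlexAI CLI: the function name export_checkpoints and the one-subdirectory-per-UUID layout are assumptions made for illustration. Only the flexai checkpoint export flags are taken from the reference above.

```shell
# Hypothetical helper (not part of the FlexAI CLI).
# Usage: export_checkpoints <storage_provider_name> <destination_prefix> <checkpoint_UUID>...
export_checkpoints() {
  storage_provider=$1
  destination_prefix=$2
  shift 2

  for checkpoint_uuid in "$@"; do
    # Each checkpoint goes to its own subdirectory under the prefix
    # (an illustrative layout choice, not a CLI requirement).
    # Stopping on the first error assumes the CLI exits non-zero on failure.
    flexai checkpoint export "$checkpoint_uuid" \
      --storage-provider "$storage_provider" \
      --destination-path "${destination_prefix%/}/$checkpoint_uuid/" || return 1
  done
}
```

For instance, export_checkpoints aws-storage-conn-eu my-bucket/checkpoints/mistral-train 90c8f215-e131-4f9c-936c-12fe1fe9a6f1 would run a single export into my-bucket/checkpoints/mistral-train/90c8f215-e131-4f9c-936c-12fe1fe9a6f1/.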

checkpoint fetch

Downloads a Checkpoint from FCS to the host machine running the FlexAI CLI.

flexai checkpoint fetch ( <checkpoint_name> | <checkpoint_UUID> ) [ --destination <destination_path> ]

Arguments

Argument | Description | Example
checkpoint_name | The name of a Checkpoint created using flexai checkpoint push. Pushed Checkpoints can be listed using the flexai checkpoint list command | mistral-500-checkpoint
checkpoint_UUID | UUID of a checkpoint generated by a Training Job. Training Job Checkpoints can be listed with the flexai training checkpoints command | 90c8f215-e131-4f9c-936c-12fe1fe9a6f1

Flags

Flag | Type | Optional / Required | Description | Example
-d, --destination | String | Optional | Destination path to save the checkpoint. The current working directory will be used by default | /saved-checkpoints/mistral/

Example

flexai checkpoint fetch mistral-500-checkpoint
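The example above saves into the current working directory. To fetch into a dedicated directory instead, the --destination flag can be combined with a directory-creation step. This is a sketch under assumptions: fetch_checkpoint_to is a hypothetical helper name, and creating the directory beforehand is a precaution, since this page does not say whether the CLI creates a missing destination itself.

```shell
# Hypothetical helper (not part of the FlexAI CLI).
# Usage: fetch_checkpoint_to <checkpoint_name | checkpoint_UUID> <destination_dir>
fetch_checkpoint_to() {
  checkpoint_ref=$1
  destination_dir=$2

  # Make sure the destination exists before downloading into it.
  mkdir -p "$destination_dir" || return 1

  flexai checkpoint fetch "$checkpoint_ref" --destination "$destination_dir"
}
```

For instance, fetch_checkpoint_to mistral-500-checkpoint ./saved-checkpoints/mistral would create the directory if needed and then download the checkpoint into it.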

checkpoint inspect

Returns detailed information about a Checkpoint: its creation time, metadata, status, file content list, and more. The information is output in YAML format by default.

flexai checkpoint inspect <checkpoint_name> [--json]

Arguments

Argument | Description | Example
checkpoint_name | The name of the Checkpoint resource. Must follow the FCS resource naming conventions | mistral-500-checkpoint

Flags

Flag | Type | Optional / Required | Description | Example
--json | Flag | Optional | Output the information in JSON format | --json

Example

flexai checkpoint inspect mistral-500-checkpoint

Returned information

Field | Description | Data Type
kind | The type of resource | String
metadata | Metadata about the checkpoint | Object
metadata.name | The name of the checkpoint | String
metadata.id | The unique identifier of the checkpoint | String (UUID)
metadata.creatorUserID | The user ID of the checkpoint creator | String (UUID)
metadata.ownerOrgID | The organization ID that owns the checkpoint | String (UUID)
spec | Checkpoint contents details | Object
spec.fromLocalFiles | A list with the paths of the files used to create this checkpoint | String List
spec.storageProvider | The name of the Remote Storage Provider connection, if any | String
spec.sourcePath | The path to the bucket and file or directory on the Remote Storage Provider connection, if any | String
status | Status information of the checkpoint | Object
status.status | The current status of the checkpoint | String
status.storageProviderID | The ID of the Remote Storage Provider connection used to upload the checkpoint, if any | String (UUID)
status.size | The total size of the checkpoint | String (File Size)
status.files | A list of files with their paths and sizes | Object List
status.files.path | The path of the file within the checkpoint | String (File Path)
status.files.size | The size of the file | String (File Size)
status.createdAt | The timestamp when the checkpoint was created | String (ISO 8601)
status.updatedAt | The timestamp when the checkpoint was last updated | String (ISO 8601)

Example

kind: Checkpoint
metadata:
  name: test-nanogpt-run-1
  id: 431a0ecb-cd8f-4508-8ae3-0990833a4f16
  creatorUserID: bd67af19-2599-4a57-832e-a1ac042f48be
  ownerOrgID: 270a5476-b91a-442f-8a13-852ef7bb5b94
spec:
  fromLocalFiles:
    - output/checkpoint_20250203_130043/ckpt.pt
  storageProvider: ""
  sourcePath: ""
status:
  status: available
  files:
    - path: ckpt.pt
      size: 343.79 MB
  storageProviderID: 00000000-0000-0000-0000-000000000000
  size: 343.79 MB
  createdAt: "2025-02-03T13:14:37.730471Z"
  updatedAt: "2025-02-03T13:14:54.591148Z"
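In scripts it is often enough to read a single field, such as status.status, from the inspect output. The sketch below pulls that field out of the default YAML output with awk. It assumes the CLI prints standard indented YAML with status.status nested directly under the top-level status key, as in the field table above. The function name checkpoint_status is hypothetical, and a proper YAML or JSON parser would be more robust than this line-based match.

```shell
# Hypothetical helper (not part of the FlexAI CLI).
# Prints the value of status.status for the given checkpoint.
checkpoint_status() {
  flexai checkpoint inspect "$1" |
    awk '
      # Remember when the top-level "status:" block starts.
      /^status:/ { in_status = 1; next }
      # The first indented "status:" line after that holds the value.
      in_status && /^[ \t]+status:/ { print $2; exit }
    '
}
```

For instance, checkpoint_status mistral-500-checkpoint would print the checkpoint's current status on a line of its own.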

checkpoint list

Lists all the available Checkpoints.

flexai checkpoint list

Example

flexai checkpoint list

 NAME                           | FILES COUNT | TOTAL SIZE | STATUS    | CREATED AT
--------------------------------+-------------+------------+-----------+-------------
 base-llama2-pretrained-custom  | 1           | 17 GB      | available | 16d
 mistral-ft-owt                 | 2           | 122 GB     | available | 2h
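The table output can be filtered with standard text tools. The following sketch prints only the names of Checkpoints whose STATUS column reads available. It assumes the pipe-separated layout shown above, with the name in the first column and the status in the fourth, and the function name available_checkpoints is hypothetical. If the CLI's output format differs or changes, the column positions would need adjusting.

```shell
# Hypothetical helper (not part of the FlexAI CLI).
# Prints the NAME of every checkpoint whose STATUS column is "available".
available_checkpoints() {
  flexai checkpoint list |
    awk -F'|' 'NF >= 5 {
      name = $1
      status = $4
      # Trim the padding around both cells.
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", status)
      # The header row (STATUS) and the dashed separator never match.
      if (status == "available") print name
    }'
}
```

Each printed name can then be passed to another subcommand, for example flexai checkpoint inspect.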

checkpoint push

Pushes a new Checkpoint to FCS from either the host machine running the FlexAI CLI or a Remote Storage Provider connection, such as Amazon S3, Cloudflare R2, or GCP Cloud Storage.

flexai checkpoint push <checkpoint_name> [(--file <path_on_filesystem>=<checkpoint_path> ...) | (--storage-provider <storage_provider_name> --source-path <source_path>)]

Arguments

Argument | Description | Example
checkpoint_name | Resource name. Must follow the FCS resource naming conventions | mistral-500-checkpoint

Flags

Flag | Type | Optional / Required | Description | Example
-f, --file | Key/Value mapping | Optional | Local source path and the destination path on FCS Storage to upload the checkpoint files to, in the format <source_path>=<fcs_checkpoint_path> | -f output/ckpt.pt=sd-xl-2500.ckpt, -f output/ckpt.pt
--source-path | String | Optional | The path to the checkpoint files on the storage provider's bucket. It can be a single file or a directory | my-bucket/checkpoints/mistral-train/, my-bucket/checkpoints/mistral-train/ckpt_15600.pt
--storage-provider | String | Optional | The name of the Remote Storage Provider connection to be used to get the checkpoint file(s) | aws-storage-conn-eu

Example

Pushing a single checkpoint file from the host machine:

flexai checkpoint push mistral-500-checkpoint-local --file output/ckpt.pt=sd-xl-2500.ckpt

Pushing multiple checkpoint files from a Remote Storage Provider connection:

flexai checkpoint push mistral-500-checkpoint-s3 --storage-provider aws-storage-conn-eu --source-path my-bucket/checkpoints/mistral-train/
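The usage line above shows --file followed by an ellipsis, which suggests the flag can be repeated to push several local files into one Checkpoint. The sketch below builds one --file mapping per local path, using each file's base name as its path inside the checkpoint. Both the helper name push_local_files and the base-name mapping are assumptions made for illustration, not documented CLI behavior.

```shell
# Hypothetical helper (not part of the FlexAI CLI).
# Usage: push_local_files <checkpoint_name> <local_file>...
push_local_files() {
  checkpoint_name=$1
  shift
  file_count=$#

  # Append one "--file <source_path>=<fcs_checkpoint_path>" pair per input path.
  # The loop's word list is expanded once, so appending inside it is safe.
  for path in "$@"; do
    set -- "$@" --file "$path=$(basename "$path")"
  done

  # Drop the original paths, keeping only the generated flags.
  shift "$file_count"

  flexai checkpoint push "$checkpoint_name" "$@"
}
```

For instance, push_local_files mistral-500-checkpoint-local output/ckpt.pt output/meta.json would run a single push with two --file mappings.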