Checkpoint Manager

The FlexAI Checkpoint Manager automates the capture, storage, parsing, and versioning of FlexAI-generated Checkpoints from Training or Fine-tuning Jobs.

You can use the FlexAI Checkpoint Manager to manually push existing Checkpoints to FlexAI, fetch Checkpoints to a local machine, or export them to supported external Cloud Storage Providers.

FlexAI Managed Checkpoints are a powerful feature that automatically saves and manages checkpoints during a Training Job’s execution.

This feature simplifies saving and retrieving the state of a model during training: you can resume training from a specific point in time to evaluate the model’s performance, restart a Training Job from a Checkpoint taken before an interruption, or roll back to a previous state if needed.

Managed Checkpoints

Automatic checkpoint capture and secure storage, with no manual intervention required

Live Checkpoint Access

Access checkpoints from running Training and Fine-tuning Jobs for real-time evaluation and analysis

Flexible Export

Export checkpoints to supported Cloud Storage Providers or download them locally

Training Continuity

Resume Training and Fine-tuning Jobs from a checkpoint with full state preservation

Continue training from any checkpoint with full state preservation:

flexai training run continued-training \
--checkpoint 3784735b-d7c6-4978-bd76-c6e9158d2ccc \
--dataset fineweb \
--repository-url https://github.com/flexaihq/flexbasemodel_t2t \
-- train.py --continue_from=3784735b-d7c6-4978-bd76-c6e9158d2ccc --checkpoint_path=/input-checkpoint/

The 3784735b-d7c6-4978-bd76-c6e9158d2ccc UUID corresponds to the unique identifier the FlexAI Checkpoint Manager assigned to the Checkpoint when it was created. You can also use a Checkpoint name if you manually pushed it.
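Everything after the `--` separator is passed directly to your entry-point script. As a hypothetical sketch, `train.py` might parse those arguments like this; the flag names `--continue_from` and `--checkpoint_path` come from the command above, while the rest is illustrative:

```python
# Hypothetical argument parsing for train.py. The flag names come from the
# flexai command above; defaults and help text are illustrative assumptions.
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser(
        description="Resume training from a FlexAI-provided checkpoint"
    )
    parser.add_argument(
        "--continue_from",
        default=None,
        help="UUID (or name) of the checkpoint to resume from",
    )
    parser.add_argument(
        "--checkpoint_path",
        default="/input-checkpoint/",
        help="Directory where the checkpoint is made available to the job",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    import sys

    args = parse_args(sys.argv[1:])
    print(f"Resuming from {args.continue_from} at {args.checkpoint_path}")
```

Your script can then use `args.checkpoint_path` to locate the checkpoint files and restore the training state.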

Use checkpoints as the starting point of a Fine-tuning Job:

flexai training run fine-tune-from-checkpoint \
--checkpoint mistral_base_checkpoint \
--dataset fineweb \
--repository-url https://github.com/flexaihq/mistral-ft \
-- fine-tune.py --checkpoint_name=mistral_base_checkpoint --checkpoint_path=/input-checkpoint/

The mistral_base_checkpoint name corresponds to the name manually assigned to the Checkpoint when it was pushed to FlexAI. You can also use the UUID of a Checkpoint captured by the FlexAI Checkpoint Manager.

The workflow is identical in both cases; the only difference is the script you pass as the entry point. That script is responsible for loading the Checkpoint from the /input-checkpoint/ path and either resuming the previous Training Job or initiating a model Fine-tuning task.
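As a minimal sketch of that loading step, an entry-point script might look for the newest file under the checkpoint directory and deserialize it. The file layout and serialization format here are assumptions: they depend on how your training code originally wrote the checkpoint (e.g. torch.save, safetensors, or plain pickle), so adapt the loader accordingly:

```python
# Illustrative checkpoint-loading helper for an entry-point script.
# The directory layout and pickle format are assumptions; swap pickle.load
# for torch.load / safetensors as appropriate for your framework.
import os
import pickle


def load_latest_checkpoint(checkpoint_dir):
    """Return the deserialized state from the newest file in checkpoint_dir,
    or None if the directory is missing or empty."""
    if not os.path.isdir(checkpoint_dir):
        return None
    files = [
        os.path.join(checkpoint_dir, name)
        for name in os.listdir(checkpoint_dir)
    ]
    files = [path for path in files if os.path.isfile(path)]
    if not files:
        return None
    latest = max(files, key=os.path.getmtime)  # pick the most recent file
    with open(latest, "rb") as fh:
        return pickle.load(fh)


state = load_latest_checkpoint("/input-checkpoint/")
if state is not None:
    # e.g. model.load_state_dict(state["model"]) and
    # optimizer.load_state_dict(state["optimizer"]) in a PyTorch script
    pass
```

From here, resuming training versus starting a fine-tune is purely a matter of what the script does with the restored state.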