Managed Checkpoints
Automatic checkpoint capture and secure storage, with no manual intervention required
The FlexAI Checkpoint Manager automates the capture, storage, parsing, and versioning of Checkpoints generated by FlexAI Training or Fine-tuning Jobs.
You can also use the FlexAI Checkpoint Manager to manually push existing Checkpoints to FlexAI, fetch Checkpoints to a local machine, or export them to supported external Cloud Storage Providers.
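From the command line, that flow looks roughly like the sketch below. The subcommand names and the `--file` flag shown here are assumptions for illustration only; this page does not document them, so check `flexai --help` for the actual syntax:

```bash
# Assumed subcommands for illustration; verify against `flexai --help`.
flexai checkpoints push my-base-checkpoint --file ./checkpoints/step-1000  # push a local Checkpoint to FlexAI
flexai checkpoints fetch my-base-checkpoint                                # fetch it back to the local machine
```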
FlexAI Managed Checkpoints automatically save and manage checkpoints during a Training Job's execution.
This feature simplifies saving and retrieving the state of a model during training: you can resume training from a specific point in time to evaluate the model's performance, restart a Training Job from a Checkpoint captured before an interruption, or roll back to a previous state if needed.
- Managed Checkpoints: Automatic checkpoint capture and secure storage, with no manual intervention required
- Live Checkpoint Access: Access checkpoints from running Training and Fine-tuning Jobs for real-time evaluation and analysis
- Flexible Export: Export checkpoints to supported Cloud Storage Providers or download them locally
- Training Continuity: Resume Training and Fine-tuning Jobs from a checkpoint with full state preservation
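"Full state preservation" typically covers more than model weights. As a point of reference, here is a minimal PyTorch sketch of the kind of state a training script might save at each checkpoint; the key names and file layout are illustrative choices made by your own script, not a FlexAI requirement:

```python
import torch

def save_checkpoint(path: str, model, optimizer, step: int) -> None:
    # Persist everything needed to resume exactly where training stopped:
    # model weights, optimizer state (momentum, LR schedule position, ...),
    # and the global step counter.
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "step": step,
        },
        path,
    )
```

Saving the optimizer state alongside the weights is what makes exact resumption possible: without it, momentum and the learning-rate schedule would restart from scratch.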
Continue training from any checkpoint with full state preservation:
```bash
flexai training run continued-training \
  --checkpoint 3784735b-d7c6-4978-bd76-c6e9158d2ccc \
  --dataset fineweb \
  --repository-url https://github.com/flexaihq/flexbasemodel_t2t \
  -- train.py --continue_from=3784735b-d7c6-4978-bd76-c6e9158d2ccc --checkpoint_path=/input-checkpoint/
```

The `3784735b-d7c6-4978-bd76-c6e9158d2ccc` UUID corresponds to the unique identifier the FlexAI Checkpoint Manager assigned to the Checkpoint when it was created. You can also use a Checkpoint name if you pushed the Checkpoint manually.
Use checkpoints as the starting point of a Fine-tuning Job:
```bash
flexai training run fine-tune-from-checkpoint \
  --checkpoint mistral_base_checkpoint \
  --dataset fineweb \
  --repository-url https://github.com/flexaihq/mistral-ft \
  -- fine-tune.py --checkpoint_name=mistral_base_checkpoint --checkpoint_path=/input-checkpoint/
```

The `mistral_base_checkpoint` name corresponds to the name manually assigned to the Checkpoint when it was pushed to FlexAI. You can also use the UUID of a Checkpoint captured by the FlexAI Checkpoint Manager.
As you can see, the workflow is identical in both cases. The only difference lies in the script you pass as the entry point: that script is responsible for loading the Checkpoint from the /input-checkpoint/ path and either continuing the previous Training Job or initiating a model Fine-tuning task.
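For example, the loading side of such an entry-point script might look like the following in PyTorch. This is a minimal sketch assuming the Checkpoint was saved as a single state-dict file; the `checkpoint.pt` file name and the argument handling are defined by your own training code, not by FlexAI:

```python
import argparse
import os
import torch

def load_checkpoint(model, optimizer):
    parser = argparse.ArgumentParser()
    # Matches the --checkpoint_path argument passed after `--` in the examples above.
    parser.add_argument("--checkpoint_path", default="/input-checkpoint/")
    args, _ = parser.parse_known_args()

    # FlexAI makes the Checkpoint's contents available at the given path;
    # the file name inside it is whatever the original training script saved.
    state = torch.load(os.path.join(args.checkpoint_path, "checkpoint.pt"))
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])

    # A continued-training script resumes from the saved step; a fine-tuning
    # script would typically ignore the step and start a fresh run.
    return state.get("step", 0)
```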