Checkpoints
FlexAI Managed Checkpoints automatically save and manage checkpoints during a Training Job’s execution. The feature simplifies saving and retrieving a model’s state during training: you can evaluate the model’s performance at a specific point in training, restart a Training Job from a Checkpoint captured before an interruption, or roll back to a previous state if needed.
Key Features
Managed Checkpoints
Automatic checkpoint capture and secure storage, with no manual intervention required
Live Checkpoint Access
Access checkpoints from running Training and Fine-tuning Jobs for real-time evaluation and analysis
Flexible Export
Export checkpoints to supported Cloud Storage Providers or download them locally
Training Continuity
Resume Training and Fine-tuning Jobs from a checkpoint with full state preservation
Training Integration
Resume Training
Continue training from any checkpoint with full state preservation:
The 3784735b-d7c6-4978-bd76-c6e9158d2ccc UUID is the unique identifier the FlexAI Checkpoint Manager assigned to the Checkpoint when it was created. If you pushed the Checkpoint manually, you can use its name instead.
Fine-Tuning Workflows
Use checkpoints as the starting point of a Fine-tuning Job:
The mistral_base_checkpoint name is the name manually assigned to the Checkpoint when it was pushed to FlexAI. You can also use the UUID of a Checkpoint captured by the FlexAI Checkpoint Manager.
The workflow is the same in both cases; the only difference is the script you pass as the entry point. That script is responsible for loading the Checkpoint from the /input-checkpoint/ path and then either continuing a previous Training Job or starting a Fine-tuning task.
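As a rough illustration of what such an entry point script might do first, here is a minimal Python sketch that looks for a checkpoint file under /input-checkpoint/. Only that path comes from this page. The function name find_checkpoint_file, the list of file extensions, and the "pick the most recently modified file" rule are all assumptions made for the example; the actual layout of the directory depends on how the Checkpoint was saved.

```python
from pathlib import Path
from typing import Optional

# Mount path named in the text above; everything else here is an assumption.
INPUT_CHECKPOINT_DIR = Path("/input-checkpoint")

# Hypothetical set of file extensions a checkpoint might use.
CHECKPOINT_SUFFIXES = (".pt", ".pth", ".bin", ".safetensors")


def find_checkpoint_file(root: Path = INPUT_CHECKPOINT_DIR) -> Optional[Path]:
    """Return the most recently modified checkpoint-like file under root, or None."""
    if not root.is_dir():
        return None
    candidates = [
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in CHECKPOINT_SUFFIXES
    ]
    if not candidates:
        return None
    # Assumption: the newest file is the one to resume from.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    checkpoint = find_checkpoint_file()
    if checkpoint is None:
        print("No input checkpoint found; starting from scratch.")
    else:
        print(f"Loading state from {checkpoint}")
        # Framework-specific loading would go here (model weights,
        # optimizer state, step counter), depending on your training code.
```

The same lookup can serve both cases described above: a resume script would restore the full training state from the file it finds, while a Fine-tuning script would typically load only the model weights before training on new data.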