2025-01-31
Highlights
- [New Feature] Managed Checkpoints: Simplify and accelerate your AI training workflows with secure, scalable, and reusable snapshots of model states.
- [New Feature] Live Checkpoint Capture: List checkpoints from an in-progress or completed Training Job and reuse them for evaluation or further fine-tuning without waiting for downloads or uploads.
- [Improved UX] Training Logs and Output Handling: Training logs now clearly indicate their start and end, and outputs retain directory structures when fetched.
Added
- Managed Checkpoints: Manage checkpoints with the `flexai checkpoint` command to list, upload, download, and export them. List checkpoints from an in-progress or completed Training Job and reuse them for evaluation, or resume training with the `--checkpoint <checkpoint_name_or_UID>` flag, without any manual data transfer.
- Fine-Tuning: Bring pre-trained models into FCS from a local machine or remote storage (e.g. AWS) once and reuse them for any of your fine-tuning workloads, saving on egress fees and transfer times.
- Inspect checkpoints: View detailed checkpoint information, including creation time and metadata, using `checkpoint inspect`.
- Check for updates: Use the `doctor` command to check for CLI updates and stay up to date with the latest features.
- Multi-GPU Interactive Training: Interactive Training now supports multi-GPU single-node setups. Use `flexai training debug-ssh --accels <1 to 8>` to debug and optimize distributed workloads.
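The checkpoint commands above could be chained roughly as follows. This is a minimal sketch, not the documented workflow: the job name `mistral_ft_34` is borrowed from the example later in these notes, `mistral_ft_34_resume` and the `run` dry-run helper are hypothetical, and the positional argument forms are assumptions. By default the script only prints the commands; set `DRY_RUN=0` to execute them once the placeholders are filled in.

```shell
#!/bin/sh
# Placeholder identifiers: replace with your own job and checkpoint.
JOB="mistral_ft_34"
CKPT="<checkpoint_name_or_UID>"

# Hypothetical helper: print each command instead of running it,
# unless DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# List the checkpoints of an in-progress or completed Training Job
# (passing the job name as a positional argument is an assumption).
run flexai training checkpoints "$JOB"

# View creation time and metadata for one checkpoint
# (positional checkpoint argument is an assumption).
run flexai checkpoint inspect "$CKPT"

# Start a new job from that checkpoint. "..." stands for the remaining
# training arguments, which these notes do not spell out; replace it
# before running with DRY_RUN=0.
run flexai training run "${JOB}_resume" --checkpoint "$CKPT" ...

# Open a multi-GPU, single-node interactive session (1 to 8 accelerators).
run flexai training debug-ssh --accels 4
```

Printing the commands first makes it easy to check the placeholders before anything is submitted.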
Changed
- Sorted Checkpoint List: The `training checkpoints` and `checkpoint list` commands sort checkpoints by creation time in ascending order, making it easier to locate relevant checkpoints.
- Directory Retention: Outputs fetched from training jobs or checkpoints keep their original directory structure, simplifying file management.
- Improved Help Text: Help messages for `push` commands provide clearer guidance.
- Improved Training flow UX: The `training run` output message now suggests the command to run to monitor the job's log output:
    flexai training run mistral_ft_34 ...
    Training job mistral_ft_34.
    Use 'flexai training logs mistral_ft_34' to follow the progress of your training job.
- Clear Logs Start: `training logs` clearly indicates the start point of the log stream.
Fixed
- Logs Termination UX: The log stream from `training logs` now terminates at the end of the Training Job.
- Relative Path Uploads: Fixed failures resolving relative paths for dataset and checkpoint uploads.