2025-01-31

Highlights

[New Feature] Managed Checkpoints: Simplify and accelerate your AI training workflows with secure, scalable, and reusable snapshots of model states.
[New Feature] Live Checkpoint Capture: List checkpoints from an in-progress or completed Training Job and reuse them for evaluation or further fine-tuning without waiting for downloads or uploads.
[Improved UX] Training Logs and Output Handling: Training logs now clearly indicate their start and end, and outputs retain directory structures when fetched.

Added

Managed Checkpoints: Use the flexai checkpoint command to list, upload, download, and export checkpoints. List checkpoints from an in-progress or completed Training Job and reuse them for evaluation, or resume training with the --checkpoint <checkpoint_name_or_UID> flag, without any manual data transfer.
Fine-Tuning: Bring pre-trained models into FCS from a local machine or remote storage (e.g. AWS) once and reuse them for any of your fine-tuning workloads, saving on egress fees and transfer times.
Inspect checkpoints: View detailed checkpoint information, including creation time and metadata, using checkpoint inspect.
Check for updates: Use the doctor command to check for CLI updates and stay up-to-date with the latest features.
Multi-GPU Interactive Training: Interactive Training now supports multi-GPU single-node setups. Use flexai training debug-ssh --accels <1 to 8> to debug and optimize distributed workloads.
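
As a rough illustration, the checkpoint features above might be combined as sketched below. The subcommands and the --checkpoint flag are the ones named in these notes; the argument order, the job name mistral_ft_35, and the run helper are assumptions made for the example. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run sketch: prints each command instead of executing it unless DRY_RUN=0.
# JOB, NEW_JOB and CKPT are placeholders; how each subcommand takes its
# arguments is an assumption, so check the CLI help before running for real.
JOB="mistral_ft_34"
NEW_JOB="mistral_ft_35"            # hypothetical name for the follow-up job
CKPT="<checkpoint_name_or_UID>"    # replace with a real checkpoint name or UID
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# List checkpoints from an in-progress or completed Training Job.
run flexai training checkpoints "$JOB"

# View creation time and metadata for one checkpoint.
run flexai checkpoint inspect "$CKPT"

# Start a new job from that checkpoint. The trailing "..." stands for the
# remaining arguments, which these notes elide; replace it before a real run.
run flexai training run "$NEW_JOB" --checkpoint "$CKPT" ...
```

Keeping the default dry-run mode makes it easy to review the exact commands before any job is launched.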

Changed

Sorted Checkpoint List: The training checkpoints and checkpoint list commands now sort checkpoints by creation time in ascending order, making it easier to locate relevant checkpoints.
Directory Retention: Outputs fetched from training jobs or checkpoints maintain their original directory structure, simplifying file management.
Improved Help Text: Help messages for push commands provide clearer guidance.

Improved Training Flow UX: The training run output message now suggests the command to use to monitor the job's logs.

flexai training run mistral_ft_34 ... 
  
Training job mistral_ft_34.  
Use 'flexai training logs mistral_ft_34' to follow the progress of your training job.  

Clear Logs Start: The training logs command now clearly indicates the start of the log stream.

Fixed

Logs Termination UX: The log stream from training logs now terminates when the Training Job ends.
Relative Path Uploads: Fixed path resolution failures when uploading datasets and checkpoints from relative paths.
