
2025-01-31

Highlights

  • [New Feature] Managed Checkpoints: Simplify and accelerate your AI training workflows with secure, scalable, and reusable snapshots of model states.
  • [New Feature] Live Checkpoint Capture: List checkpoints from an in-progress or completed Training Job and reuse them for evaluation or further fine-tuning without waiting for downloads or uploads.
  • [Improved UX] Training Logs and Output Handling: Training logs now clearly indicate their start and end, and outputs retain directory structures when fetched.

Added

  • Managed Checkpoints: Use the flexai checkpoint command to list, upload, download, and export checkpoints. You can also list the checkpoints of an in-progress or completed Training Job and reuse them for evaluation, or resume training with the --checkpoint <checkpoint_name_or_UID> flag, without any manual data transfer.
  • Fine-Tuning: Bring pre-trained models into FCS from a local machine or remote storage (e.g. AWS) once and reuse them for any of your fine-tuning workloads, saving on egress fees and transfer times.
  • Inspect checkpoints: View detailed checkpoint information, including creation time and metadata, using checkpoint inspect.
  • Check for updates: Use the doctor command to check for CLI updates and stay up-to-date with the latest features.
  • Multi-GPU Interactive Training: Interactive Training now supports multi-GPU single-node setups. Use flexai training debug-ssh --accels <1 to 8> to debug and optimize distributed workloads.

Changed

  • Sorted Checkpoint List: The training checkpoints and checkpoint list commands now sort checkpoints by creation time in ascending order (oldest first), making it easier to locate relevant checkpoints.

  • Directory Retention: Outputs fetched from training jobs or checkpoints maintain their original directory structure, simplifying file management.

  • Improved Help Text: Help messages for push commands provide clearer guidance.

  • Improved Training Flow UX: The training run output message now suggests the command to use to monitor the job's log output.

    flexai training run mistral_ft_34 ... 

    Training job mistral_ft_34.
    Use 'flexai training logs mistral_ft_34' to follow the progress of your training job.
  • Clear Logs Start: training logs now clearly indicates where the log stream starts.
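
To make the Directory Retention change above concrete, here is a minimal Python sketch of what "maintaining the original directory structure" means when files are copied from one location to another. It is an illustration only, not the CLI's implementation: the function name copy_preserving_structure and the sample file names (remote_output, checkpoints/step_100/model.bin, metrics.json) are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_preserving_structure(src_root: Path, dst_root: Path) -> list[str]:
    """Copy every file under src_root into dst_root, keeping relative paths."""
    copied = []
    for src in sorted(src_root.rglob("*")):
        if src.is_file():
            # Path of the file relative to the source root, e.g. checkpoints/step_100/model.bin
            rel = src.relative_to(src_root)
            dst = dst_root / rel
            # Recreate intermediate directories instead of flattening the tree.
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(rel.as_posix())
    return copied


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output tree standing in for a training job's outputs.
    src_root = Path(tmp) / "remote_output"
    dst_root = Path(tmp) / "fetched"
    (src_root / "checkpoints" / "step_100").mkdir(parents=True)
    (src_root / "checkpoints" / "step_100" / "model.bin").write_text("weights")
    (src_root / "metrics.json").write_text("{}")

    copied = copy_preserving_structure(src_root, dst_root)
    nested_exists = (dst_root / "checkpoints" / "step_100" / "model.bin").is_file()
    print(copied, nested_exists)
```

With a flattening copy, files that share a name in different subdirectories would overwrite each other; keeping the relative path avoids that collision.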

Fixed

  • Logs Termination UX: The log stream from training logs now terminates when the Training Job ends.
  • Relative Path Uploads: Fixed path resolution failures for dataset and checkpoint uploads that use relative paths.