2024-12-13

Highlights

  • [New feature] Interactive Training: Debug and optimize your Training Jobs faster. You can now SSH into a Training Runtime environment and interact with it in real time, giving you more control.
  • [New feature] Infrastructure Metrics Dashboard: Optimize your batch sizes and performance. Access your dashboard to gain real-time insights into GPU utilization, memory usage, temperature, and other infrastructure metrics.

Added

  • Interactive Training: The flexai training debug-ssh command lets you SSH into a Training Runtime environment, providing real-time access to its file system.

    This feature can help you debug Training Jobs more efficiently, run detailed diagnostics, and test as often as you need to confirm everything is optimized before scaling up to production Training Jobs.

    For more details, visit the Interactive Training guide.

  • Infrastructure Metrics Dashboard: Access your dashboard to gain real-time insights into GPU utilization, memory usage, temperature, and other infrastructure metrics. Use this tool to optimize batch sizes, identify potential bottlenecks, and troubleshoot performance issues. To get started, visit dashboards.flex.ai.

  • flexai training run checks: Before starting a Training Job, FCS performs these sanity checks:

    • The revision passed through the --source-revision flag exists in the repository specified through the --source-name flag.
    • A requirements.txt file exists at the root of the source's revision.
    • The requirements.txt file either:
      • Doesn't specify versions for the following packages: torch, torchaudio, and torchvision.
      • Doesn't include the torch, torchaudio, and torchvision packages.
    • The Entrypoint script path exists.
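
To catch these issues before submitting a Training Job, the file-level checks listed above can be approximated locally. The following is a minimal sketch, not FCS's actual implementation: the function names, the error messages, and the assumption that the Entrypoint path is relative to the source root are all hypothetical, and the requirements parsing is deliberately simplified (it does not cover every pip requirement syntax).

```python
import re
from pathlib import Path

# Packages named in the sanity checks above.
RESTRICTED = {"torch", "torchaudio", "torchvision"}

# Simplified requirement line: name, optional [extras], then whatever follows.
_REQ_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(\[[^\]]*\])?\s*(.*)$")


def pinned_restricted_packages(requirements_text: str) -> list[str]:
    """Return the restricted packages that carry a version specifier."""
    offenders = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = _REQ_RE.match(line)
        if not match:
            continue
        name = match.group(1).lower().replace("_", "-")
        # Ignore environment markers (the part after ";").
        spec = match.group(3).split(";", 1)[0].strip()
        if name in RESTRICTED and spec:
            offenders.append(name)
    return offenders


def preflight(source_root: str, entrypoint: str) -> list[str]:
    """Hypothetical local pre-check; returns a list of problems found."""
    root = Path(source_root)
    problems = []
    requirements = root / "requirements.txt"
    if not requirements.is_file():
        problems.append("requirements.txt not found at the source root")
    else:
        for name in pinned_restricted_packages(requirements.read_text()):
            problems.append(f"{name} must not specify a version")
    # Assumption: the entrypoint path is relative to the source root.
    if not (root / entrypoint).is_file():
        problems.append(f"entrypoint script not found: {entrypoint}")
    return problems
```

Calling `preflight(".", "train.py")` from the root of a checked-out revision returns an empty list when all three file-level checks pass. The revision check itself is left out because it depends on how the source repository is registered with FCS.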

Changed

  • Enhanced CLI help text (--help) for the flexai storage create and flexai dataset push commands, with more detail on how to use them to push Datasets from a Remote Storage Connection to FCS.