Changelog: 2024-12-13
Highlights
- [New feature] Interactive Training: Debug and optimize your Training Jobs faster. You can now SSH into a Training Runtime environment and interact with it in real time, giving you more control.
- [New feature] Infrastructure Metrics Dashboard: Optimize your batch sizes and performance. Access your dashboard to gain real-time insights into GPU utilization, memory usage, temperature, and other infrastructure metrics.
- Interactive Training: The `flexai training debug-ssh` command lets you SSH into a Training Runtime environment, providing real-time access to its file system. This feature can help you debug Training Jobs more efficiently, run detailed diagnostics, and run as many tests as you need to confirm everything works before scaling up to large production Training Jobs.
  For more details, visit the Interactive Training guide.
- Infrastructure Metrics Dashboard: Access your dashboard to gain real-time insights into GPU utilization, memory usage, temperature, and other infrastructure metrics. Use this tool to optimize batch sizes, identify potential bottlenecks, and troubleshoot performance issues. Visit dashboards.flex.ai
- `flexai training run` checks: Before starting a Training Job, FCS performs these sanity checks:
  - The `--source-revision` exists in the repository specified through the `--source-name` flag.
  - A `requirements.txt` file exists at the root of the source's revision.
  - The `requirements.txt` file either:
    - Doesn't specify versions for the `torch`, `torchaudio`, and `torchvision` packages.
    - Doesn't include the `torch`, `torchaudio`, and `torchvision` packages.
  - The entry point script path exists.
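To see locally whether a `requirements.txt` is likely to pass the version check listed above, something like the following sketch may help. It is not FCS's implementation: the function name `pinned_runtime_packages` and the parsing rules are assumptions made for illustration, and the parser handles only simple requirement lines (a package name, optional extras, an optional version specifier or environment marker).

```python
import re

# Packages named in the requirements.txt check above.
RUNTIME_PACKAGES = {"torch", "torchaudio", "torchvision"}

# Leading package name of a simple requirement line.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def pinned_runtime_packages(requirements_text: str) -> list[str]:
    """Return the listed runtime packages that carry a version constraint.

    Hypothetical helper: an empty result means the file either omits
    torch, torchaudio, and torchvision or lists them without versions.
    """
    pinned = []
    for raw_line in requirements_text.splitlines():
        # Naive comment stripping; good enough for plain requirement lines.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        match = _NAME_RE.match(line)
        if not match:
            continue
        name = match.group(1).lower()
        rest = line[match.end():]
        # Drop extras such as "[cuda]" and any environment marker after ";".
        rest = re.sub(r"^\s*\[[^\]]*\]", "", rest)
        rest = rest.split(";", 1)[0].strip()
        # Anything left over is a version specifier or direct reference.
        if name in RUNTIME_PACKAGES and rest:
            pinned.append(name)
    return pinned
```

For example, `pinned_runtime_packages(open("requirements.txt").read())` returns the names to unpin or remove before submitting a Training Job; an empty list suggests the file satisfies this particular check.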
Changed
- Enhanced CLI help text (`--help`) for the `flexai storage create` and `flexai dataset push` commands to provide more detailed information on how to use them to push Datasets from a Remote Storage Connection to FCS.