2024-12-13
Highlights
- [New feature] Interactive Training: Debug and optimize your Training Jobs faster. You can now SSH into a Training Runtime environment and interact with it in real time, giving you more control.
- [New feature] Infrastructure Metrics Dashboard: Optimize your batch sizes and performance. Access your dashboard to gain real-time insights into GPU utilization, memory usage, temperature, and other infrastructure metrics.
Added
- Interactive Training: The flexai training debug-ssh command lets you SSH into a Training Runtime environment, providing real-time access to its file system. This feature can help you debug Training Jobs more efficiently, run detailed diagnostics, and run as many tests as you need to confirm everything is optimized before scaling up to large-scale production Training Jobs.
  For more details, visit the Interactive Training guide.
- Infrastructure Metrics Dashboard: Access your dashboard to gain real-time insights into GPU utilization, memory usage, temperature, and other infrastructure metrics. Use this tool to optimize batch sizes, identify potential bottlenecks, and troubleshoot performance issues. Visit dashboards.flex.ai.
- flexai training run checks: Before starting a Training Job, FCS performs these sanity checks:
  - The --source-revision in the repository specified through the --source-name flag exists.
  - A requirements.txt file exists at the root of the source's revision.
  - The requirements.txt file either:
    - Doesn't specify versions for the following packages: torch, torchaudio, and torchvision.
    - Doesn't include the torch, torchaudio, and torchvision packages.
  - The Entrypoint script path exists.
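The requirements.txt rule above can be approximated locally before submitting a Training Job. The following is a minimal sketch, not FCS's actual implementation: the function names and the parsing approach are assumptions made for illustration, and it treats anything that follows the package name (other than an environment marker) as a version constraint.

```python
import re
from pathlib import Path

# Packages named in the requirements.txt check described above.
RUNTIME_PACKAGES = {"torch", "torchaudio", "torchvision"}

# Hypothetical pattern: package name, optional extras, then the remainder of the line.
_REQ_LINE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(\[[^\]]*\])?\s*(.*)$")


def pinned_runtime_packages(requirements_text: str) -> list[str]:
    """Return the runtime packages that are listed with a version constraint."""
    pinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blank lines and pip options
            continue
        match = _REQ_LINE.match(line)
        if not match:
            continue
        name = match.group(1).lower().replace("_", "-")
        rest = match.group(3).split(";", 1)[0].strip()  # ignore environment markers
        if name in RUNTIME_PACKAGES and rest:
            pinned.append(name)
    return pinned


def check_requirements(path: str = "requirements.txt") -> list[str]:
    """Hypothetical local pre-flight check: the file must exist; returns offending packages."""
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"{path} not found at the root of the source")
    return pinned_runtime_packages(file.read_text())
```

An empty result means the file either omits torch, torchaudio, and torchvision or lists them without versions, which matches both branches of the rule.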
Changed
- Enhanced CLI help text (--help) for the flexai storage create and flexai dataset push commands to provide more detailed information on how to use them to push Datasets from a Remote Storage Connection to FCS.