Frequently Asked Questions: FlexAI
The stack & requirements
What is the supported stack?
- Python: 3.11
- PyTorch: 2.4
- CUDA: 12.4
- Hardware: NVIDIA GPUs
Which Git providers are supported?
FlexAI supports both private and public GitHub repositories.
Checkpoints
Does FlexAI offer Managed Checkpoints?
Yes, FlexAI offers Managed Checkpoints out of the box, so you don't need to set anything up. Every time your code calls `torch.save`, FlexAI Cloud Training Services stores its output so you can resume your Training Job from a previous point in case of a failure, or if you need to go back to a specific state.
Checkpoints should be saved to the `/output-checkpoint` directory, which is automatically mounted to your Training Job's container. You can refer to it by using the `FLEXAI_OUTPUT_CHECKPOINT_DIR` environment variable.
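As an illustration, a minimal sketch of a training loop's save step (the model, optimizer, and file name below are placeholders, not FlexAI requirements):

```python
import os

import torch

# Placeholder model and optimizer standing in for your real training objects.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epoch = 1

# FLEXAI_OUTPUT_CHECKPOINT_DIR points at /output-checkpoint inside the job container.
ckpt_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": epoch},
    os.path.join(ckpt_dir, f"checkpoint-epoch-{epoch}.pt"),
)
```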
How do I download a Managed Checkpoint?
Managed Checkpoints can be downloaded even if the Training Job is still running. You can fetch a checkpoint by using the `flexai checkpoint fetch <checkpoint_name_or_id>` command.
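For example, with a hypothetical checkpoint named `run-42-epoch-3`, you would run `flexai checkpoint fetch run-42-epoch-3`.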
What is the maximum size Managed Checkpoints currently support?
Managed Checkpoints work with checkpoints of up to 5 GB in size. If your checkpoints are larger than 5 GB, the Managed Checkpoints feature will not be enabled.
However, you can still write checkpoints to the `/output/` directory (also available through the `FLEXAI_TRAINING_OUTPUT_PATH` environment variable) without a problem; you'll just have to wait until the Training Job completes before fetching the contents of the `/output/` directory with the `flexai training fetch <training_job_name>` command.
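For example, once a hypothetical Training Job named `my-training-job` has completed, `flexai training fetch my-training-job` downloads everything your script wrote to `/output/`.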
Do Managed Checkpoints work when running a Training Job across multiple nodes?
Section titled “Do Managed Checkpoints work when running a Training Job across multiple nodes?”Yes, you can use Managed Checkpoints when running a Training Job across multiple nodes. You should make sure only one node is in charge of calling torch.save
to avoid conflicts.
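A minimal sketch, assuming `torch.distributed` has already been initialized by your launcher (the model is a placeholder):

```python
import os

import torch
import torch.distributed as dist

model = torch.nn.Linear(4, 2)  # placeholder for your real (DDP-wrapped) model

# Only rank 0 writes the checkpoint so nodes don't clobber each other's files.
if dist.get_rank() == 0:
    ckpt_dir = os.environ["FLEXAI_OUTPUT_CHECKPOINT_DIR"]
    torch.save(model.state_dict(), os.path.join(ckpt_dir, "checkpoint.pt"))

dist.barrier()  # keep the other ranks in sync with the save
```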
How do I resume a Training Job?
You can resume a Training Job from a Checkpoint by creating a new Training Job using `flexai training run` with the `--checkpoint` flag, passing the name or ID of the Checkpoint you want to resume from. Your code should look for the checkpoint file in the `/input-checkpoint/` directory, which is also available through the `FLEXAI_INPUT_CHECKPOINT_DIR` environment variable.
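A minimal sketch of the load side, assuming the checkpoint was saved as a dictionary like the one in the save example above (the file name is a placeholder):

```python
import os

import torch

model = torch.nn.Linear(4, 2)  # placeholder for your real model
start_epoch = 0

# FLEXAI_INPUT_CHECKPOINT_DIR points at /input-checkpoint/ when the job was
# started with --checkpoint; the file name matches whatever your code saved.
ckpt_path = os.path.join(os.environ["FLEXAI_INPUT_CHECKPOINT_DIR"], "checkpoint-epoch-1.pt")
if os.path.exists(ckpt_path):
    state = torch.load(ckpt_path)
    model.load_state_dict(state["model"])
    start_epoch = state["epoch"] + 1
```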
Datasets
Section titled “Datasets”What is the maximum size of a Dataset that I can use with FlexAI?
We don't set a limit on the size of the Dataset you can use with FlexAI.
Dataset and output Artifact management
- Datasets will be made available to your training scripts in the `/input/` directory, which can also be referenced using the `FLEXAI_TRAINING_INPUT_PATH` environment variable.
  - Dataset names are used as the name of the directory in which the Dataset is stored. For example, a Dataset called `my_dataset` will be available in the `/input/my_dataset/` directory.
  - The `/input/` directory and its contents are read-only.
- Your training scripts should write any output artifacts or the results of data processing operations to the `/output/` directory.
  - Data written to any other directory will be erased after the Training Job is completed (successfully or otherwise).
- Once a Training Job is completed, the contents of the `/output/` directory will be made available for download using the `flexai training fetch <training_job_name>` command (see the sketch after this list).
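A minimal sketch of this layout, assuming a Dataset with the placeholder name `my_dataset` attached to the job:

```python
import os
from pathlib import Path

# /input/ is read-only; each Dataset lives in a subdirectory named after it.
input_dir = Path(os.environ["FLEXAI_TRAINING_INPUT_PATH"]) / "my_dataset"
# /output/ is the only directory whose contents survive the job.
output_dir = Path(os.environ["FLEXAI_TRAINING_OUTPUT_PATH"])

# Illustrative processing step: tally the dataset size and record it as an artifact.
total_bytes = sum(f.stat().st_size for f in input_dir.rglob("*") if f.is_file())
(output_dir / "dataset_size.txt").write_text(f"{total_bytes}\n")
```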
What file types are supported for Datasets?
We have no restrictions on the file types that can be uploaded to FlexAI. You can upload any file type that is required for your Training Job, whether it be images, text, audio, or video files, tokenized data, or any other file type.
Observability & Monitoring
Section titled “Observability & Monitoring”How can I monitor Infrastructure Metrics for my Training and Fine-tuning Jobs?
You can use the FlexAI Infrastructure Monitor to monitor the performance of your Training and Fine-tuning Jobs. It provides real-time system and GPU metrics that help you optimize your training scripts so you can take full advantage of FlexAI compute resources. The data retention period is 30 days.
How can I monitor my Training and Fine-tuning Jobs?
You can use the FlexAI TensorBoard to monitor your Training and Fine-tuning Jobs. It provides real-time visualizations of your training metrics, such as loss and accuracy, and allows you to compare different runs. The data retention period is 30 days.
The FlexAI platform
FlexAI's training runtime handles the installation of dependencies via a `requirements.txt` file. The FlexAI Runtime environment will look for it in the root of your code repository and install the dependencies listed there. You can also specify a custom location for the `requirements.txt` file using the `-q`/`--requirements-path` flag when running a Training or Fine-tuning Job.
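For example, if your requirements file lives at a hypothetical path like `configs/requirements.txt`, you would pass `--requirements-path configs/requirements.txt` when starting the job.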
How do I specify the version of my PyTorch-associated dependencies?
You should not include references to `torch`, `torchvision`, or `torchaudio` in the `requirements.txt` file. FlexAI will install the correct version of these libraries based on the Runtime environment.
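As an illustration, a `requirements.txt` for a FlexAI job would list only your non-PyTorch dependencies (the packages and versions below are arbitrary examples):

```
transformers==4.44.0
datasets==2.20.0
numpy
```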
Which types of clients can I use to interact with FlexAI?
- FlexAI Console: a web-based interface.
- FlexAI CLI: available for Linux, macOS, and Windows (through WSL2).
Where is the CLI executed?
The FlexAI CLI is executed on your host machine (non-graphical environments, such as GitHub Codespaces or a remote shell, require a few additional steps) and communicates with FlexAI via the internet.
Is it possible to access more compute/GPU capacity?
Yes, it is possible to have more capacity available. Please reach out to us to discuss your needs.
Does FlexAI offer shared or dedicated environments?
We have a common pool of compute resources with strict isolation between user environments. If you're interested in getting a dedicated environment, please let us know.
FlexAI and 3rd party Compute Providers
Section titled “FlexAI and 3rd party Compute Providers”Can I use FlexAI on my current infrastructure?
Yes, we can support this through our Compute Provider partnerships. Please reach out to us for more details.
As a Compute Provider, can I offer FlexAI as a solution to my users by leveraging my compute power?
Yes, FlexAI can be offered by Compute Providers to their users. Please reach out to us for a deep dive.