Frequently Asked Questions: FlexAI

The FlexAI Runtime environment currently provides:

  • Python: 3.11
  • PyTorch: 2.4
  • CUDA: 12.4
  • Hardware available: NVIDIA

FlexAI supports both private and public GitHub repositories.

Does FlexAI support checkpointing?

Yes, FlexAI offers Managed Checkpoints out of the box, so you don’t need to set anything up.

Every time your code calls torch.save, FlexAI Cloud Training Services stores its output so you can resume your Training Job from a previous point in case of a failure, or if you need to go back to a specific state.

Checkpoints should be saved to the /output-checkpoint directory, which is automatically mounted to your Training Job’s container. You can refer to it by using the FLEXAI_OUTPUT_CHECKPOINT_DIR environment variable.
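A minimal sketch of what this can look like in a training loop (the model, optimizer, and file name below are illustrative; only the directory and environment variable come from FlexAI):

```python
import os

import torch

# Stand-in model and optimizer for illustration.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epoch = 3

# FLEXAI_OUTPUT_CHECKPOINT_DIR points at /output-checkpoint inside the
# Training Job’s container.
ckpt_dir = os.environ.get("FLEXAI_OUTPUT_CHECKPOINT_DIR", "/output-checkpoint")

torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    os.path.join(ckpt_dir, f"checkpoint_epoch_{epoch}.pt"),
)
```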

Managed Checkpoints can be downloaded even while the Training Job is still running. You can fetch a checkpoint with the flexai checkpoint fetch <checkpoint_name_or_id> command.

What is the maximum size Managed Checkpoints currently support?

Managed Checkpoints will work with checkpoints of up to 5GB in size. If your checkpoints are larger than 5GB, the Managed Checkpoints feature will not be enabled.

However, you can continue to write checkpoints to the /output/ directory (also available through the FLEXAI_TRAINING_OUTPUT_PATH environment variable) without a problem; you’ll just have to wait until the Training Job completes before you can fetch the contents of the /output/ directory with the flexai training fetch <training_job_name> command.
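As a sketch, writing an oversized checkpoint to /output/ instead of /output-checkpoint only changes the target directory (the model here is a stand-in):

```python
import os

import torch

model = torch.nn.Linear(10, 2)  # stand-in for your real model

# FLEXAI_TRAINING_OUTPUT_PATH points at /output inside the container.
# Checkpoints written here are not managed, but they can be downloaded
# with `flexai training fetch` once the Training Job completes.
output_dir = os.environ.get("FLEXAI_TRAINING_OUTPUT_PATH", "/output")
torch.save(model.state_dict(), os.path.join(output_dir, "large_checkpoint.pt"))
```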

Do Managed Checkpoints work when running a Training Job across multiple nodes?

Yes, you can use Managed Checkpoints when running a Training Job across multiple nodes. You should make sure only one node is in charge of calling torch.save to avoid conflicts.
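A minimal sketch of such a guard using torch.distributed (assuming the process group is already initialized by your launcher; the model is a stand-in):

```python
import os

import torch
import torch.distributed as dist

model = torch.nn.Linear(10, 2)  # stand-in for your real model
ckpt_dir = os.environ.get("FLEXAI_OUTPUT_CHECKPOINT_DIR", "/output-checkpoint")

# Only the global rank-0 process writes the checkpoint, so a single
# node produces the file and the others never compete for it.
if not dist.is_initialized() or dist.get_rank() == 0:
    torch.save(model.state_dict(), os.path.join(ckpt_dir, "checkpoint.pt"))

# Optionally make every rank wait until the checkpoint is on disk
# before moving on to the next step.
if dist.is_initialized():
    dist.barrier()
```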

How do I resume a Training Job from a Checkpoint?

You can resume a Training Job from a Checkpoint by creating a new Training Job using flexai training run with the --checkpoint flag, passing the name or ID of the Checkpoint you want to resume from. Your code should look for the checkpoint file in the /input-checkpoint/ directory, which is also available through the FLEXAI_INPUT_CHECKPOINT_DIR environment variable.
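A sketch of the resume path on the code side (the file name and state-dict keys match the hypothetical save example above):

```python
import os

import torch

model = torch.nn.Linear(10, 2)  # stand-in for your real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# When a Training Job is started with --checkpoint, the chosen
# checkpoint is mounted under /input-checkpoint/.
resume_dir = os.environ.get("FLEXAI_INPUT_CHECKPOINT_DIR", "/input-checkpoint")
ckpt_path = os.path.join(resume_dir, "checkpoint_epoch_3.pt")  # hypothetical name

if os.path.exists(ckpt_path):
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    start_epoch = state["epoch"] + 1
else:
    start_epoch = 0  # no checkpoint mounted: start from scratch
```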

What is the maximum size of a Dataset that I can use with FlexAI?

We don’t set a limit on the size of the Dataset you can use with FlexAI.

  • Your Dataset will be made available to your training scripts in the /input/ directory, which can also be referenced using the FLEXAI_TRAINING_INPUT_PATH environment variable (see the sketch after this list).
    • Dataset names are used as the name of the directory in which the Dataset is stored:
      • For example, if you have a Dataset called my_dataset, it will be available in the /input/my_dataset/ directory.
    • The /input/ directory and its contents are read-only.
  • Your training scripts should write any output artifacts or the results of data processing operations to the /output/ directory.
    • Data written to any other directory will be erased after the Training Job is completed (successfully or otherwise).
    • Once a Training Job is completed, the contents of the /output/ directory will be made available for download by using the flexai training fetch <training_job_name> command.
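
A short sketch of these conventions (the Dataset name my_dataset and the output file name are illustrative):

```python
import os

# /input and /output inside the Training Job’s container.
input_dir = os.environ.get("FLEXAI_TRAINING_INPUT_PATH", "/input")
output_dir = os.environ.get("FLEXAI_TRAINING_OUTPUT_PATH", "/output")

# A Dataset is mounted under a directory named after it; the mount
# is read-only.
dataset_dir = os.path.join(input_dir, "my_dataset")
files = sorted(os.listdir(dataset_dir))

# Anything worth keeping after the job ends must go under /output/.
with open(os.path.join(output_dir, "manifest.txt"), "w") as f:
    f.write("\n".join(files))
```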

What file types are supported for Datasets?

We place no restrictions on the file types that can be uploaded to FlexAI. You can upload whatever your Training Job requires: images, text, audio or video files, tokenized data, or any other format.

How can I monitor Infrastructure Metrics for my Training and Fine-tuning Jobs?

You can use the FlexAI Infrastructure Monitor to monitor the performance of your Training and Fine-tuning Jobs. It provides real-time system and GPU metrics that will enable you to optimize your training scripts so you can take full advantage of FlexAI compute resources. The data retention period is 30 days.

How can I monitor my Training and Fine-tuning Jobs?

You can use the FlexAI TensorBoard to monitor your Training and Fine-tuning Jobs. It provides real-time visualizations of your training metrics, such as loss and accuracy, and allows you to compare different runs. The data retention period is 30 days.

How are dependencies installed?

FlexAI’s training runtime handles the installation of dependencies via a requirements.txt file. The FlexAI Runtime environment looks for it in the root of your code repository and installs the dependencies listed there. You can also specify a custom location for the requirements.txt file using the -q/--requirements-path flag when running a Training or Fine-tuning Job.

How do I specify the version of my PyTorch-associated dependencies?

You should not include references to torch, torchvision, or torchaudio in your requirements.txt file. FlexAI will install the correct versions of these libraries based on the Runtime environment.
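For illustration, a requirements.txt for such a job might look like the following (the packages and version pins are hypothetical; the point is what is absent):

```
transformers==4.44.0
datasets==2.20.0
accelerate==0.33.0
# torch, torchvision, and torchaudio are intentionally omitted:
# FlexAI installs the versions that match its Runtime environment.
```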

Which types of clients can I use to interact with FlexAI?

The FlexAI CLI is executed on your host machine (non-graphical environments, such as GitHub Codespaces or a remote shell, require a few additional steps) and communicates with FlexAI via the internet.

Is it possible to access more compute/GPU capacity?

Yes, additional capacity can be made available. Please reach out to us to discuss your needs.

Does FlexAI offer shared or dedicated environments?

We have a common pool of compute resources with strict isolation between user environments. If you’re interested in getting a dedicated environment, please let us know.

Can I use FlexAI on my current infrastructure?

Yes, we can support this through our Compute Provider partnerships. Please reach out to us for more details.

As a Compute Provider, can I offer FlexAI as a solution to my users by leveraging my compute power?

Yes, FlexAI can be offered by Compute Providers to their users. Please reach out to us for a deep dive.