FAQ

Frequently Asked Questions about FlexAI Cloud Services (FCS).


The stack & requirements

What is the supported stack?

  • Python: 3.11
  • PyTorch: 2.4
  • CUDA: 12.4
  • Hardware available: NVIDIA GPUs

What are the requirements to get onboarded to FlexAI Cloud Services?

  • Model hosted on GitHub (privately or publicly).

Checkpoints

Does FlexAI Cloud Services offer managed Checkpoints?

Yes, FlexAI Cloud Services offers Managed Checkpoints out of the box, so you don't need to set anything up.

Every time your code calls torch.save, FlexAI Cloud Services stores its output so you can resume your Training Job from a previous point after a failure, or roll back to a specific state.
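
For example, a training loop might checkpoint like this. This is a minimal sketch: the toy model, optimizer, epoch counter, and file name are all placeholders, not FlexAI-specific requirements.

```python
import torch
import torch.nn as nn

# A toy model and optimizer stand in for your own training objects.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Each torch.save call like this one is captured as a Managed Checkpoint.
torch.save(
    {
        "epoch": 5,                                    # placeholder epoch counter
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "/output/checkpoint-epoch-5.pt",                   # hypothetical file name
)
```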

How do I download a Managed Checkpoint?

Managed Checkpoints can be downloaded even while the Training Job is still running. You can fetch a checkpoint by using the flexai checkpoint fetch <checkpoint_name_or_id> command.

What is the maximum size Managed Checkpoints currently support?

Managed Checkpoints work with checkpoints of up to 5 GB in size. If your checkpoints are larger than 5 GB, the Managed Checkpoints feature will not be enabled.

However, you can still write checkpoints to the /output/ directory without a problem; you'll just have to wait until the Training Job completes before you can fetch the contents of the /output/ directory.

Do Managed Checkpoints work when running a Training Job across multiple nodes?

Yes, you can use Managed Checkpoints when running a Training Job across multiple nodes. You should make sure that only one process (typically global rank 0) calls torch.save, to avoid conflicting writes.
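
In a torch.distributed setup, a common way to do this is to guard the torch.save call by rank. A sketch, assuming torch.distributed is already initialized by your launcher; the model and path are placeholders:

```python
import torch
import torch.distributed as dist

def save_checkpoint_rank0(model, path="/output/checkpoint.pt"):
    # Only the global rank 0 process writes the checkpoint, so the
    # nodes don't race to write the same file.
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    # Keep all ranks in step so no process runs ahead while rank 0
    # is still writing.
    if dist.is_initialized():
        dist.barrier()
```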

How do I resume a Training Job?

You can resume a Training Job from a Checkpoint by creating a new Training Job using flexai training run with the --checkpoint flag and passing the name or ID of the Checkpoint you want to resume from.
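
On the training-script side, resuming only works if your code actually restores the saved state. A generic PyTorch pattern, sketched here with placeholders (the exact path where the resumed checkpoint is made available to your script is not covered in this FAQ; in practice it would typically come from a command-line argument):

```python
import torch
import torch.nn as nn

# Placeholders: replace with your own model, optimizer, and the path
# where the resumed checkpoint is available to your script.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
checkpoint_path = "/output/checkpoint-epoch-5.pt"

checkpoint = torch.load(checkpoint_path)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1   # resume from the next epoch
```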

Datasets

What is the maximum size of a Dataset that I can use with FlexAI?

We don't set a limit on the size of the Dataset you can use with FlexAI.

Dataset and output Artifact management

  • Datasets will be made available to your training scripts in the /input/ directory.
    • Dataset names are used as the name of the directory in which the Dataset is stored:
      • For example, if you have a Dataset called my_dataset, it will be available in the /input/my_dataset/ directory.
    • The /input/ directory and its contents are read-only.
  • Your training scripts should write any output artifacts or the results of data processing operations to the /output/ directory.
    • Data written to any other directory will be erased after the Training Job is completed (successfully or otherwise).
    • Once a Training Job is completed, the contents of the /output/ directory will be made available for download by using the flexai training fetch <training_job_name> command. A sketch of how a script typically uses these directories follows this list.
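
For example, a training script might read a Dataset named my_dataset and write its results like this. A minimal sketch; the file type and the processing step are placeholders:

```python
from pathlib import Path

# Datasets are mounted read-only under /input/<dataset_name>/.
dataset_dir = Path("/input/my_dataset")

# Anything worth keeping must be written to /output/; files written
# anywhere else are erased when the Training Job completes.
output_dir = Path("/output")

for file in sorted(dataset_dir.glob("*.txt")):   # placeholder file type
    processed = file.read_text().upper()         # stand-in for real processing
    (output_dir / file.name).write_text(processed)
```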

What file types are supported for Datasets?

We have no restrictions on the file types that can be uploaded to FlexAI Cloud Services. You can upload any file type your training job requires: images, text, audio or video files, tokenized data, or anything else.

Observability & Monitoring

How can I monitor Infrastructure Metrics for my training jobs?

You can use the FlexAI Infrastructure Monitor to monitor the performance of your training jobs. It provides real-time system and GPU metrics that help you optimize your training scripts and take full advantage of FlexAI Cloud Services compute resources. The data retention period is 30 days.

The FlexAI platform

How are my code dependencies managed?

FlexAI's training runtime handles the installation of dependencies via a requirements.txt file that must be located at the root of the code repository.

How do I specify the version of PyTorch, TensorFlow, or other libraries?

You should not include references to torch, torchvision, or torchaudio in the requirements.txt file. FlexAI will install the correct versions of these libraries for you.
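
For example, a requirements.txt might look like the following. The packages and version pins are purely illustrative placeholders; the point is that torch, torchvision, and torchaudio are deliberately absent:

```
transformers==4.44.0
datasets==2.20.0
accelerate==0.33.0
```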

Is the FlexAI user interface CLI or GUI-based?

Today, our interface is CLI-based. This gives developers faster system interaction, granular control, scripting for repetitive tasks, and easier integration with automated workflows. Soon, we will offer a web-based interface as well.

Where is the CLI executed?

The FlexAI CLI is executed on your host machine (non-graphical environments, such as GitHub Codespaces or a remote shell, require a few additional steps) and communicates with FlexAI Cloud Services via the internet.

Is it possible to access more compute/GPU capacity?

Yes, additional compute/GPU capacity can be made available. Please reach out to us to discuss your needs.

Does FlexAI Cloud Services offer shared or dedicated environments?

We have a common pool of compute resources with strict isolation between user environments. If you're interested in getting a dedicated environment, please let us know.

FlexAI and 3rd party Compute Providers

Can I use FlexAI on my current infrastructure?

Yes, we can support this through our Compute Provider partnerships. Please reach out to us to learn more.

As a Compute Provider, can I offer FlexAI as a solution to my users by leveraging my compute power?

Yes, FlexAI can be offered by Compute Providers to their users. Please reach out to us for a deep dive.