Train AI models from scratch or continue training existing models using FlexAI’s powerful training infrastructure.
You can train public and private models hosted on the Hugging Face Model Hub or models you’ve previously fine-tuned with FlexAI.
Distributed Training
Scale across multiple GPUs and nodes
Checkpoint Management
Automatic saving and resuming
Interactive Sessions
Debug and experiment interactively
Resource Monitoring
Track GPU usage and performance
A Training Job in FlexAI consists of the following components:
The training script is a Python script that defines the training process. It handles tasks such as:
- Loading the model or an earlier Checkpoint from the /input-checkpoint/ path, or downloading it from an external source.
- Loading the Dataset from the /input/ path or from an external source.
- Saving model Checkpoints to the /output-checkpoints/ path.
A Training Job generally leverages a Dataset to adjust a model’s weights using new data.
The Dataset can be in any format, and it can contain any kind of files. The Training Script is responsible for loading and processing the data as needed.
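As a rough illustration of what "loading the data as needed" can look like, the sketch below walks a Dataset directory and returns its files, leaving the actual parsing to the Training Script. It assumes the Dataset is available under the /input/ path mentioned above; the helper name `list_dataset_files` and the suffix filter are hypothetical and not part of FlexAI.

```python
from pathlib import Path


def list_dataset_files(input_dir="/input", suffixes=None):
    """Return every file under input_dir (recursively), sorted by path.

    `suffixes` is an optional iterable such as [".jsonl", ".csv"]; when given,
    only files with one of those extensions are returned (case-insensitive).
    """
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {root}")

    # rglob("*") also yields directories, so keep regular files only.
    files = sorted(p for p in root.rglob("*") if p.is_file())

    if suffixes is not None:
        wanted = {s.lower() for s in suffixes}
        files = [p for p in files if p.suffix.lower() in wanted]
    return files
```

A Training Script might call `list_dataset_files(suffixes=[".jsonl"])` once at startup and hand the resulting paths to whatever loader fits the Dataset's format.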
Secrets are used to securely pass sensitive information to the Training Job’s runtime environment. This can include API keys, passwords, or any other confidential data required for the training process.
Secrets are managed through the FlexAI Secret Manager and are injected into the Training Job’s environment as environment variables.
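Because Secrets arrive as environment variables, a Training Script can read them with the Python standard library. The following is a minimal sketch; the helper name `require_secret` and the variable name `HF_TOKEN` used in the usage note are hypothetical, and the real variable name depends on how the Secret is attached to the Training Job.

```python
import os


def require_secret(name):
    """Read a secret from the environment and fail early if it is missing.

    Raising at startup avoids discovering a missing credential
    partway through a long training run.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing required secret: environment variable {name!r} is not set"
        )
    return value
```

For example, `token = require_secret("HF_TOKEN")` near the top of the script. Avoid printing or logging the returned value.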
Hyperparameters are configuration settings that control the behavior of the Training Script. They can include parameters such as learning rate, batch size, number of epochs, where to load the dataset from, where to save the model checkpoints, how often to save checkpoints, and so on.
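One common way for a Python Training Script to receive such settings is as command-line arguments. The sketch below assumes that convention; the flag names and default values are illustrative only and are not FlexAI requirements.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse illustrative hyperparameters; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Illustrative training hyperparameters")
    parser.add_argument("--learning-rate", type=float, default=3e-4)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=1)
    # Where to load the dataset from and where to save checkpoints.
    parser.add_argument("--dataset-dir", default="/input")
    parser.add_argument("--checkpoint-dir", default="/output-checkpoints")
    # How often to save a checkpoint, in training steps.
    parser.add_argument("--save-every", type=int, default=500)
    return parser.parse_args(argv)
```

argparse converts dashes to underscores, so the values are read as `args.learning_rate`, `args.batch_size`, and so on.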
FlexAI automatically manages Checkpoints during training, allowing you to resume, analyze, fetch, export or deploy model Checkpoints as Inference Endpoints.
Each time your code calls torch.save() to write a model Checkpoint to the /output-checkpoints/ directory, the FlexAI Checkpoint Manager is triggered and takes over management of the saved Checkpoint, making it available for later use.
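A hedged sketch of that save step follows. The helper name `save_checkpoint`, the `checkpoint-<step>.pt` file naming, and the `save_fn` parameter are assumptions made for illustration; only the /output-checkpoints/ directory and the use of torch.save() come from the description above.

```python
from pathlib import Path


def save_checkpoint(state, step, output_dir="/output-checkpoints", save_fn=None):
    """Write `state` to <output_dir>/checkpoint-<step>.pt and return the path.

    `save_fn(obj, path)` defaults to torch.save; it is a parameter only so the
    helper can be exercised without PyTorch installed.
    """
    if save_fn is None:
        import torch  # imported lazily; needed only for the default save function

        save_fn = torch.save

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Zero-padded step numbers keep checkpoint files in order when sorted by name.
    path = out / f"checkpoint-{step:06d}.pt"
    save_fn(state, str(path))
    return path
```

Inside a training loop this might be called as `save_checkpoint(model.state_dict(), step=global_step)`, where `model` and `global_step` are whatever the Training Script defines.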
Learn more about the different statuses a Training Job can have in the Training Job Lifecycle page.
Your FlexAI Training Jobs can interact with third-party APIs and services. Store sensitive information such as API keys and tokens in the FlexAI Secret Manager; these Secrets are then injected into the Training Job’s runtime environment as environment variables, where your Training Script can read them without hard-coding credentials.
The flexai training command manages Training Jobs: you can start new jobs, monitor their progress, inspect details, fetch output artifacts, and manage their lifecycle from the command line.
FlexAI Training Jobs can be launched in a few steps. The FlexAI Training Quickstart guide walks you through preparing your data and starting your first Training Job; see its overview for more details.