Key Features
- Distributed Training: Scale across multiple GPUs and nodes
- Checkpoint Management: Automatic saving and resuming
- Interactive Sessions: Debug and experiment interactively
- Resource Monitoring: Track GPU usage and performance
Anatomy of a Training Job
A Training Job in FlexAI consists of the following components:
- Training Script: The script that defines the training process, including data loading, model training, evaluation, Checkpoint generation, and so on.
- Dataset: The Dataset you will use to train your model. It can be an existing FlexAI Dataset, or it can be pulled/streamed from an external source during runtime.
- Secrets: Any sensitive information (e.g., API keys, passwords) required for the training process.
- Hyperparameters: Configuration settings that control the training process (e.g., learning rate, batch size).
Training Script
The training script is a Python script that defines the training process. It handles tasks such as:
- Loading the base model Checkpoint from the /input-checkpoint/ path, or downloading it from an external source.
- Loading a Dataset mounted through the FlexAI Dataset Manager on the /input/ path, or from an external source.
- Performing any necessary data processing steps.
- Reading the values of any Secrets passed to the Training Job’s runtime environment.
- Applying any hyperparameter configurations passed to the Training Script.
- Running the training loop and periodically saving Checkpoints to the /output-checkpoints/ path.
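The tasks above can be sketched as a minimal, framework-agnostic Training Script skeleton. This is an illustration only, using the standard library: the hyperparameter names (`--learning-rate`, `--batch-size`, etc.) are examples rather than a fixed FlexAI contract, and the model/optimizer setup is left to your ML framework.

```python
# Minimal sketch of a Training Script's structure (stdlib only; the actual
# model code depends on your framework). Paths match the mounts described above.
import argparse
from pathlib import Path

INPUT_CHECKPOINT_DIR = Path("/input-checkpoint")   # base model Checkpoint mount
DATASET_DIR = Path("/input")                       # FlexAI Dataset mount
OUTPUT_CHECKPOINT_DIR = Path("/output-checkpoints")  # watched by the Checkpoint Manager

def parse_hyperparameters(argv=None):
    """Hyperparameters arrive as ordinary CLI arguments; the names are illustrative."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning-rate", type=float, default=3e-4)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--save-every", type=int, default=1,
                        help="Save a Checkpoint every N epochs.")
    return parser.parse_args(argv)

def train(args):
    """Skeleton training loop; model/optimizer setup is framework-specific."""
    for epoch in range(args.epochs):
        # ... iterate over batches read from DATASET_DIR, forward/backward pass ...
        if epoch % args.save_every == 0:
            # In a PyTorch script this is where torch.save(state, OUTPUT_CHECKPOINT_DIR / ...)
            # would run, which the FlexAI Checkpoint Manager picks up automatically.
            pass

args = parse_hyperparameters(["--learning-rate", "1e-3", "--epochs", "2"])
train(args)
```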
Dataset
A Training Job generally leverages a Dataset to adjust a model’s weights using new data. The Dataset can be:
- An existing FlexAI Dataset.
- Pulled/streamed from an external source during runtime.
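Both options can be handled in one place in the Training Script. The sketch below (an assumption, not a FlexAI API) reads a mounted Dataset from /input/ when it exists and otherwise falls back to streaming from an external source; the `.jsonl` layout and the placeholder URL are illustrative.

```python
# Sketch: read a mounted FlexAI Dataset, or stream from an external source.
# The file layout (*.jsonl) and fallback URL are placeholders, not a FlexAI contract.
from pathlib import Path
from urllib.request import urlopen

def iter_dataset(mount_point="/input"):
    """Yield raw records from the mounted Dataset if present,
    otherwise stream them from an external source at runtime."""
    mount = Path(mount_point)
    if mount.is_dir():
        for file in sorted(mount.rglob("*.jsonl")):  # layout is dataset-specific
            with file.open() as fh:
                yield from fh
    else:
        # Placeholder external source; replace with your own storage/streaming client.
        with urlopen("https://example.com/dataset.jsonl") as resp:
            yield from (line.decode() for line in resp)
```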
Secrets
Secrets are used to securely pass sensitive information to the Training Job’s runtime environment. This can include API keys, passwords, or any other confidential data required for the training process. Secrets are managed through the FlexAI Secret Manager and are injected into the Training Job’s environment as environment variables.
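Since Secrets arrive as environment variables, reading one is a plain `os.environ` lookup. A small sketch, with a fail-loud wrapper; the variable name `HF_TOKEN` is only an example of a Secret you might attach:

```python
# Sketch: Secrets injected by the FlexAI Secret Manager are read from the
# environment. The name "HF_TOKEN" below is an example, not a fixed name.
import os

def get_secret(name):
    """Read an injected Secret, failing loudly if it was not attached to the job."""
    try:
        return os.environ[name]
    except KeyError:
        raise RuntimeError(
            f"Secret {name!r} was not injected into this Training Job"
        ) from None

# token = get_secret("HF_TOKEN")  # e.g. to authenticate against an external API
```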
Hyperparameters
Hyperparameters are configuration settings that control the behavior of the Training Script. They can include parameters such as learning rate, batch size, number of epochs, where to load the dataset from, where to save the model checkpoints, how often to save checkpoints, and so on.

Key Concepts
Checkpoints
FlexAI automatically manages Checkpoints during training, allowing you to resume, analyze, fetch, export, or deploy model Checkpoints as Inference Endpoints. Every time your code calls the torch.save() function to save a model Checkpoint to the /output-checkpoints/ directory, the FlexAI Checkpoint Manager is triggered and manages the saved Checkpoint, making it available for later use.
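In practice this means an ordinary torch.save() call into /output-checkpoints/ is all the Checkpoint Manager needs. A minimal sketch; the state-dict keys and file-naming scheme below are illustrative choices, not requirements:

```python
# Sketch: any torch.save() into /output-checkpoints/ is picked up by the
# FlexAI Checkpoint Manager. Keys and filename pattern here are illustrative.
import os
import torch

def save_checkpoint(model, optimizer, step, out_dir="/output-checkpoints"):
    """Save a resumable training state; returns the written path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"step-{step}.pt")
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )
    return path
```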
A Training Job’s Lifecycle
Learn more about the different statuses a Training Job can have in the Training Job Lifecycle page.

Third-Party Integration
Your FlexAI Training Jobs can interact with third-party APIs and services by leveraging the FlexAI Secret Manager to securely store sensitive information such as API keys and tokens. These secrets can then be injected into the Training Job’s runtime environment as environment variables, allowing your Training scripts to access them securely and enabling flexible and powerful AI workflows.

CLI Reference
The flexai training command manages Training Jobs: you can start new jobs, monitor their progress, inspect details, fetch output artifacts, and manage their lifecycle from the command line.
Getting Started
FlexAI training jobs can be launched in a few steps. The Quickstart guide will walk you through the process of preparing your data and starting your first training job. Here’s a brief overview of the steps involved:
- Preparing your dataset for training.
- Launching a training job using FlexAI.
- Monitoring training progress and managing checkpoints.