Key Features
- Distributed Training: Scale across multiple GPUs and nodes
- Checkpoint Management: Automatic saving and resuming
- Interactive Sessions: Debug and experiment interactively
- Resource Monitoring: Track GPU usage and performance
Anatomy of a Training Job
A Training Job in FlexAI consists of the following components:
- Training Script: The script that defines the training process, including data loading, model training, evaluation, Checkpoint generation, and so on.
- Dataset: The Dataset you will use to train your model. It can be an existing FlexAI Dataset, or it can be pulled/streamed from an external source during runtime.
- Secrets: Any sensitive information (e.g., API keys, passwords) required for the training process.
- Hyperparameters: Configuration settings that control the training process (e.g., learning rate, batch size).
Training Script
The training script is a Python script that defines the training process. It handles tasks such as:
- Loading the base model Checkpoint from the /input-checkpoint/ path, or downloading it from an external source.
- Loading a Dataset mounted through the FlexAI Dataset Manager on the /input/ path, or from an external source.
- Performing any necessary data processing steps.
- Reading the values of any Secrets passed to the Training Job’s runtime environment.
- Applying any hyperparameter configurations passed to the Training Script.
- Running the training loop and periodically saving Checkpoints to the /output-checkpoints/ path.
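The tasks above can be sketched as a minimal, framework-agnostic Training Script skeleton. This is an illustration only, using the standard library: the hyperparameter names (`--learning-rate`, `--batch-size`, etc.) are examples rather than a fixed FlexAI contract, and the model/optimizer setup is left to your ML framework.

```python
# Minimal sketch of a Training Script's structure (stdlib only; the actual
# model code depends on your framework). Paths match the mounts described above.
import argparse
from pathlib import Path

INPUT_CHECKPOINT_DIR = Path("/input-checkpoint")   # base model Checkpoint mount
DATASET_DIR = Path("/input")                       # FlexAI Dataset mount
OUTPUT_CHECKPOINT_DIR = Path("/output-checkpoints")  # watched by the Checkpoint Manager

def parse_hyperparameters(argv=None):
    """Hyperparameters arrive as ordinary CLI arguments; the names are illustrative."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning-rate", type=float, default=3e-4)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--save-every", type=int, default=1,
                        help="Save a Checkpoint every N epochs.")
    return parser.parse_args(argv)

def train(args):
    """Skeleton training loop; model/optimizer setup is framework-specific."""
    for epoch in range(args.epochs):
        # ... iterate over batches read from DATASET_DIR, forward/backward pass ...
        if epoch % args.save_every == 0:
            # In a PyTorch script this is where torch.save(state, OUTPUT_CHECKPOINT_DIR / ...)
            # would run, which the FlexAI Checkpoint Manager picks up automatically.
            pass

args = parse_hyperparameters(["--learning-rate", "1e-3", "--epochs", "2"])
train(args)
```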
Dataset
A Training Job generally leverages a Dataset to adjust a model’s weights using new data. The Dataset can be:
- An existing FlexAI Dataset.
- Pulled/streamed from an external source during runtime.
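Both options can be handled in one place in the Training Script. The sketch below (an assumption, not a FlexAI API) reads a mounted Dataset from /input/ when it exists and otherwise falls back to streaming from an external source; the `.jsonl` layout and the placeholder URL are illustrative.

```python
# Sketch: read a mounted FlexAI Dataset, or stream from an external source.
# The file layout (*.jsonl) and fallback URL are placeholders, not a FlexAI contract.
from pathlib import Path
from urllib.request import urlopen

def iter_dataset(mount_point="/input"):
    """Yield raw records from the mounted Dataset if present,
    otherwise stream them from an external source at runtime."""
    mount = Path(mount_point)
    if mount.is_dir():
        for file in sorted(mount.rglob("*.jsonl")):  # layout is dataset-specific
            with file.open() as fh:
                yield from fh
    else:
        # Placeholder external source; replace with your own storage/streaming client.
        with urlopen("https://example.com/dataset.jsonl") as resp:
            yield from (line.decode() for line in resp)
```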
Secrets
Secrets are used to securely pass sensitive information to the Training Job’s runtime environment. This can include API keys, passwords, or any other confidential data required for the training process. Secrets are managed through the FlexAI Secret Manager and are injected into the Training Job’s environment as environment variables.
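Since Secrets arrive as environment variables, reading one is a plain `os.environ` lookup. A small sketch, with a fail-loud wrapper; the variable name `HF_TOKEN` is only an example of a Secret you might attach:

```python
# Sketch: Secrets injected by the FlexAI Secret Manager are read from the
# environment. The name "HF_TOKEN" below is an example, not a fixed name.
import os

def get_secret(name):
    """Read an injected Secret, failing loudly if it was not attached to the job."""
    try:
        return os.environ[name]
    except KeyError:
        raise RuntimeError(
            f"Secret {name!r} was not injected into this Training Job"
        ) from None

# token = get_secret("HF_TOKEN")  # e.g. to authenticate against an external API
```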
Hyperparameters
Hyperparameters are configuration settings that control the behavior of the Training Script. They can include parameters such as learning rate, batch size, number of epochs, where to load the dataset from, where to save the model checkpoints, how often to save checkpoints, and so on.

Key Concepts
Checkpoints
FlexAI automatically manages Checkpoints during training, allowing you to resume, analyze, fetch, export, or deploy model Checkpoints as Inference Endpoints. Every time your code calls the torch.save() function to save a model Checkpoint to the /output-checkpoints/ directory, the FlexAI Checkpoint Manager is triggered and manages the saved Checkpoint, making it available for later use.
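In practice this means an ordinary torch.save() call into /output-checkpoints/ is all the Checkpoint Manager needs. A minimal sketch; the state-dict keys and file-naming scheme below are illustrative choices, not requirements:

```python
# Sketch: any torch.save() into /output-checkpoints/ is picked up by the
# FlexAI Checkpoint Manager. Keys and filename pattern here are illustrative.
import os
import torch

def save_checkpoint(model, optimizer, step, out_dir="/output-checkpoints"):
    """Save a resumable training state; returns the written path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"step-{step}.pt")
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )
    return path
```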
A Training Job’s Lifecycle
Learn more about the different statuses a Training Job can have in the Training Job Lifecycle page.

Third-Party Integration
Your FlexAI Training Jobs can interact with third-party APIs and services by leveraging the FlexAI Secret Manager to securely store sensitive information such as API keys and tokens. These secrets can then be injected into the Training Job’s runtime environment as environment variables, allowing your Training scripts to access them securely and enabling flexible and powerful AI workflows.

CLI Reference
The flexai training command manages Training Jobs: you can start new jobs, monitor their progress, inspect details, fetch output artifacts, and manage their lifecycle from the command line.
Getting Started
FlexAI training jobs can be launched in a few steps. The Quickstart guide will walk you through the process of preparing your data and starting your first training job. Here’s a brief overview of the steps involved:
- Preparing your dataset for training.
- Launching a training job using FlexAI.
- Monitoring training progress and managing checkpoints.