Training

Train AI models from scratch or continue training existing models using FlexAI’s powerful training infrastructure.

You can train public and private models hosted on the Hugging Face Model Hub or models you’ve previously fine-tuned with FlexAI.

  • Distributed Training: Scale across multiple GPUs and nodes.
  • Checkpoint Management: Automatic saving and resuming.
  • Interactive Sessions: Debug and experiment interactively.
  • Resource Monitoring: Track GPU usage and performance.

A Training Job in FlexAI consists of the following components:

  1. Training Script: The script that defines the training process, including data loading, model training, evaluation, Checkpoint generation, and so on.
  2. Dataset: The Dataset you will use to train your model.
    • It can be an existing FlexAI Dataset, or it can be pulled/streamed from an external source during runtime.
  3. Secrets: Any sensitive information (e.g., API keys, passwords) required for the training process.
  4. Hyperparameters: Configuration settings that control the training process (e.g., learning rate, batch size).

The training script is a Python script that defines the training process. It handles tasks such as:

  • Loading the base model Checkpoint from the /input-checkpoint/ path, or downloading it from an external source.
  • Loading a Dataset mounted through the FlexAI Dataset Manager on the /input/ path or from an external source.
  • Performing any necessary data processing steps.
  • Reading the values of any Secrets passed to the Training Job’s runtime environment.
  • Applying any hyperparameter configurations passed to the Training Script.
  • Running the training loop and periodically saving Checkpoints to the /output-checkpoints/ path.

A Training Job generally leverages a Dataset to adjust a model’s weights using new data. The Dataset can be:

  • An existing FlexAI Dataset.
  • Pulled/streamed from an external source during runtime.

The Dataset can be in any format, and it can contain any kind of files. The Training Script is responsible for loading and processing the data as needed.

Secrets are used to securely pass sensitive information to the Training Job’s runtime environment. This can include API keys, passwords, or any other confidential data required for the training process.

Secrets are managed through the FlexAI Secret Manager and are injected into the Training Job’s environment as environment variables.
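Since Secrets arrive as environment variables, a Training Script can read them with the standard library. The helper below is a small sketch; `require_secret` is a hypothetical name, and the variable names you pass to it depend on how your Secrets are configured.

```python
import os


def require_secret(name: str) -> str:
    """Return the Secret's value from the environment, or fail with a clear error."""
    value = os.environ.get(name)
    if not value:
        # Fail early and name the variable, but never print a Secret's value.
        raise RuntimeError(f"Secret {name!r} is not set in the environment")
    return value
```

Failing at startup when a Secret is missing is usually preferable to discovering it partway through a long training run.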

Hyperparameters are configuration settings that control the behavior of the Training Script. They can include parameters such as learning rate, batch size, number of epochs, where to load the dataset from, where to save the model checkpoints, how often to save checkpoints, and so on.
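One common way for a script to accept such settings is through command-line arguments. The sketch below assumes that is how your Training Script receives them; the flag names and default values are illustrative and not prescribed by FlexAI.

```python
import argparse


def parse_hyperparameters(argv=None) -> argparse.Namespace:
    """Parse illustrative hyperparameters; `argv=None` reads from sys.argv."""
    parser = argparse.ArgumentParser(description="Training hyperparameters (illustrative)")
    parser.add_argument("--learning-rate", type=float, default=3e-4)
    parser.add_argument("--batch-size", type=int, default=8)
    parser.add_argument("--epochs", type=int, default=1)
    # Paths default to the mount points described on this page.
    parser.add_argument("--dataset-dir", default="/input")
    parser.add_argument("--output-dir", default="/output-checkpoints")
    parser.add_argument("--save-every", type=int, default=500,
                        help="number of steps between Checkpoints")
    return parser.parse_args(argv)
```

Giving every flag a default keeps the script runnable with no arguments while still letting each Training Job override individual values.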

FlexAI automatically manages Checkpoints during training, so you can resume from, analyze, fetch, or export model Checkpoints, or deploy them as Inference Endpoints.

Each time your code calls torch.save() to write a model Checkpoint to the /output-checkpoints/ directory, the FlexAI Checkpoint Manager picks up the saved Checkpoint and makes it available for later use.

Learn more about the different statuses a Training Job can have in the Training Job Lifecycle page.

Your FlexAI Training Jobs can interact with third-party APIs and services. Store sensitive information such as API keys and tokens in the FlexAI Secret Manager; these Secrets are then injected into the Training Job’s runtime environment as environment variables, where your Training Scripts can read them securely.

The flexai training command manages Training Jobs: you can start new jobs, monitor their progress, inspect details, fetch output artifacts, and manage their lifecycle from the command line.

FlexAI training jobs can be launched in a few steps. The Quickstart guide will walk you through the process of preparing your data and starting your first training job. Here’s a brief overview of the steps involved:

  1. Preparing your dataset for training.
  2. Launching a training job using FlexAI.
  3. Monitoring training progress and managing checkpoints.

For more details, see the overview of the FlexAI Training Quickstart guide.