training run

Starts a new Training Job. This command allows you to specify the dataset, repository, hardware requirements, and other parameters for the job. Everything after the -- separator is the entry point script and its arguments, which are passed through to the script.

flexai training run <training_or_fine_tuning_job_name> [flags] -- <entry_point_script_path> [script_args]

<training_or_fine_tuning_job_name>
Required

A unique name for the Training Job.

Examples
  • gpt2training-1
  • my-model-training

<entry_point_script_path>
Required

The path to the entry point script for the Training or Fine-tuning Job.

Examples
  • train.py
-a , --accels
<integer>
Optional
Default Value: 1
Integer

Number of accelerators/GPUs to use.
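
Examples
  • --accels 4
  • --accels 8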

--affinity
<key=value>
Optional
--build-secret
<key=value>
Optional

FlexAI Secrets to make available during the image build process. Format: <flexai_secret_name>=<environment_variable_name>

Examples
  • --build-secret build_config_secret=SECRET_ENV_VAR_TO_USE
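
For instance, a build Secret can be supplied alongside a repository build (a sketch; the job name and secret names reuse examples from this page):

flexai training run gpt2training-1 --repository-url https://github.com/flexaihq/nanoGPT/ --build-secret build_config_secret=SECRET_ENV_VAR_TO_USE -- train.py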
--checkpoint
<checkpoint_name>
Optional

A Checkpoint to use as a starting point for a Fine-tuning Job.

The name of a previously pushed Checkpoint. Use flexai checkpoint list to see available Checkpoints.

Examples
  • --checkpoint Mixtral-8x7B-v0_1
  • --checkpoint gemma-3n-E4B-it
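
For instance, a Fine-tuning Job might start from a previously pushed Checkpoint (a sketch; the job name is a placeholder, and the other values reuse examples from this page):

flexai training run finetune-1 --checkpoint gemma-3n-E4B-it --repository-url https://github.com/flexaihq/nanoGPT/ -- train.py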
--dataset
<dataset_name[=mount_path]>
Optional

Dataset to mount on the runtime environment.

Examples
  • --dataset open_web
  • --dataset fineweb-edu

Datasets to mount on the runtime environment using a custom mount path.

Examples
  • --dataset open_web=data/train/ow --dataset fineweb-edu=/data/train/fineweb-edu
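
For instance, a Dataset mounted at a custom path can be handed to the entry point script (a sketch; the --data-dir script argument is illustrative and belongs to the script, not to this command):

flexai training run gpt2training-1 --dataset fineweb-edu=/data/train/fineweb-edu --repository-url https://github.com/flexaihq/nanoGPT/ -- train.py --data-dir /data/train/fineweb-edu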
--device-arch
<option>
Optional
Default Value: nvidia
Option list

The architecture of the device to run the Training Job on.

One of:

  • nvidia
  • amd
  • tt
Examples
  • --device-arch nvidia
-E , --env
<key=value>
Optional

Environment variables to set in the interactive environment.

Examples
  • --env WANDB_ENTITY=georgec123 --env WANDB_PROJECT=gpt-j
-h , --help
<boolean>
Optional
Flag

Displays this help page.

--no-queuing
<boolean>
Optional
Flag

Disables queuing for this workload: if no resources are available, the workload fails immediately instead of waiting for resources to become available.
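
For instance, to fail fast when no capacity is free (a sketch; names reuse examples from this page):

flexai training run gpt2training-1 --no-queuing --repository-url https://github.com/flexaihq/nanoGPT/ -- train.py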

-n , --nodes
<integer>
Optional
Default Value: 1
Integer

The number of nodes across which to distribute the workload.

Selecting more than one node overrides the value provided in the --accels flag, setting it to 8 accelerators per node.

Examples
  • --nodes 1
  • --nodes 4
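
For instance, a four-node run (a sketch; the job name is a placeholder, and with more than one node each node uses 8 accelerators regardless of --accels):

flexai training run gpt2training-4n --nodes 4 --repository-url https://github.com/flexaihq/nanoGPT/ -- train.py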
--repository-revision
<branch|tag|commit_sha>
Optional
Default Value: main
String

The branch name of the repository. main by default.

Examples
  • --repository-revision secondary
  • --repository-revision testing
Commit SHA

A commit SHA to use.

Examples
  • --repository-revision 9fceb02
  • --repository-revision e5bd391
String

A tag name to use.

Examples
  • --repository-revision v1.0.0
  • --repository-revision release-2024
--repository-url
<url>
Optional
URL

Git repository URL containing code to mount on the workload environment.

Will be mounted on the /workspace directory.

Examples
  • --repository-url https://github.com/flexaihq/nanoGPT/
  • --repository-url https://github.com/flexaihq/nanoGPT.git
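
A revision can be pinned alongside the repository URL (a sketch; values reuse examples from this page):

flexai training run gpt2training-1 --repository-url https://github.com/flexaihq/nanoGPT/ --repository-revision v1.0.0 -- train.py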
--requirements-path
<path>
Optional
Path

Path to a pip requirements.txt file in the repository.

Examples
  • --requirements-path code/project/requirements.txt
Optional
String

Name of the runtime to use.

-S , --secret
<key=value>
Optional

Environment variables that will be set in the Training Runtime.

Secrets are sensitive values like API keys, tokens, or credentials that need to be accessed by your Training Job but should not be exposed in logs or command history. When using the --secret flag, the actual secret values are retrieved from the Secrets Storage and injected into the environment at runtime.

Syntax:

  • <env_var_name>=<flexai_secret_name>

Where <env_var_name> is the name of the environment variable to set, and <flexai_secret_name> is the name of the Secret containing the sensitive value.

Examples
  • --secret HF_TOKEN=hf-token-dev
  • --secret WANDB_API_KEY=wandb-key
Optional
Flag

Enables verbose logging for detailed output during Training Job execution.

Start a training job with dataset and repository

flexai training run gpt2training-1 --dataset wikitext-2-raw-v1 --repository-url https://github.com/flexaihq/nanoGPT/ --accels 4 --secret HF_TOKEN=hf-token-dev --env BATCH_SIZE=32 -- train.py --batch-size 32 --epochs 10