Starts a new Training Job. This command allows you to specify the dataset, repository, hardware requirements, and other parameters for the job.

Usage

flexai training run <training_or_fine_tuning_job_name> [flags] -- <entry_point_script_path> [script_args]

Arguments

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| training_or_fine_tuning_job_name | string | Yes | A unique name for the Training Job. |
| entry_point_script_path | string | Yes | The path to the entry point script for the Training or Fine-tuning Job. |
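
To make the anatomy of the usage line concrete, here is a minimal sketch that prints a command instead of submitting it. The `dry_run` helper is a hypothetical shell function defined here, not part of the FlexAI CLI, and the job name `my-job`, the script `train.py`, and the script argument `--epochs 1` are placeholder values chosen for illustration.

```shell
# Hypothetical helper: prints the command it is given instead of executing it.
dry_run() { printf '%s\n' "$*"; }

job_name="my-job"        # <training_or_fine_tuning_job_name> (placeholder)
entry_point="train.py"   # <entry_point_script_path> (placeholder)

# Flags go before the `--` separator; the entry point script and its
# own arguments go after it.
preview=$(dry_run flexai training run "$job_name" --accels 1 -- "$entry_point" --epochs 1)
echo "$preview"
```

Removing the `dry_run` prefix would turn the previewed line into an actual submission.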

Flags

| Flag | Short | Type | Description |
| --- | --- | --- | --- |
| --accels | -a | integer | Number of accelerators/GPUs to use. Default: 1 |
| --affinity | | key-value | Affinity rules for the workload. |
| --build-secret | | key-value | FlexAI Secrets to make available during the image build process. Format: <flexai_secret_name>=<environment_variable_name> |
| --checkpoint | -C | string | A Checkpoint to use as the starting point for a Fine-tuning Job. Either the name of a previously pushed Checkpoint (use flexai checkpoint list to see available Checkpoints) or the UUID of an Inference Ready Checkpoint generated during the execution of a Training or Fine-tuning Job (use flexai training checkpoints to see available Checkpoints). |
| --dataset | -D | string | Dataset to mount in the runtime environment. Can be specified as a simple name or as a key-value mapping to use a custom mount path. |
| --device-arch | -d | string | The architecture of the device to run the Training Job on. Default: nvidia |
| --env | -E | key-value | Environment variables to set in the runtime environment. |
| --help | -h | boolean | Displays this help page. |
| --no-queuing | | boolean | Disables queuing for this workload. If no resources are available, the workload fails immediately instead of waiting for resources to become available. |
| --nodes | -n | integer | The number of nodes across which to distribute the workload. Selecting more than 1 node overrides the value of the --accels flag and sets it to 8 accelerators per node. Default: 1 |
| --repository-revision | -b | string | The repository revision to use: a branch name, a commit SHA, or a tag name. Default: main |
| --repository-url | -u | string | URL of the Git repository containing the code to mount in the workload environment. The code is mounted at the /workspace directory. |
| --requirements-path | -q | string | Path to a pip requirements.txt file in the repository. |
| --runtime | -r | string | Name of the runtime to use. |
| --secret | -S | key-value | Environment variables that will be set in the Training Runtime. Secrets are sensitive values such as API keys, tokens, or credentials that your Training Job needs to access but that should not be exposed in logs or command history. When you use the --secret flag, the actual secret values are retrieved from the Secrets Storage and injected into the environment at runtime. Syntax: <env_var_name>=<flexai_secret_name>, where <env_var_name> is the name of the environment variable to set and <flexai_secret_name> is the name of the Secret containing the sensitive value. |
| --verbose | -v | boolean | Enables verbose logging for detailed output during Training Job execution. |
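
The two secret flags take their key-value pairs in opposite orders, which is easy to mix up. The sketch below prints, rather than submits, a command that uses both. The Secret name `hf-token-dev`, the variable name `HF_TOKEN`, and the repository URL are borrowed from the example at the end of this page; the job name `secrets-demo` and the `dry_run` helper are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: prints the command it is given instead of executing it.
dry_run() { printf '%s\n' "$*"; }

secret_name="hf-token-dev"   # name of a FlexAI Secret (assumed to exist already)
env_var="HF_TOKEN"           # environment variable that should receive its value

# --build-secret takes <flexai_secret_name>=<environment_variable_name>
# --secret       takes <env_var_name>=<flexai_secret_name>
preview=$(dry_run flexai training run secrets-demo \
  --repository-url https://github.com/flexaihq/nanoGPT/ \
  --build-secret "${secret_name}=${env_var}" \
  --secret "${env_var}=${secret_name}" \
  -- train.py)
echo "$preview"
```

Keeping the Secret name and the variable name in separate shell variables makes the reversed order visible at a glance.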

Examples

Start a training job with dataset and repository

flexai training run gpt2training-1 --dataset wikitext-2-raw-v1 --repository-url https://github.com/flexaihq/nanoGPT/ --accels 4 --secret HF_TOKEN=hf-token-dev --env BATCH_SIZE=32 -- train.py --batch-size 32 --epochs 10
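
Two further variations, sketched under assumptions: a multi-node run and a Fine-tuning Job that starts from a Checkpoint. Both commands are only printed, using a hypothetical `dry_run` helper that is not part of the FlexAI CLI. The job names and the Checkpoint name `my-checkpoint` are placeholders; the dataset, repository URL, and script are reused from the example above.

```shell
# Hypothetical helper: prints the command it is given instead of executing it.
dry_run() { printf '%s\n' "$*"; }

repo_url="https://github.com/flexaihq/nanoGPT/"

# Multi-node run: with more than 1 node, --accels is overridden to
# 8 accelerators per node, so the flag is left out here.
multi_node=$(dry_run flexai training run gpt2training-2nodes \
  --dataset wikitext-2-raw-v1 \
  --repository-url "$repo_url" \
  --nodes 2 \
  -- train.py --batch-size 32 --epochs 10)

# Fine-tuning run: start from a Checkpoint (placeholder name), pin the
# repository revision, and fail immediately if no resources are available.
fine_tune=$(dry_run flexai training run gpt2finetune-1 \
  --checkpoint my-checkpoint \
  --dataset wikitext-2-raw-v1 \
  --repository-url "$repo_url" \
  --repository-revision main \
  --no-queuing \
  -- train.py --epochs 10)

echo "$multi_node"
echo "$fine_tune"
```

Whether a given flag combination suits your project depends on your repository and resources, so treat these as starting points rather than recommended settings.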