training run

Starts a new Training Job. This command allows you to specify the dataset, repository, hardware requirements, and other parameters for the job.

flexai training run <training_job_name> \
--dataset <dataset_name>... \
--repository-url <repository_url> \
[--repository-revision <branch_name>] \
[--checkpoint <checkpoint_id>] \
[--env <env_var>=<value>...] \
[--secret <env_var>=<secret_name>...] \
[--nodes <node_amount>] \
[--accels <accelerator_amount>] \
-- <entry_point_script_path> [<entry_point_script_args>...]
training_job_name
<string>
Required

The name that will be used to identify the Training Job while performing operations on it.

Examples
  • gpt2training-1
--
<string>
Required
End of Options Marker

End-of-options marker. The first positional argument after it is the entry point script path, which is generally the training script. Anything that follows is passed to that script.

Examples
  • flexai training run ... -- train.py
entry_point_script_path
Required
Path

The path to the script to execute when the Training Runtime starts.

Examples
  • train.py
  • src/fine-tune.py
  • main.py
script_args
<string>
Optional
Argument

Arguments to pass to the entry point script.

Examples
  • config/train_shakespeare_char.py
  • data/dataset.json
  • 10000
  • resnet50
  • configs/hyperparams.yaml
Flag

Flags to pass to the entry point script.

Examples
  • --batch_size=32 --epochs=10
  • --test-run --learning-rate=0.001
  • USE_WANDB=true --wandb-project=my-project
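
As an illustration of how an entry point script might consume the arguments and flags listed above, here is a minimal sketch using Python's standard `argparse` module. It assumes that everything after the `--` marker reaches the script as ordinary command-line arguments; the `config` positional and the default values are hypothetical and chosen only for the example.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that follow the `--` marker on the command line."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    # Optional positional argument, for example a config file path.
    parser.add_argument("config", nargs="?", default=None)
    # Flags; the names and defaults here are illustrative only.
    parser.add_argument("--batch-size", type=int, default=8)
    parser.add_argument("--epochs", type=int, default=1)
    # With argv=None, argparse reads sys.argv[1:], which is what a real run would use.
    return parser.parse_args(argv)


# Stand-in for: flexai training run ... -- train.py configs/hyperparams.yaml --batch-size 32 --epochs 10
args = parse_args(["configs/hyperparams.yaml", "--batch-size", "32", "--epochs", "10"])
print(args.config, args.batch_size, args.epochs)
```

In a real script, calling `parse_args()` with no argument would read the values supplied on the `flexai training run` command line.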
-a, --accels
<integer>
Optional
Default Value: 1
Integer

Number of accelerators to use for the workload.

Examples
  • --accels 4
--repository-revision
Optional
Default Value: main
String

The branch name of the repository.

Examples
  • --repository-revision main
UUID

The commit hash of the repository.

Examples
  • --repository-revision 53f6b645fc5d039152aef884def64288e3eeb56b
String

The tag name of the repository.

Examples
  • --repository-revision v1.0.0
--checkpoint
Optional

The identifier of a Checkpoint to use as the starting point for the Training Job.

UUID

The UUID of a user-provided Checkpoint (see flexai checkpoint).

Examples
  • --checkpoint a1b18a7f-9b85-4c74-91a9-6aca526e8ce4
-D, --dataset
<string><key=value>
Required

A Dataset name or a key=value pair representing a Dataset and a custom mount point on the Training Runtime.

Multiple Datasets can be used within a single Training Job. Depending on the value format passed (Resource Name or Key Value Path Mapping), each Dataset is mounted at one of the following paths:

  • /input/<dataset_name>
  • /input/<dataset_mount_path>

See the available value format options below.

Key Value Path Mapping

A key=value pair representing a Dataset to use and its destination mount path on the Training Runtime.

Syntax:

  • <dataset_name>=<dataset_mount_path>
Examples
  • --dataset wikitext-2-raw-v1=/wikitext2/v1
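
The mounting rule described above can be sketched as a small helper that a training script could use to locate its data. This is an illustration, not part of the CLI: the function name `dataset_mount_path` is hypothetical, and the sketch assumes that a leading `/` in the custom mount path is interpreted relative to `/input`.

```python
from pathlib import PurePosixPath

INPUT_ROOT = PurePosixPath("/input")


def dataset_mount_path(dataset_arg: str) -> PurePosixPath:
    """Return the expected mount path for one --dataset value."""
    name, sep, mount = dataset_arg.partition("=")
    # Resource Name form: mounted under /input/<dataset_name>.
    # Key Value Path Mapping form: mounted under /input/<dataset_mount_path>.
    target = mount if sep else name
    # Assumption: a leading "/" in the mount path is relative to /input.
    return INPUT_ROOT / target.lstrip("/")


print(dataset_mount_path("wikitext-2-raw-v1"))
print(dataset_mount_path("wikitext-2-raw-v1=/wikitext2/v1"))
```

Under these assumptions, the first call resolves to `/input/wikitext-2-raw-v1` and the second to `/input/wikitext2/v1`.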
-d, --device-arch
<option_list>
Optional
Default Value: nvidia
Option list
  • nvidia
Examples
  • --device-arch nvidia
-E, --env
<key=value>
Optional

Environment variables that will be set in the Training Runtime.

Examples
  • --env BATCH_SIZE=32
  • --env WANDB_PROJECT=my-project-123
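
Inside the training script, variables passed with `--env` can be read like any other environment variable. The sketch below is one possible way to do that in Python; the helper `env_int` and the fallback values are hypothetical, and the assignment to `os.environ` only simulates what `--env BATCH_SIZE=32` would provide.

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    # int() raises ValueError on non-numeric input, which surfaces typos early.
    return int(raw)


# Simulate what `--env BATCH_SIZE=32` would provide inside the runtime.
os.environ["BATCH_SIZE"] = "32"

batch_size = env_int("BATCH_SIZE", default=8)
wandb_project = os.environ.get("WANDB_PROJECT", "unset")
print(batch_size, wandb_project)
```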
-n, --nodes
<integer>
Optional
Default Value: 1
Integer

Number of nodes to use for the workload.

Examples
  • --nodes 4
--requirements-path
Optional
Default Value: ./
String

Path to the requirements.txt file that will be used to install the dependencies in the Training Runtime.

This path is relative to the root of the repository (specified by the --repository-url flag).

Examples
  • --requirements-path path/to/requirements.txt
-S, --secret
<key=value>
Optional

Environment variables that will be set in the Training Runtime from Secrets. The value given for each variable on the command line is the name of a Secret (see flexai secret list), not the secret value itself.

Secrets are sensitive values like API keys, tokens, or credentials that need to be accessed by your Training Job but should not be exposed in logs or command history. When using the --secret flag, the actual secret values are retrieved from the Secrets Storage and injected into the environment at runtime.

Syntax:

  • <env_var_name>=<secret_name>

Where <env_var_name> is the name of the environment variable to set, and <secret_name> is the name of the Secret to use as the value.

Examples
  • --secret HF_TOKEN=hf-token-dev
  • --secret WANDB_API_KEY=wandb-key
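
Because secrets arrive in the Training Runtime as environment variables, a training script can read them the same way, taking care not to print them. The following is a minimal sketch: `require_secret` and `masked` are hypothetical helpers, and the token assigned to `os.environ` is a placeholder standing in for the value injected by `--secret HF_TOKEN=hf-token-dev`.

```python
import os


def require_secret(env_var: str) -> str:
    """Return a secret injected as an environment variable, or fail fast."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set; was it passed with --secret?")
    return value


def masked(value: str, visible: int = 4) -> str:
    """Return a log-safe form that hides all but the last few characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Placeholder for the value injected at runtime; not a real token.
os.environ["HF_TOKEN"] = "hf_example_not_a_real_token"

token = require_secret("HF_TOKEN")
print("HF_TOKEN loaded:", masked(token))
```

Failing fast on a missing variable makes a forgotten `--secret` flag visible at startup instead of partway through training.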
--repository-url
Required
Git Repository

The URL of the Git repository containing the training code.

Examples
  • --repository-url https://github.com/flexaihq/nanoGPT/
  • --repository-url https://github.com/flexaihq/nanoGPT.git
Example
flexai training run gpt2training-1 \
--dataset wikitext-2-raw-v1 \
--repository-url https://github.com/flexaihq/nanoGPT/ \
--accels 4 \
--secret HF_TOKEN=hf-token-dev \
--env BATCH_SIZE=32 \
-- train.py --batch-size 32 --epochs 10
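
When jobs are launched from automation, the command above can be assembled as an argument list instead of a hand-written string. The sketch below is only an illustration: `build_training_run` is a hypothetical helper, it uses only the flags documented on this page, and it prints the command without running it.

```python
import shlex


def build_training_run(name, datasets, repository_url, entry_point,
                       script_args=(), nodes=None, accels=None,
                       env=None, secrets=None):
    """Assemble the argv list for `flexai training run` from documented flags."""
    cmd = ["flexai", "training", "run", name]
    for dataset in datasets:
        cmd += ["--dataset", dataset]
    cmd += ["--repository-url", repository_url]
    if nodes is not None:
        cmd += ["--nodes", str(nodes)]
    if accels is not None:
        cmd += ["--accels", str(accels)]
    for key, value in (env or {}).items():
        cmd += ["--env", f"{key}={value}"]
    for key, secret_name in (secrets or {}).items():
        # The value is the name of a Secret, never the secret itself.
        cmd += ["--secret", f"{key}={secret_name}"]
    # Everything after the end-of-options marker belongs to the entry point script.
    cmd += ["--", entry_point, *script_args]
    return cmd


cmd = build_training_run(
    "gpt2training-1",
    datasets=["wikitext-2-raw-v1"],
    repository_url="https://github.com/flexaihq/nanoGPT/",
    entry_point="train.py",
    script_args=["--batch-size", "32", "--epochs", "10"],
    accels=4,
    env={"BATCH_SIZE": 32},
    secrets={"HF_TOKEN": "hf-token-dev"},
)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it without any shell quoting concerns, assuming the `flexai` CLI is installed and authenticated.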