Starts a new Training Job. This command allows you to specify the dataset, repository, hardware requirements, and other parameters for the job.
flexai training run <training_job_name> \
  --dataset <dataset_name>... \
  --repository-url <repository_url> \
  [--repository-revision <branch_name>] \
  [--checkpoint <checkpoint_id>] \
  [--env <env_var>=<value>...] \
  [--secret <env_var>=<secret_value>...] \
  [--nodes <node_amount>] \
  [--accels <accelerator_amount>] \
  -- <entry_point_script_path> [<entry_point_script_args>...]
<training_job_name>
The name that will be used to identify the Training Job while performing operations on it.
Examples:
- gpt2training-1
--
End-of-options marker. The first positional argument after it is the entry point script path, which is generally the training script.
Examples:
- flexai training run ... -- train.py
<entry_point_script_path>
The path to the script to execute when the Training Runtime starts.
Examples:
- train.py
- src/fine-tune.py
- main.py
<entry_point_script_args>
Arguments to pass to the entry point script.
Examples:
- config/train_shakespeare_char.py
- data/dataset.json
- 10000
- resnet50
- configs/hyperparams.yaml
Flags to pass to the entry point script.
Examples:
- --batch_size=32 --epochs=10
- --test-run --learning-rate=0.001
- USE_WANDB=true --wandb-project=my-project
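Inside the Training Runtime, everything after `--` reaches the entry point script as ordinary process arguments, so the script can read them with a standard parser. A minimal sketch (the flag names and defaults below mirror the examples above but are illustrative, not mandated by flexai):

```python
import argparse

# Hypothetical parser for an entry point such as train.py.
parser = argparse.ArgumentParser(description="Example entry point script")
parser.add_argument("config", nargs="?", help="optional positional config file")
parser.add_argument("--batch_size", type=int, default=12)
parser.add_argument("--epochs", type=int, default=1)

# Simulates what the script would receive from:
#   flexai training run ... -- train.py config/train_shakespeare_char.py --batch_size=32 --epochs=10
args = parser.parse_args(
    ["config/train_shakespeare_char.py", "--batch_size=32", "--epochs=10"]
)
```

In a real job the script would call `parser.parse_args()` with no arguments and receive the values passed on the `flexai training run` command line.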
--accels <accelerator_amount>
Default: 1
Number of accelerators to use for the workload.
Examples:
- --accels 4
--repository-revision <revision>
Default: main
The revision of the repository to check out: a branch name, a commit hash, or a tag name.
Examples:
- --repository-revision main (branch name)
- --repository-revision 53f6b645fc5d039152aef884def64288e3eeb56b (commit hash)
- --repository-revision v1.0.0 (tag name)
--checkpoint <checkpoint_id>
The ID of a Checkpoint generated during a Training Job's execution (see flexai training checkpoints):
- --checkpoint mistral-500-checkpoint
The name of a user-provided Checkpoint (see flexai checkpoint):
- --checkpoint a1b18a7f-9b85-4c74-91a9-6aca526e8ce4
--dataset <dataset_name>
A Dataset name or a key=value pair representing a Dataset and a custom mount point on the Training Runtime. Multiple Datasets can be used within a single Training Job. Depending on which value format is passed (Resource Name or Key Value Path Mapping), they will be mounted at either:
- /input/<dataset_name>
- /input/<dataset_mount_path>
The available value formats are:
Resource Name: the ID of a Dataset (see flexai dataset list).
- --dataset wikitext-2-raw-v1
Key Value Path Mapping: a key=value pair representing a Dataset to use and its destination mount path on the Training Runtime. Syntax: <dataset_name>=<dataset_mount_path>
- --dataset wikitext-2-raw-v1=/wikitext2/v1
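From the training script's side, a Dataset passed with the Resource Name format is readable under /input/<dataset_name>, per the mapping above. A small sketch of building that path (path construction only; the mount itself exists only inside the Training Runtime):

```python
from pathlib import Path

# A Dataset passed as a Resource Name, e.g. --dataset wikitext-2-raw-v1,
# is mounted at /input/<dataset_name> inside the Training Runtime.
dataset_name = "wikitext-2-raw-v1"
data_dir = Path("/input") / dataset_name

print(data_dir)  # /input/wikitext-2-raw-v1
# Inside the Training Runtime the script could then iterate its files:
# files = sorted(data_dir.iterdir())
```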
--device-arch <device_architecture>
Default: nvidia
The accelerator device architecture to use for the workload. Accepted value: nvidia
Examples:
- --device-arch nvidia
--env <env_var>=<value>
Environment variables that will be set in the Training Runtime.
Examples:
- --env BATCH_SIZE=32
- --env WANDB_PROJECT=my-project-123
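Within the Training Runtime these show up as ordinary environment variables. A minimal sketch of how an entry point script might consume one (the seeded value below only simulates what `--env BATCH_SIZE=32` would inject; a real job would not need that line):

```python
import os

# Simulate the variable that `--env BATCH_SIZE=32` would set in the
# Training Runtime; only needed for this standalone sketch.
os.environ.setdefault("BATCH_SIZE", "32")

# Environment variables always arrive as strings, so cast as needed.
batch_size = int(os.environ.get("BATCH_SIZE", "12"))
```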
--nodes <node_amount>
Default: 1
Number of nodes to use for the workload.
Examples:
- --nodes 4
--requirements-path <path>
Default: ./
Path to the requirements.txt file that will be used to install the dependencies in the Training Runtime. This path is relative to the root of the repository (specified by the --repository-url flag).
Examples:
- --requirements-path path/to/requirements.txt
--secret <env_var>=<secret_value>
Environment variables that will be set in the Training Runtime. The values of these variables are the names of Secrets (see flexai secret list).
Secrets are sensitive values like API keys, tokens, or credentials that need to be accessed by your Training Job but should not be exposed in logs or command history. When using the --secret flag, the actual secret values are retrieved from the Secrets Storage and injected into the environment at runtime.
Syntax: <env_var_name>=<secret_name>
Where <env_var_name> is the name of the environment variable to set, and <secret_name> is the name of the Secret to use as the value.
Examples:
- --secret HF_TOKEN=hf-token-dev
- --secret WANDB_API_KEY=wandb-key
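From the job's perspective, a resolved Secret is indistinguishable from any other environment variable; only the name-to-value resolution differs. A sketch of a defensive lookup in the entry point script (the helper name is illustrative, not part of any flexai API):

```python
import os

def require_secret(env_var_name: str) -> str:
    """Fetch a runtime-injected secret value, failing loudly if absent.

    Illustrative helper: in the Training Runtime, `--secret HF_TOKEN=hf-token-dev`
    surfaces the Secret's value as the HF_TOKEN environment variable.
    """
    value = os.environ.get(env_var_name)
    if not value:
        raise RuntimeError(
            f"{env_var_name} is not set; was it passed via --secret?"
        )
    return value
```

Failing early with a clear message beats passing an empty token to a downstream library, where the resulting authentication error is harder to trace back to a missing `--secret` flag.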
--repository-url <repository_url>
The URL of the Git repository containing the training code.
Examples:
- --repository-url https://github.com/flexaihq/nanoGPT/
- --repository-url https://github.com/flexaihq/nanoGPT.git
flexai training run gpt2training-1 \
  --dataset wikitext-2-raw-v1 \
  --repository-url https://github.com/flexaihq/nanoGPT/ \
  --accels 4 \
  --secret HF_TOKEN=hf-token-dev \
  --env BATCH_SIZE=32 \
  -- train.py --batch-size 32 --epochs 10