Starts a new Training Job. This command allows you to specify the dataset, repository, hardware requirements, and other parameters for the job.

Usage:
flexai training run <training_or_fine_tuning_job_name> [flags] -- <entry_point_script_path> [script_args]

Arguments:

<training_or_fine_tuning_job_name>
A unique name for the Training Job.
Examples:
- gpt2training-1
- my-model-training

<entry_point_script_path>
The path to the entry point script for the Training or Fine-tuning Job.
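
For illustration, a minimal invocation might look like the sketch below. The job name and the script arguments here are placeholders taken from the examples on this page; the standalone -- marks where the command's own flags end and the entry point script, plus its own arguments, begins.

# The standalone "--" separates the command's flags from the entry point script and its arguments.
flexai training run my-model-training --repository-url https://github.com/flexaihq/nanoGPT/ -- train.py --epochs 10
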
Flags:

--accels (default: 1)
Number of accelerators/GPUs to use.

Affinity rules for the workload.

--build-secret
FlexAI Secrets to make available during the image build process. Format: <flexai_secret_name>=<environment_variable_name>
Example: --build-secret build_config_secret=SECRET_ENV_VAR_TO_USE

--checkpoint
A Checkpoint to use as a starting point for a Fine-tuning Job. The value can be either:
- The name of a previously pushed Checkpoint. Use flexai checkpoint list to see available Checkpoints.
  Examples: --checkpoint Mixtral-8x7B-v0_1, --checkpoint gemma-3n-E4B-it
- The UUID of an Inference Ready Checkpoint generated during the execution of a Training or Fine-tuning Job. Use flexai training checkpoints to see available Checkpoints.
  Example: --checkpoint 3fa85f64-5717-4562-b3fc-2c963f66afa6
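
As a sketch of how these fit together, a Fine-tuning Job could be started from a previously pushed Checkpoint as shown below. The job name is hypothetical, and a real job would typically also specify a repository and dataset.

# List available Checkpoints, then start a Fine-tuning Job from one of them by name.
flexai checkpoint list
flexai training run gemma-finetune-1 --checkpoint gemma-3n-E4B-it -- train.py
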
--dataset
Dataset to mount on the runtime environment.
Examples: --dataset open_web --dataset fineweb-edu

Datasets can also be mounted on the runtime environment using a custom mount path:
Examples: --dataset open_web=data/train/ow --dataset fineweb-edu=/data/train/fineweb-edu
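
A custom mount path can be passed straight through to the entry point script, as in the sketch below; the job name and the script's --data-dir argument are hypothetical and depend on what the script expects.

# Mount the fineweb-edu Dataset at a custom path and point the (hypothetical) script at it.
flexai training run fineweb-run-1 --dataset fineweb-edu=/data/train/fineweb-edu -- train.py --data-dir /data/train/fineweb-edu
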
--device-arch (default: nvidia)
The architecture of the device to run the workload on.
One of: nvidia, amd, tt
Example: --device-arch nvidia

--env
Environment variables to set in the runtime environment.
Examples: --env WANDB_ENTITY=georgec123 --env WANDB_PROJECT=gpt-j

Displays this help page.

Disables queuing for this workload: if no resources are available, the workload will fail immediately instead of waiting for resources to become available.

--nodes (default: 1)
The number of nodes across which to distribute the workload.
Selecting more than 1 node overrides the value provided in the --accels flag, setting it to 8 accelerators per node.
Examples: --nodes 1, --nodes 4
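
As a sketch, a multi-node run could look like the following; the job name is hypothetical, and per the note above each of the 4 nodes gets 8 accelerators (32 in total) regardless of any --accels value.

# Hypothetical 4-node job, distributed across 4 x 8 accelerators.
flexai training run gpt2training-multinode --nodes 4 --repository-url https://github.com/flexaihq/nanoGPT/ -- train.py
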
--repository-revision (default: main)
The revision of the repository to use. The value can be:
- A branch name (main by default). Examples: --repository-revision secondary, --repository-revision testing
- A commit SHA hash. Examples: --repository-revision 9fceb02, --repository-revision e5bd391
- A tag name. Examples: --repository-revision v1.0.0, --repository-revision release-2024

--repository-url
Git repository URL containing code to mount on the workload environment. The repository will be mounted on the /workspace directory.
Examples: --repository-url https://github.com/flexaihq/nanoGPT/, --repository-url https://github.com/flexaihq/nanoGPT.git

--requirements-path
Path to a pip requirements.txt file in the repository.
Example: --requirements-path code/project/requirements.txt
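
Putting the repository-related flags together, a run pinned to a tagged revision with its own requirements file might look like the sketch below; the job name is hypothetical, and the entry point path is presumably resolved relative to the mounted repository.

# Run code from the v1.0.0 tag of the repository, using its pinned requirements file.
flexai training run nanogpt-v1 --repository-url https://github.com/flexaihq/nanoGPT.git --repository-revision v1.0.0 --requirements-path code/project/requirements.txt -- train.py
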
Name of the runtime to use.

--secret
Environment variables that will be set in the Training Runtime.
Secrets are sensitive values like API keys, tokens, or credentials that need to be accessed by your Training Job but should not be exposed in logs or command history. When using the --secret flag, the actual secret values are retrieved from the Secrets Storage and injected into the environment at runtime.
Syntax: <env_var_name>=<flexai_secret_name>, where <env_var_name> is the name of the environment variable to set and <flexai_secret_name> is the name of the Secret containing the sensitive value.
Examples: --secret HF_TOKEN=hf-token-dev --secret WANDB_API_KEY=wandb-key
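
For example, in the sketch below the Secret named hf-token-dev is injected as the HF_TOKEN environment variable, which the entry point script can then read at runtime; the job name is hypothetical.

# Inject the "hf-token-dev" Secret as HF_TOKEN inside the Training Runtime.
flexai training run gpt2training-secrets --secret HF_TOKEN=hf-token-dev --repository-url https://github.com/flexaihq/nanoGPT/ -- train.py
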
Enables verbose logging for detailed output during training job execution.

Example:
flexai training run gpt2training-1 --dataset wikitext-2-raw-v1 --repository-url https://github.com/flexaihq/nanoGPT/ --accels 4 --secret HF_TOKEN=hf-token-dev --env BATCH_SIZE=32 -- train.py --batch-size 32 --epochs 10