Creating a Fine-tuning Job
With a Dataset available on your FlexAI account, you can now create a Fine-tuning Job that will use it.
The Model's repository
For this tutorial we will use the FlexAI fork of the nanoGPT repository 🔗, originally created by Andrej Karpathy 🔗.
A Fine-tuning Job requires at least a Name, a link to a GitHub repository where its code resides, and the path to the entry point script that will initiate the Workload.
In addition, the entry point script can be followed by any arguments required, such as configuration files or Hyperparameters.
Entry Point script arguments
The entry point script path for this quickstart tutorial is ./train.py 🔗, and it expects the following arguments:
- `config/train_shakespeare_char.py`: A configuration file, which contains the default Workload Parameters.
- `--dataset_dir`: The path within the `/input` directory of the Workload Runtime where the Dataset files are located.
- `--out_dir`: The output directory, which will be mounted into the Workload Runtime as `/output-checkpoint`.
- `--max_iters`: The maximum number of iterations to run the Workload script for (optional).
Entry Point script arguments details
These include any Environment Settings and Hyperparameters the entry point script may require. For this tutorial:
| Parameter | Type | Description |
|---|---|---|
| `config/train_shakespeare_char.py` | Environment Setting | A positional argument pointing to a configuration file used by nanoGPT’s train.py script to set the default Workload Parameters |
| `--out_dir=/output-checkpoint` | Environment Setting | The output directory where the Workload script will write checkpoint files. To take advantage of FlexAI’s Managed Checkpoints feature, this should always be `/output-checkpoint` |
| `--max_iters=1500` | Hyperparameter | The maximum number of iterations to run the Workload script for. This optional hyperparameter can be used to tweak the Workload execution |
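The positional config file supplies defaults, and the `--key=value` flags override them at launch time. A minimal sketch of that override pattern in Python (simplified and illustrative; nanoGPT's actual configurator also executes config files, so treat this only as an outline of the layering idea):

```python
from ast import literal_eval

def apply_overrides(config: dict, argv: list) -> dict:
    """Layer --key=value CLI overrides over file-provided defaults.
    Simplified sketch of the pattern nanoGPT's train.py uses."""
    for arg in argv:
        if not arg.startswith("--"):
            continue  # positional args (e.g. the config file path) are handled elsewhere
        key, _, raw = arg[2:].partition("=")
        try:
            value = literal_eval(raw)  # numbers, booleans, tuples, ...
        except (ValueError, SyntaxError):
            value = raw  # keep plain strings (e.g. paths) as-is
        config[key] = value
    return config

defaults = {"max_iters": 5000, "out_dir": "out"}
result = apply_overrides(defaults, ["--max_iters=1500", "--out_dir=/output-checkpoint"])
print(result["max_iters"])  # 1500
print(result["out_dir"])    # /output-checkpoint
```

This is why `--max_iters=1500` in the entry point arguments wins over whatever value `config/train_shakespeare_char.py` sets.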
Starting a new Fine-tuning Job
The Start a new Fine-tuning job form consists of a set of required and optional fields that you can use to customize your deployment.
To open the Start a new Fine-tuning job form, either:

- Follow the direct link to the Fine-tuning page.

Or:

1. Navigate to the Fine-tuning section from either the navigation bar or the card on the home page.
2. Select the New button to display the creation form.

A drawer menu containing the creation form will be displayed.
Required Fields
- Name: Your Fine-tuning Job name. Should follow the resource naming conventions.
- Repository URL: The URL of the Git repository containing your Fine-tuning code.
- Entry Point: The path to the entry point script in your repository that will initiate the Fine-tuning Job. The script path can be followed by any arguments you want to pass to it, such as configurations and Hyperparameters. Value:
  `train.py config/train_shakespeare_char.py --dataset_dir=my_dataset --out_dir=/output-checkpoint --max_iters=1500`
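The Entry Point value is a single command line: the script path followed by its arguments. Splitting it shows the pieces handed to the script inside the runtime (an illustrative sketch, not FlexAI's actual parsing logic):

```python
import shlex

# The full Entry Point value, exactly as entered in the form:
entry_point = (
    "train.py config/train_shakespeare_char.py "
    "--dataset_dir=my_dataset --out_dir=/output-checkpoint --max_iters=1500"
)

# The first token is the script to run; the rest are its arguments.
script, *args = shlex.split(entry_point)
print(script)   # train.py
print(args[0])  # config/train_shakespeare_char.py
```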
Other Fields
- Repository Revision: The Git revision (branch, tag, or commit) you want to use for this Fine-tuning Job. The `main` branch is used by default.
- Node Count: The number of nodes you want to use for this Fine-tuning Job. Defaults to `1`. This determines the number of Accelerators available to your Fine-tuning Job:
  - 1 node allows you to use up to 8 Accelerators.
  - Using more than 1 node makes all 8 Accelerators per node available to your Fine-tuning Job.
- Accelerator Count: The number of Accelerators you want to use for this Fine-tuning Job. Must follow the logic described above. Defaults to `1`.
- Datasets: Selected from a dropdown list of the datasets you want to use for this Fine-tuning Job. You can add multiple datasets and specify each one’s mount path within the Fine-tuning Runtime (they are mounted under `/input`). You can read more about this in the Pushing a Dataset guide.
- Environment Variables & Secrets: Add any environment variables you want to set for this Fine-tuning Job. These will be available to your Fine-tuning code as environment variables within the Training Runtime. You can also reference Secrets, which will be securely injected into the Fine-tuning Job’s Runtime.
- Cluster: The cluster where the Fine-tuning workload will run. It can be selected from a dropdown list of available clusters in your FlexAI account. A default cluster is automatically selected if none is specified.
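As noted above, datasets are mounted under `/input` inside the Fine-tuning Runtime, with the mount directory you choose appended to it. A tiny sketch of that mapping (the `dataset_path` helper is hypothetical, for illustration only):

```python
from pathlib import PurePosixPath

# Datasets are mounted under /input inside the Fine-tuning Runtime.
INPUT_ROOT = PurePosixPath("/input")

def dataset_path(mount_dir: str) -> str:
    """Hypothetical helper: the runtime path where a dataset with the
    given mount directory appears."""
    return str(INPUT_ROOT / mount_dir)

print(dataset_path("my_dataset"))  # /input/my_dataset
```

This is the path your entry point arguments should reference, e.g. `--dataset_dir=my_dataset` resolved against `/input`.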
Form Values
| Field Name | Value |
|---|---|
| Name | nanoGPT-flexai-console |
| Repository URL | https://github.com/flexaihq/nanogpt |
| Repository Revision | main |
| Node Count | 1 |
| Accelerator Count | 1 |
| Entry Point | train.py config/train_shakespeare_char.py --dataset_dir=my_dataset --out_dir=/output-checkpoint --max_iters=1500 |
| Datasets | Dataset: nanoGPT-dataset (from the CLI quickstart), Mount Directory: my_dataset |
| Cluster | Your organization’s designated cluster |
Entry Point script arguments
The entry point script for this Fine-tuning Job is train.py, and it expects the following arguments:
- `config/train_shakespeare_char.py`: A configuration file, which contains the default Fine-tuning Parameters.
- `--dataset_dir`: The path within the `/input` directory of the Fine-tuning Runtime where the Dataset files are located.
- `--out_dir`: The output directory, which will be mounted into the Fine-tuning Runtime as `/output-checkpoint`.
- `--max_iters`: The maximum number of iterations to run the Fine-tuning script for (optional).
Entry Point script arguments details
These include any Environment Settings and Hyperparameters the Fine-tuning script may require. For this tutorial:
| Parameter | Type | Description |
|---|---|---|
| `config/train_shakespeare_char.py` | Environment Setting | A positional argument pointing to a configuration file used by nanoGPT’s train.py script to set the default Fine-tuning Parameters |
| `--out_dir=/output-checkpoint` | Environment Setting | The output directory where the Fine-tuning script will write checkpoint files. To take advantage of FlexAI’s Managed Checkpoints feature, this should always be `/output-checkpoint` |
| `--max_iters=1500` | Hyperparameter | The maximum number of iterations to run the Fine-tuning script for. This optional hyperparameter can be used to tweak the Fine-tuning Job execution |
After filling out the form, select the Submit button to start the Fine-tuning Job. You should get a confirmation message indicating that the Fine-tuning Job creation process has been initiated successfully.
The Start a new Fine-tuning job form will close and you will be redirected to the Fine-tuning Jobs list page, where you can see your newly created Fine-tuning Job in the list.
Considering the minimum required elements for the creation of a Fine-tuning Job, the following command will initiate its creation and start it running immediately:

```shell
flexai training run quickstart-fine-tuning-job \
  --dataset nanogpt-dataset=my_dataset \
  --repository-url https://github.com/flexaihq/nanogpt \
  --checkpoint a1b18a7f-9b85-4c74-91a9-6aca526e8ce4 \
  --entry-point train.py config/train_shakespeare_char.py --dataset_dir=my_dataset --out_dir=/output-checkpoint --max_iters=1500
```

Zooming into the flexai training run arguments & flags
Arguments
| FlexAI command Argument | Value | Description |
|---|---|---|
| Fine-tuning Job Name | quickstart-fine-tuning-job | The name of the Fine-tuning Job |
Flags
| Flag | Value | Description |
|---|---|---|
| Dataset Name | nanogpt-dataset=my_dataset | The name of the Dataset followed by a custom name representing its mount path within the Fine-tuning Runtime: /input/my_dataset |
| Repository URL | https://github.com/flexaihq/nanogpt | The URL of the GitHub repository containing the workload’s code |
| Entry Point Script | train.py | The path of the entry point Fine-tuning script within the repository |
Entry Point script arguments
These include any Environment Settings and Hyperparameters the entry point script may require. Keep in mind that these are specific to the code you’re running:
| Entry point script argument | Type | Description |
|---|---|---|
| `config/train_shakespeare_char.py` | Environment Setting | A positional argument pointing to a configuration file used by nanoGPT’s train.py script to set the default runtime Parameters |
| `--out_dir=/output-checkpoint` | Environment Setting | The output directory where the script will write checkpoint files. To take advantage of FlexAI’s Managed Checkpoints feature, this should always be `/output-checkpoint` |
| `--max_iters=1500` | Hyperparameter | The maximum number of iterations to run. This optional hyperparameter can be used to tweak the Workload execution |
Up next
Next you'll learn how to get a Fine-tuning Job's details and monitor its progress.