
Creating a Fine-Tuning Job

This tutorial builds on top of the quickstart tutorials that cover the “Model Training” topic, so we will continue to use the FlexAI fork of the nanoGPT repository originally created by Andrej Karpathy: https://github.com/flexaihq/nanogpt.

nanoGPT’s entry point script path is train.py, and it expects a path to a configuration file as well as a few Hyperparameters to run the Training Job.
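For context on how those arguments are consumed: train.py executes nanoGPT’s configurator.py, which treats positional arguments as configuration files and --key=value arguments as overrides of the script’s defaults. A simplified sketch of that pattern (the real configurator.py adds type checking):

```python
# Simplified sketch of nanoGPT's configurator.py override pattern.
import sys
from ast import literal_eval

# Defaults normally defined near the top of train.py:
batch_size = 64
max_iters = 5000

for arg in sys.argv[1:]:
    if not arg.startswith("--"):
        # Positional argument: a config file, executed to override the defaults.
        exec(open(arg).read())
    else:
        # --key=value argument, e.g. --max_iters=1500, overrides a single default.
        key, val = arg[2:].split("=", 1)
        assert key in globals(), f"unknown config key: {key}"
        try:
            val = literal_eval(val)  # parse numbers, booleans, tuples, ...
        except (ValueError, SyntaxError):
            pass  # leave the value as a string
        globals()[key] = val
```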

  1. Log into https://console.flex.ai using your FlexAI account credentials.

  2. Navigate to the Fine-tuning section from either the navigation bar or the card on the home page.

  3. A drawer menu with the creation form will appear automatically. You can also select the New button to display the creation form.

The Start a new Fine-tuning job form consists of a set of required and optional fields that you can use to customize your Fine-tuning Job.

  • Name: Your Fine-tuning Job name. Should follow the resource naming conventions.
  • Repository URL: The URL of the Git repository containing your fine-tuning code.
  • Checkpoint: The name or ID of the Checkpoint you want this Fine-tuning Job to start from.
  • Entry Point: The path to the entry point script in your repository that will initiate the Fine-tuning Job.
    • The entry point script can be followed by any arguments you want to pass to it, such as configurations and Hyperparameters.
  • Repository Revision: The Git revision (branch, tag, or commit) you want to use for this Fine-tuning Job. The main branch will be used by default.

  • Node Count: The number of nodes you want to use for this Fine-tuning Job. Defaults to 1.

    • This determines the number of Accelerators available to your Fine-tuning Job (see the sketch after this list):
      • 1 node allows you to use up to 8 Accelerators.
      • Using more than 1 node makes all 8 Accelerators per node available to your Fine-tuning Job.
  • Accelerator Count: The number of Accelerators you want to use for this Fine-tuning Job. Must follow the logic described above. Defaults to 1.

  • Datasets: The dataset(s) to use for this Fine-tuning Job, selected from a dropdown list. You can add multiple datasets and specify each one’s mount path within the Training Runtime (they will be mounted under /input). You can read more about this in the Uploading Datasets guide.

  • Environment Variables & Secrets: Add any environment variables you want to set for this Fine-tuning Job. These will be available to your fine-tuning code as environment variables within the Training Runtime (see the example after this list).

    • You can also reference Secrets, which will be securely injected into the Fine-tuning Job’s Runtime.
  • Cluster: The cluster where the Fine-tuning workload will run. It can be selected from a dropdown list of available clusters in your FlexAI account. A default cluster will be automatically selected for you if none is specified.
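The Node Count / Accelerator Count rule can be summed up as follows (a minimal sketch of the constraint, not FlexAI’s actual validation code):

```python
def is_valid_accelerator_count(node_count: int, accelerator_count: int) -> bool:
    """Sketch of the form's constraint: a single node exposes up to 8
    Accelerators, while multi-node jobs use all 8 Accelerators on every node."""
    if node_count == 1:
        return 1 <= accelerator_count <= 8
    return accelerator_count == node_count * 8

assert is_valid_accelerator_count(1, 1)       # this tutorial's values
assert is_valid_accelerator_count(2, 16)      # multi-node: all 8 per node
assert not is_valid_accelerator_count(2, 9)   # partial multi-node counts are invalid
```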
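Inside the Training Runtime, Environment Variables and injected Secrets are read like any other process environment variable. A short example (the variable names here are hypothetical, not predefined by FlexAI):

```python
import os

# Hypothetical names, set through the form's "Environment Variables & Secrets" section.
wandb_api_key = os.environ.get("WANDB_API_KEY")    # e.g. a referenced Secret
log_level = os.environ.get("LOG_LEVEL", "info")    # e.g. a plain variable, with a default
```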

For this tutorial, fill in the form with the following values:

  • Name: nanoGPT-flexai-Fine-Tuning-quickstart
  • Repository URL: https://github.com/flexaihq/nanogpt
  • Repository Revision: main
  • Checkpoint: nanoGPT-flexai-console
  • Node Count: 1
  • Accelerator Count: 1
  • Entry Point: train.py config/train_shakespeare_char.py --dataset_dir=my_dataset --init_from=resume --out_dir=/input-checkpoint --max_iters=1500
  • Datasets: nanoGPT-dataset (from the CLI quickstart), with Mount Directory my_dataset
  • Cluster: Your organization’s designated cluster

The entry point script for this Fine-tuning Job is train.py, and it expects the following arguments:

  • config/train_shakespeare_char.py: A configuration file, which contains the default Training Parameters.
  • --dataset_dir: The path within the /input directory of the Training Runtime where the Dataset files are located.
  • --out_dir: The directory to read the Checkpoint file from. It should be /input-checkpoint. This may sound counterintuitive for an “out” directory, but it keeps the same logic as the original nanoGPT training script, which both reads and writes checkpoints in out_dir.
  • --init_from: This is the argument that will make the logic within train.py use the Checkpoint file as the starting point for the Fine-Tuning Job. Its value should be set to resume.
  • --max_iters: The maximum number of iterations to run the training script for (optional).
Entry Point script arguments details

These include any Environment Settings and Hyperparameters the training script may require. For this tutorial:

  • config/train_shakespeare_char.py (Environment Setting): A positional argument pointing to a configuration file used by nanoGPT’s train.py script to set the default Training Parameters.
  • --dataset_dir=my_dataset (Environment Setting): The path within the /input directory of the Training Runtime where the Dataset files are located.
  • --init_from=resume (Environment Setting): Tells train.py to use the mounted Checkpoint as the starting point instead of training from scratch.
  • --out_dir=/input-checkpoint (Environment Setting): The directory the training script reads the Checkpoint from. For Fine-tuning Jobs this should be /input-checkpoint, where the selected Checkpoint is mounted.
  • --max_iters=1500 (Hyperparameter): The maximum number of iterations to run the training script for. This is an optional Hyperparameter that can be used to tweak the Fine-Tuning Job execution.
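Putting these arguments together: with --init_from=resume, nanoGPT’s train.py loads ckpt.pt from out_dir and continues training from the saved iteration, while the tokenized Dataset is read from its mount point under /input. A simplified sketch of what happens inside the Training Runtime, using this tutorial’s paths (the FlexAI fork’s actual code may differ):

```python
import os
import numpy as np
import torch

init_from = "resume"
out_dir = "/input-checkpoint"   # where the selected Checkpoint is mounted
dataset_dir = "my_dataset"      # the Mount Directory chosen in the Datasets field

if init_from == "resume":
    # nanoGPT stores its checkpoint as ckpt.pt inside out_dir.
    checkpoint = torch.load(os.path.join(out_dir, "ckpt.pt"), map_location="cpu")
    iter_num = checkpoint["iter_num"]  # training resumes from this iteration

# nanoGPT reads pre-tokenized data as train.bin / val.bin memmaps.
data_dir = os.path.join("/input", dataset_dir)
train_data = np.memmap(os.path.join(data_dir, "train.bin"), dtype=np.uint16, mode="r")
```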

After filling out the form, select the Submit button to start the Fine-tuning Job. You should get a confirmation message indicating that the Fine-tuning Job creation process has been initiated successfully.

The Start a new Fine-tuning job form will close and you will be redirected to the Fine-tuning Jobs list page, where you can see your newly created Fine-tuning Job in the list.

The workflows for Fine-Tuning Jobs and Training Jobs are the same from this point onward. Below are the next steps you can take to monitor, manage, and get the output of your Fine-Tuning Job.

Since the workflows are the same, you can use the same monitoring tools and techniques for Fine-Tuning Jobs as you would for Training Jobs. Visit the Monitoring a Training Job section of the Console Training quickstart guide to learn how to monitor your Fine-Tuning Job.

You can retrieve the output of your Fine-Tuning Job in the same way you would for a Training Job. Visit the Getting a Training Job’s output section of the Console Training quickstart guide to learn how to retrieve it.

As you can see, Checkpoints are a crucial part of the Fine-Tuning process. You can learn more about FlexAI Managed Checkpoints in the Managed Checkpoints page of the FlexAI Platform documentation.