# Creating a Training Job

> Create and configure a new training job on FlexAI

With a Dataset available on your FlexAI account, you can now create a Training Job that will use it.

## The Model's repository

For this tutorial we will use the [**FlexAI fork of the nanoGPT repository**](https://github.com/flexaihq/nanogpt), originally created by [Andrej Karpathy](https://github.com/karpathy).

A Training Job requires at least a **Name**, a link to a **GitHub repository** where its code resides, and the ***path to the entry point script*** that will initiate the Workload.

In addition, the *entry point script* can be followed by any arguments required, such as configuration files or Hyperparameters.

### Entry Point script arguments

The entry point script path for this quickstart tutorial is [`./train.py`](https://github.com/flexaihq/nanoGPT/blob/main/train.py), and it expects the following arguments:

* `config/train_shakespeare_char.py`: A configuration file, which contains the default Workload Parameters.
* `--dataset_dir`: The path within the `/input` directory of the Workload Runtime where the Dataset files are located.
* `--out_dir`: The output directory, which will be mounted into the Workload Runtime as `/output-checkpoint`.
* `--max_iters`: The maximum number of iterations to run the Workload script for (optional).

<Accordion title="Entry Point script arguments details">
  These include any **Environment Settings** and **Hyperparameters** the entry point script may require. For this tutorial:

  | Parameter                          | Type                | Description |
  | ---------------------------------- | ------------------- | ----------- |
  | `config/train_shakespeare_char.py` | Environment Setting | A positional argument pointing to the configuration file that nanoGPT's `train.py` script uses to set the default Workload Parameters |
  | `--dataset_dir=my_dataset`         | Environment Setting | The path within the `/input` directory of the Workload Runtime where the Dataset files are located |
  | `--out_dir=/output-checkpoint`     | Environment Setting | The output directory where the Workload script writes checkpoint files. To take advantage of FlexAI's Managed Checkpoints feature, this should always be `/output-checkpoint` |
  | `--max_iters=1500`                 | Hyperparameter      | The maximum number of iterations to run the Workload script for. This optional hyperparameter can be used to tweak the Workload execution |
</Accordion>
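
As a rough illustration of how these arguments fit together, the sketch below assembles the entry point string from its parts and shows where the Dataset path resolves inside the Workload Runtime. The variable names (`MOUNT_DIR`, `OUT_DIR`, `MAX_ITERS`, `DATASET_PATH`, `ENTRY_POINT`) are hypothetical and exist only for this example; the values are the ones used in this tutorial.

```bash
# Hypothetical variable names, used only for illustration.
MOUNT_DIR="my_dataset"          # Dataset mount directory, relative to /input
OUT_DIR="/output-checkpoint"    # Directory the script writes checkpoints to
MAX_ITERS=1500                  # Optional hyperparameter

# Datasets are mounted under /input in the Workload Runtime.
DATASET_PATH="/input/${MOUNT_DIR}"

# The entry point script followed by its arguments.
ENTRY_POINT="train.py config/train_shakespeare_char.py --dataset_dir=${MOUNT_DIR} --out_dir=${OUT_DIR} --max_iters=${MAX_ITERS}"

echo "Dataset files: ${DATASET_PATH}"
echo "Entry point:   ${ENTRY_POINT}"
```

The resulting string matches the Entry Point value used in the rest of this tutorial.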

## Starting a new Training Job

<Tabs>
  <Tab title="Using the FlexAI Console">
    The **Start a new Training job** form consists of a set of required and optional fields that you can use to customize your Training Job.

    ### To open the **Start a new Training job** form

    Either:

    * Follow the direct link to the [Start a new Training Job](https://console.flex.ai/training/create) page.

    Or

    <Steps>
      <Step title="Navigate to Training">
        Navigate to the **Training** section from either the navigation bar or the card on the home page.
      </Step>

      <Step title="Create new job">
        Select the **New** button to display the creation form.
      </Step>
    </Steps>

    A drawer menu with the creation form will be displayed.

    ### Required Fields

    * **Name**: Your Training Job name. Should follow the [resource naming conventions](/best-practices/resource-naming-conventions/).
    * **Repository URL**: The URL of the Git repository containing your Training code.
    * **Entry Point**: The path to the entry point script in your repository that will initiate the Training Job.
      * The *entry point script* can be followed by any arguments you want to pass to it, such as configurations and Hyperparameters. **Value**: `train.py config/train_shakespeare_char.py --dataset_dir=my_dataset --out_dir=/output-checkpoint --max_iters=1500`.

    ### Other fields

    * **Repository Revision**: The Git revision (branch, tag, or commit) you want to use for this Training Job. The `main` branch will be used by default.
    * **Node Count**: The number of nodes you want to use for this Training Job. Defaults to `1`.
      * This determines the number of Accelerators available to your Training Job:
        * With 1 node, you can use up to 8 Accelerators.
        * With more than 1 node, all 8 Accelerators on every node are made available to your Training Job.
    * **Accelerator Count**: The number of Accelerators you want to use for this Training Job. Must be consistent with the Node Count rules described above. Defaults to `1`.
    * **Datasets**: Select the Datasets you want to use for this Training Job from a dropdown list. You can add multiple Datasets and specify the mount path of each one within the Training Runtime (they will be mounted under `/input`). You can read more about this in the [Pushing a Dataset guide](/core-services/training/quickstart/uploading-a-dataset/).

          <Note>
            Don't forget to select the "Add" button after picking a Dataset, otherwise it won't be added to the Training Job.
          </Note>
    * **Environment Variables & Secrets**: Add any environment variables you want to set for this Training Job. These will be available to your Training code as environment variables within the Training Runtime.
      * You can also reference **Secrets**, which will be securely injected into the Training Job's Runtime.
    * **Cluster**: The cluster on which the Training workload will run. It can be selected from a dropdown list of the clusters available in your FlexAI account. If none is specified, a default cluster is selected automatically.
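
    The Node Count and Accelerator Count rules above can be sketched as a small shell check. This is only one reading of those rules, not FlexAI's actual validation logic: the function name `total_accelerators` is hypothetical, and the sketch assumes that a single-node job uses the requested count (1 to 8) while a multi-node job uses all 8 Accelerators on every node.

    ```bash
    # Hypothetical helper: prints the total number of Accelerators a job would use.
    # Assumption: 8 Accelerators per node, as described in this guide.
    total_accelerators() {
      nodes="$1"
      requested="$2"
      if [ "$nodes" -eq 1 ]; then
        # Single node: any count from 1 to 8 is accepted.
        if [ "$requested" -ge 1 ] && [ "$requested" -le 8 ]; then
          echo "$requested"
        else
          echo "invalid: 1 node supports 1 to 8 Accelerators" >&2
          return 1
        fi
      elif [ "$nodes" -gt 1 ]; then
        # Multiple nodes: all 8 Accelerators on each node are used,
        # so the requested count does not change the total.
        echo $(( nodes * 8 ))
      else
        echo "invalid: Node Count must be at least 1" >&2
        return 1
      fi
    }

    total_accelerators 1 4   # prints 4
    total_accelerators 2 8   # prints 16
    ```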

    ### Form Values

    | Field Name              | Value                                                                                                              |
    | ----------------------- | ------------------------------------------------------------------------------------------------------------------ |
    | **Name**                | `nanoGPT-flexai-console`                                                                                           |
    | **Repository URL**      | `https://github.com/flexaihq/nanogpt`                                                                              |
    | **Repository Revision** | `main`                                                                                                             |
    | **Node Count**          | `1`                                                                                                                |
    | **Accelerator Count**   | `1`                                                                                                                |
    | **Entry Point**         | `train.py config/train_shakespeare_char.py --dataset_dir=my_dataset --out_dir=/output-checkpoint --max_iters=1500` |
    | **Datasets**            | Dataset: `nanoGPT-dataset` (from the CLI quickstart), Mount Directory: `my_dataset`                                |
    | **Cluster**             | *Your organization's designated cluster*                                                                           |

    After filling out the form, select the **Submit** button to start the Training Job. You should get a confirmation message indicating that the Training Job creation process has been initiated successfully.

    The **Start a new Training job** form will close and you will be redirected to the Training Jobs list page, where your newly created Training Job appears in the list.
  </Tab>

  <Tab title="Using the FlexAI CLI">
    The following command creates a Training Job with the minimum required elements and starts it immediately:

    ```bash
    flexai training run quickstart-training-job \
        --dataset nanoGPT-dataset=my_dataset \
        --repository-url https://github.com/flexaihq/nanogpt \
        -- train.py config/train_shakespeare_char.py --dataset_dir=my_dataset --out_dir=/output-checkpoint --max_iters=1500
    ```

    <Accordion title="Zooming into the `flexai training run` arguments & flags">
      #### Arguments

      | FlexAI command Argument | Value                     | Description                  |
      | ----------------------- | ------------------------- | ---------------------------- |
      | **Training Job Name**   | `quickstart-training-job` | The name of the Training Job |

      #### Flags

      | Flag                                    | Value                                 | Description |
      | --------------------------------------- | ------------------------------------- | ----------- |
      | **Dataset** (`--dataset`)               | `nanoGPT-dataset=my_dataset`          | The Dataset name followed by the mount path. The Dataset will be accessible at `/input/my_dataset` |
      | **Repository URL** (`--repository-url`) | `https://github.com/flexaihq/nanogpt` | The URL of the GitHub repository containing the workload's code |
      | **Entry Point Script**                  | `train.py`                            | The path to the entry point Training script within the repository, given after the `--` separator |

      #### Entry Point script arguments

      These include any **Environment Settings** and **Hyperparameters** the entry point script may require. Keep in mind that these are specific to the code you're running:

      | Entry point script argument        | Type                | Description |
      | ---------------------------------- | ------------------- | ----------- |
      | `config/train_shakespeare_char.py` | Environment Setting | A positional argument pointing to the configuration file that nanoGPT's `train.py` script uses to set the default runtime Parameters |
      | `--dataset_dir=my_dataset`         | Environment Setting | The path within the `/input` directory of the Workload Runtime where the Dataset files are located |
      | `--out_dir=/output-checkpoint`     | Environment Setting | The output directory where the script writes checkpoint files. To take advantage of FlexAI's Managed Checkpoints feature, this should always be `/output-checkpoint` |
      | `--max_iters=1500`                 | Hyperparameter      | The maximum number of iterations to run. This optional hyperparameter can be used to tweak the Workload execution |
    </Accordion>
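
    To reuse the command with different values, one option is to wrap it in a small script. The sketch below only prints the command (a dry run) instead of submitting it. The variable names are hypothetical, and the sketch uses only the `--dataset` and `--repository-url` flags shown in this guide.

    ```bash
    # Hypothetical variable names; the values are the ones used in this tutorial.
    JOB_NAME="quickstart-training-job"
    DATASET_NAME="nanoGPT-dataset"
    MOUNT_DIR="my_dataset"
    REPO_URL="https://github.com/flexaihq/nanogpt"
    MAX_ITERS=1500

    # Build the full command as a string so it can be reviewed before running.
    COMMAND="flexai training run ${JOB_NAME} --dataset ${DATASET_NAME}=${MOUNT_DIR} --repository-url ${REPO_URL} -- train.py config/train_shakespeare_char.py --dataset_dir=${MOUNT_DIR} --out_dir=/output-checkpoint --max_iters=${MAX_ITERS}"

    # Dry run: print the command instead of executing it.
    echo "${COMMAND}"
    ```

    Once the printed command looks right, it can be copied into a terminal where the FlexAI CLI is installed and authenticated.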
  </Tab>
</Tabs>

## Up next

Next you'll learn how to get a Training Job's details and monitor its progress.
