Getting a Training Job's Details

Listing Training Jobs

Use the list command to display a table of general information about every Training Job you can access through your FlexAI account:

flexai training list

This provides an output similar to the following:

NAME                    | DEVICE | NODE | ACCELERATOR |     DATASET     |             REPOSITORY              |  STATUS  | AGE
------------------------+--------+------+-------------+-----------------+-------------------------------------+----------+------
quickstart-training-job | nvidia | 1    | 1           | nanoGPT-dataset | https://github.com/flexaihq/nanogpt | building | 15s
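
If you want to work with this table in a script, the pipe-separated layout shown above is simple to split into records. The following is a minimal Python sketch, assuming the output keeps the column layout of the example. The function name `parse_training_list` is hypothetical and not part of the FlexAI CLI.

```python
def parse_training_list(table_text):
    """Parse the pipe-separated table into a list of dicts keyed by column header."""
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        # Skip the separator row, which contains only dashes and plus signs.
        if set(line.strip()) <= set("-+"):
            continue
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


# Sample copied from the example output above.
SAMPLE = """\
NAME                    | DEVICE | NODE | ACCELERATOR |     DATASET     |             REPOSITORY              |  STATUS  | AGE
------------------------+--------+------+-------------+-----------------+-------------------------------------+----------+------
quickstart-training-job | nvidia | 1    | 1           | nanoGPT-dataset | https://github.com/flexaihq/nanogpt | building | 15s
"""

jobs = parse_training_list(SAMPLE)
for job in jobs:
    print(job["NAME"], job["STATUS"])
```

Each row becomes a dictionary, so a script can, for example, select only the jobs whose STATUS column matches a given value.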

Viewing Logs

Once a Training Job begins, logs emitted during the process can be retrieved by running the flexai training logs command:

flexai training logs quickstart-training-job

This will output a stream of logs including both the FlexAI runtime execution logs and any stdout and stderr messages emitted by the training scripts.
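
Because the stream mixes runtime logs with your scripts' stdout and stderr, it can help to scan it for lines that suggest a problem. The following Python sketch shows one way to do that. The keyword pattern is an assumption about what your training scripts print, and the sample lines are placeholders rather than real FlexAI log output.

```python
import re

# Assumed markers of trouble in training-script output; adjust to your own scripts.
PROBLEM_PATTERN = re.compile(r"error|exception|traceback", re.IGNORECASE)


def find_problem_lines(lines):
    """Return (line_number, text) pairs for lines matching PROBLEM_PATTERN."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PROBLEM_PATTERN.search(line)
    ]


# Placeholder lines for illustration only; these are not real FlexAI log output.
sample_lines = [
    "placeholder: training started\n",
    "Traceback (most recent call last):\n",
    "placeholder: RuntimeError: something went wrong\n",
]

problems = find_problem_lines(sample_lines)
for number, text in problems:
    print(f"{number}: {text}")
```

To apply this to a real job, you could redirect the command's output to a file with your shell and pass the open file object to `find_problem_lines`, since a file object yields lines just like the list used here.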

Getting Detailed Training Job Information

For a closer look at a Training Job's status, use the flexai training inspect <training_job_name> command. It is especially useful for debugging:

flexai training inspect quickstart-training-job

Running the inspect command produces output similar to the following:

kind: Training
metadata:
    name: quickstart-training-job
    id: 75179cc2-ec63-4f93-b4da-44e49ea86049
    creatorUserID: 16e2894c-c81b-4a15-91d9-0e2aae00a317
    ownerOrgID: 108dddec-e922-49b8-a466-4d7ed5dcc746
config:
    device: nvidia
    nodes: 1
    accelerator: 1
    entrypoint:
        - train.py
        - config/train_shakespeare_char.py
        - --out_dir=/output-checkpoint
        - --max_iters=1500
    datasetsNames:
        - nanoGPT-dataset
    checkpointName: ""
    sourceName: ""
    repositoryURL: https://github.com/flexaihq/nanogpt
    repositoryRevision: main
    secrets: []
    environment: []
runtime:
    status: succeeded
    queuePosition: 0
    repositoryRevisionSha: 116799dbae7b0fe33caf1b90f73a72f84bc32adc
    selectedAgentId: k8s-training-sesterce-001-CLIENT-PROD-client-prod
    lifecycleEvents:
        - type: AgentSelection
          status: ResponseReceived
          message: |-
            Cluster Scheduling result{
              Name: aws-cloud
              AgentID: k8s-training-aws-001-CLIENT-PROD-client-prod
              Response: NoAnswer
              Conditions: [NonSchedulable: NoAnswer]
            }
          raisedAt: "2025-06-30T11:41:54Z"
        - type: AgentSelection
          status: ResponseReceived
          message: |-
            Cluster Scheduling result{
              Name: sesterce-h100-bm-01
              AgentID: k8s-training-sesterce-001-CLIENT-PROD-client-prod
              Response: OK
              Conditions: []
            }
          raisedAt: "2025-06-30T11:41:54Z"
        - type: AgentSelection
          status: ResponseReceived
          message: |-
            Cluster Scheduling result{
              Name: sesterce-h200-bm-01
              AgentID: k8s-training-sesterce-002-CLIENT-PROD-client-prod
              Response: NoAnswer
              Conditions: [NonSchedulable: NoAnswer]
            }
          raisedAt: "2025-06-30T11:41:54Z"
        - type: AgentSelection
          status: ResponseReceived
          message: |-
            Cluster Scheduling result{
              Name: sesterce-l40s-bm-01
              AgentID: k8s-training-sesterce-003-CLIENT-PROD-client-prod
              Response: NoAnswer
              Conditions: [NonSchedulable: NoAnswer]
            }
          raisedAt: "2025-06-30T11:41:54Z"
        - type: AgentSelection
          status: ResponseReceived
          message: |-
            Cluster Scheduling result{
              Name: sesterce-a100-bm-01
              AgentID: k8s-training-sesterce-004-CLIENT-PROD-client-prod
              Response: NoAnswer
              Conditions: [NonSchedulable: NoAnswer]
            }
          raisedAt: "2025-06-30T11:41:54Z"
        - type: AgentSelection
          status: ResponseReceived
          message: |-
            Cluster Scheduling result{
              Name: k8s-training-smc-001
              AgentID: k8s-training-smc-001-CLIENT-PROD-client-prod
              Response: NoAnswer
              Conditions: [NonSchedulable: NoAnswer, OrgNotAuthorized]
            }
          raisedAt: "2025-06-30T11:41:54Z"
        - type: AgentSelection
          status: Completed
          message: Selected agent k8s-training-sesterce-001-CLIENT-PROD-client-prod
          raisedAt: "2025-06-30T11:41:54Z"
        - type: BuildSubmission
          status: Succeeded
          message: Build request sent to flex-agent
          raisedAt: "2025-06-30T11:41:54Z"
        - type: BuildExecution
          status: Succeeded
          message: Build completed with image rg.fr-par.scw.cloud/paas-trainings-client-prod/9f9c379c-8d46-419b-8bf5-d0b0986a6dd9-arch_nvidia-1x1@sha256:0d854f75f698a549d2a8a0e024e930383b885bdac2863ee0cf74ebdc8a8f358c
          raisedAt: "2025-06-30T11:41:54Z"
        - type: TrainingPreparation
          status: Succeeded
          message: Training trainings-client-prod/training-75b79cc2-ec63-4f93-b4da-44e49a4a6049-zqg6d created
          raisedAt: "2025-06-30T11:41:54Z"
        - type: TrainingExecution
          status: InProgress
          message: Training in progress
          raisedAt: "2025-06-30T11:42:00Z"
        - type: TrainingExecution
          status: Succeeded
          message: Training complete, output available
          raisedAt: "2025-06-30T11:43:48Z"
    createdAt: "2025-06-30T11:41:54Z"
    lastUpdate: "2025-06-30T11:43:48Z"
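
When debugging, a condensed timeline of the lifecycle events and the total elapsed time are often easier to read than the full output. The following Python sketch extracts both from an abridged excerpt of the example above. It assumes the field layout shown in that example and uses regular expressions to stay dependency-free. The output resembles YAML, so a YAML parser would likely be a more robust choice. The function names are hypothetical.

```python
import re
from datetime import datetime

# Abridged excerpt of the example inspect output above.
EXCERPT = '''\
runtime:
    status: succeeded
    lifecycleEvents:
        - type: BuildSubmission
          status: Succeeded
          message: Build request sent to flex-agent
          raisedAt: "2025-06-30T11:41:54Z"
        - type: TrainingExecution
          status: InProgress
          message: Training in progress
          raisedAt: "2025-06-30T11:42:00Z"
        - type: TrainingExecution
          status: Succeeded
          message: Training complete, output available
          raisedAt: "2025-06-30T11:43:48Z"
    createdAt: "2025-06-30T11:41:54Z"
    lastUpdate: "2025-06-30T11:43:48Z"
'''

# Matches one lifecycle event; DOTALL lets multi-line messages be captured too.
EVENT_RE = re.compile(
    r'- type: (?P<type>\S+)\s+status: (?P<status>\S+)\s+'
    r'message: (?P<message>.*?)\s+raisedAt: "(?P<raisedAt>[^"]+)"',
    re.DOTALL,
)


def parse_timestamp(text):
    """Parse timestamps in the form used by the example, e.g. 2025-06-30T11:41:54Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


def lifecycle_events(output):
    """Return one dict per lifecycle event, in the order they appear."""
    return [match.groupdict() for match in EVENT_RE.finditer(output)]


def elapsed_seconds(output):
    """Seconds between the createdAt and lastUpdate timestamps."""
    created = re.search(r'createdAt: "([^"]+)"', output).group(1)
    updated = re.search(r'lastUpdate: "([^"]+)"', output).group(1)
    return (parse_timestamp(updated) - parse_timestamp(created)).total_seconds()


events = lifecycle_events(EXCERPT)
for event in events:
    print(event["raisedAt"], event["type"], event["status"])
print("elapsed seconds:", elapsed_seconds(EXCERPT))
```

For the example job, createdAt is 11:41:54 and lastUpdate is 11:43:48, so the elapsed time is 114 seconds.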

Infrastructure & Workload Metrics

The Infrastructure Metrics Dashboard page explains how to monitor the infrastructure resources used by your Workloads.
The TensorBoard page explains how to use FlexAI’s hosted TensorBoard to visualize and analyze your Workloads’ performance.