
Getting a Training Job's Details

You can use the flexai training list command to get a table with general information about all the Training Jobs you have access to through your FlexAI account:

flexai training list

This provides an output similar to the following:

NAME                    | DEVICE | NODE | ACCELERATOR | DATASET         | REPOSITORY                          | STATUS   | AGE
------------------------+--------+------+-------------+-----------------+-------------------------------------+----------+------
quickstart-training-job | nvidia | 1    | 1           | nanoGPT-dataset | https://github.com/flexaihq/nanogpt | building | 15s
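
Because the table is pipe-delimited, standard text tools can filter it. The following is a minimal sketch, assuming the column layout shown above stays stable (NAME in column 1, STATUS in column 7) and that the table is written to standard output. The function name jobs_with_status is hypothetical, not part of the FlexAI CLI, and the sample input is copied from the example output rather than read from a live account.

```shell
#!/bin/sh
# jobs_with_status STATUS
# Reads a `flexai training list` table on stdin and prints the NAME of every
# row whose STATUS column equals STATUS.
# Assumption: NAME is column 1 and STATUS is column 7, as in the example table.
jobs_with_status() {
    awk -F'|' -v want="$1" '
        NR <= 2 { next }                       # skip the header and separator rows
        {
            name = $1; status = $7
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around each cell
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status == want) print name
        }
    '
}

# Sample input copied from the example output above.
sample_table='NAME                    | DEVICE | NODE | ACCELERATOR | DATASET         | REPOSITORY                          | STATUS   | AGE
------------------------+--------+------+-------------+-----------------+-------------------------------------+----------+------
quickstart-training-job | nvidia | 1    | 1           | nanoGPT-dataset | https://github.com/flexaihq/nanogpt | building | 15s'

printf '%s\n' "$sample_table" | jobs_with_status building
```

With a live account, the same function could be fed directly from the CLI, for example flexai training list | jobs_with_status building.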

Once a Training Job starts, you can retrieve the logs it emits by running the flexai training logs command:

flexai training logs quickstart-training-job

This will output a stream of logs including both the FlexAI runtime execution logs and any stdout and stderr messages emitted by the training scripts.

For a deeper look at a Training Job's status, use the flexai training inspect <training_job_name> command, which is especially useful for debugging:

flexai training inspect quickstart-training-job

The inspect command produces output similar to the following example, shown in YAML, the default output format:

kind: Training
metadata:
  name: quickstart-training-job
  id: 75179cc2-ec63-4f93-b4da-44e49ea86049
  creatorUserID: 16e2894c-c81b-4a15-91d9-0e2aae00a317
  ownerOrgID: 108dddec-e922-49b8-a466-4d7ed5dcc746
config:
  device: nvidia
  nodes: 1
  accelerator: 1
  entrypoint:
    - train.py
    - config/train_shakespeare_char.py
    - --out_dir=/output-checkpoint
    - --max_iters=1500
  datasetsNames:
    - nanoGPT-dataset
  checkpointName: ""
  sourceName: ""
  repositoryURL: https://github.com/flexaihq/nanogpt
  repositoryRevision: main
  secrets: []
  environment: []
runtime:
  status: succeeded
  queuePosition: 0
  repositoryRevisionSha: 116799dbae7b0fe33caf1b90f73a72f84bc32adc
  selectedAgentId: k8s-training-sesterce-001-CLIENT-PROD-client-prod
  lifecycleEvents:
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
        Name: aws-cloud
        AgentID: k8s-training-aws-001-CLIENT-PROD-client-prod
        Response: NoAnswer
        Conditions: [NonSchedulable: NoAnswer]
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
        Name: sesterce-h100-bm-01
        AgentID: k8s-training-sesterce-001-CLIENT-PROD-client-prod
        Response: OK
        Conditions: []
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
        Name: sesterce-h200-bm-01
        AgentID: k8s-training-sesterce-002-CLIENT-PROD-client-prod
        Response: NoAnswer
        Conditions: [NonSchedulable: NoAnswer]
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
        Name: sesterce-l40s-bm-01
        AgentID: k8s-training-sesterce-003-CLIENT-PROD-client-prod
        Response: NoAnswer
        Conditions: [NonSchedulable: NoAnswer]
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
        Name: sesterce-a100-bm-01
        AgentID: k8s-training-sesterce-004-CLIENT-PROD-client-prod
        Response: NoAnswer
        Conditions: [NonSchedulable: NoAnswer]
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
        Name: k8s-training-smc-001
        AgentID: k8s-training-smc-001-CLIENT-PROD-client-prod
        Response: NoAnswer
        Conditions: [NonSchedulable: NoAnswer, OrgNotAuthorized]
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: Completed
      message: Selected agent k8s-training-sesterce-001-CLIENT-PROD-client-prod
      raisedAt: "2025-06-30T11:41:54Z"
    - type: BuildSubmission
      status: Succeeded
      message: Build request sent to flex-agent
      raisedAt: "2025-06-30T11:41:54Z"
    - type: BuildExecution
      status: Succeeded
      message: Build completed with image rg.fr-par.scw.cloud/paas-trainings-client-prod/9f9c379c-8d46-419b-8bf5-d0b0986a6dd9-arch_nvidia-1x1@sha256:0d854f75f698a549d2a8a0e024e930383b885bdac2863ee0cf74ebdc8a8f358c
      raisedAt: "2025-06-30T11:41:54Z"
    - type: TrainingPreparation
      status: Succeeded
      message: Training trainings-client-prod/training-75b79cc2-ec63-4f93-b4da-44e49a4a6049-zqg6d created
      raisedAt: "2025-06-30T11:41:54Z"
    - type: TrainingExecution
      status: InProgress
      message: Training in progress
      raisedAt: "2025-06-30T11:42:00Z"
    - type: TrainingExecution
      status: Succeeded
      message: Training complete, output available
      raisedAt: "2025-06-30T11:43:48Z"
  createdAt: "2025-06-30T11:41:54Z"
  lastUpdate: "2025-06-30T11:43:48Z"
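
When debugging, the lifecycleEvents list is usually the most informative part of this output, and it can be condensed into one line per event. The following is a rough sketch, assuming each event carries type, status, and raisedAt keys in that order, as in the example above. The function name lifecycle_summary is hypothetical, and the sample input is an abridged excerpt of the example output.

```shell
#!/bin/sh
# lifecycle_summary
# Reads `flexai training inspect` YAML on stdin and prints one
# "type status raisedAt" line per lifecycle event.
# It matches on key names only, so it does not depend on exact indentation.
lifecycle_summary() {
    awk '
        $1 == "-" && $2 == "type:"      { type = $3; next }   # start of an event
        type != "" && $1 == "status:"   { status = $2; next } # status of that event
        type != "" && $1 == "raisedAt:" {
            gsub(/"/, "", $2)            # drop the quotes around the timestamp
            print type, status, $2
            type = ""                    # wait for the next event
        }
    '
}

# Abridged excerpt of the example output above.
sample_inspect='runtime:
  status: succeeded
  lifecycleEvents:
    - type: AgentSelection
      status: Completed
      message: Selected agent k8s-training-sesterce-001-CLIENT-PROD-client-prod
      raisedAt: "2025-06-30T11:41:54Z"
    - type: TrainingExecution
      status: InProgress
      message: Training in progress
      raisedAt: "2025-06-30T11:42:00Z"
    - type: TrainingExecution
      status: Succeeded
      message: Training complete, output available
      raisedAt: "2025-06-30T11:43:48Z"'

printf '%s\n' "$sample_inspect" | lifecycle_summary
# AgentSelection Completed 2025-06-30T11:41:54Z
# TrainingExecution InProgress 2025-06-30T11:42:00Z
# TrainingExecution Succeeded 2025-06-30T11:43:48Z
```

With a live Training Job, the function could be fed from the CLI, for example flexai training inspect quickstart-training-job | lifecycle_summary. A text filter like this is only a convenience for quick inspection; a real YAML parser is a safer choice for anything that depends on the full document structure.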

  • The Infrastructure Metrics Dashboard page provides insights into how you can monitor infrastructure resources used by your Workloads.
  • The TensorBoard page provides more information on how you can use FlexAI’s hosted TensorBoard to visualize and analyze your Workloads’ performance.