Getting a Training Job's Details
Listing Training Jobs
You can use the flexai training list command to get a table with general information about all the Training Jobs you have access to through your FlexAI account:
flexai training list
This provides an output similar to the following:
NAME                    | DEVICE | NODE | ACCELERATOR | DATASET         | REPOSITORY                          | STATUS   | AGE
------------------------+--------+------+-------------+-----------------+-------------------------------------+----------+------
quickstart-training-job | nvidia | 1    | 1           | nanoGPT-dataset | https://github.com/flexaihq/nanogpt | building | 15s
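If you want to work with this table in a script, the following is a minimal Python sketch that turns it into one dictionary per Training Job. It assumes the output is laid out exactly like the sample above (pipe-delimited columns with a dashed separator row); the function name parse_training_table is hypothetical and not part of the FlexAI CLI.

```python
def parse_training_table(text: str) -> list[dict[str, str]]:
    """Parse pipe-delimited table text into one dict per Training Job.

    Assumes the layout shown on this page: a header row, a dashed
    separator row, then one row per job.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        # Skip the separator row, which contains only '-' and '+'.
        if set(line.strip()) <= set("-+"):
            continue
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


# Sample copied from the example output on this page.
SAMPLE = """\
NAME                    | DEVICE | NODE | ACCELERATOR | DATASET         | REPOSITORY                          | STATUS   | AGE
------------------------+--------+------+-------------+-----------------+-------------------------------------+----------+------
quickstart-training-job | nvidia | 1    | 1           | nanoGPT-dataset | https://github.com/flexaihq/nanogpt | building | 15s
"""

jobs = parse_training_table(SAMPLE)
for job in jobs:
    print(job["NAME"], job["STATUS"])
```

Keying each row by its column header keeps the script readable, but it depends on the column names staying the same between CLI versions.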
Viewing Logs
Once a Training Job begins, logs emitted during the process can be retrieved by running the flexai training logs command:
flexai training logs quickstart-training-job
This outputs a stream of logs that includes both the FlexAI runtime execution logs and any stdout and stderr messages emitted by the training scripts.
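To keep a copy of the logs while still watching them in the terminal, you could wrap the command in a small script. The sketch below is one possible approach, not a FlexAI feature: stream_to_file is a hypothetical helper that runs any command, echoes each output line, and appends it to a file. Whether the logs command exits on its own or keeps streaming is not stated here, so the helper simply reads until the process ends.

```python
import subprocess
import sys
from pathlib import Path


def stream_to_file(command: list[str], log_path: Path) -> int:
    """Run command, echo each output line, and append it to log_path.

    Returns the exit code of the command.
    """
    with log_path.open("a", encoding="utf-8") as log_file:
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the same stream
            text=True,
        )
        assert process.stdout is not None
        for line in process.stdout:
            sys.stdout.write(line)
            log_file.write(line)
        return process.wait()


# Example call, using only the command documented on this page:
# stream_to_file(
#     ["flexai", "training", "logs", "quickstart-training-job"],
#     Path("quickstart-training-job.log"),
# )
```

The example call is left commented out because it requires the flexai CLI to be installed and authenticated.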
Getting detailed Training Job information
You can take a deeper look at a Training Job's status using the flexai training inspect <training_job_name> command, which is especially useful for debugging:
flexai training inspect quickstart-training-job
Below is an example of the output of the inspect command:
kind: Training
metadata:
  name: quickstart-training-job
  id: 75179cc2-ec63-4f93-b4da-44e49ea86049
  creatorUserID: 16e2894c-c81b-4a15-91d9-0e2aae00a317
  ownerOrgID: 108dddec-e922-49b8-a466-4d7ed5dcc746
config:
  device: nvidia
  nodes: 1
  accelerator: 1
  entrypoint:
    - train.py
    - config/train_shakespeare_char.py
    - --out_dir=/output-checkpoint
    - --max_iters=1500
  datasetsNames:
    - nanoGPT-dataset
  checkpointName: ""
  sourceName: ""
  repositoryURL: https://github.com/flexaihq/nanogpt
  repositoryRevision: main
  secrets: []
  environment: []
runtime:
  status: succeeded
  queuePosition: 0
  repositoryRevisionSha: 116799dbae7b0fe33caf1b90f73a72f84bc32adc
  selectedAgentId: k8s-training-sesterce-001-CLIENT-PROD-client-prod
  lifecycleEvents:
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
          Name: aws-cloud
          AgentID: k8s-training-aws-001-CLIENT-PROD-client-prod
          Response: NoAnswer
          Conditions: [NonSchedulable: NoAnswer]
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
          Name: sesterce-h100-bm-01
          AgentID: k8s-training-sesterce-001-CLIENT-PROD-client-prod
          Response: OK
          Conditions: []
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
          Name: sesterce-h200-bm-01
          AgentID: k8s-training-sesterce-002-CLIENT-PROD-client-prod
          Response: NoAnswer
          Conditions: [NonSchedulable: NoAnswer]
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
          Name: sesterce-l40s-bm-01
          AgentID: k8s-training-sesterce-003-CLIENT-PROD-client-prod
          Response: NoAnswer
          Conditions: [NonSchedulable: NoAnswer]
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
          Name: sesterce-a100-bm-01
          AgentID: k8s-training-sesterce-004-CLIENT-PROD-client-prod
          Response: NoAnswer
          Conditions: [NonSchedulable: NoAnswer]
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: ResponseReceived
      message: |-
        Cluster Scheduling result{
          Name: k8s-training-smc-001
          AgentID: k8s-training-smc-001-CLIENT-PROD-client-prod
          Response: NoAnswer
          Conditions: [NonSchedulable: NoAnswer, OrgNotAuthorized]
        }
      raisedAt: "2025-06-30T11:41:54Z"
    - type: AgentSelection
      status: Completed
      message: Selected agent k8s-training-sesterce-001-CLIENT-PROD-client-prod
      raisedAt: "2025-06-30T11:41:54Z"
    - type: BuildSubmission
      status: Succeeded
      message: Build request sent to flex-agent
      raisedAt: "2025-06-30T11:41:54Z"
    - type: BuildExecution
      status: Succeeded
      message: Build completed with image rg.fr-par.scw.cloud/paas-trainings-client-prod/9f9c379c-8d46-419b-8bf5-d0b0986a6dd9-arch_nvidia-1x1@sha256:0d854f75f698a549d2a8a0e024e930383b885bdac2863ee0cf74ebdc8a8f358c
      raisedAt: "2025-06-30T11:41:54Z"
    - type: TrainingPreparation
      status: Succeeded
      message: Training trainings-client-prod/training-75b79cc2-ec63-4f93-b4da-44e49a4a6049-zqg6d created
      raisedAt: "2025-06-30T11:41:54Z"
    - type: TrainingExecution
      status: InProgress
      message: Training in progress
      raisedAt: "2025-06-30T11:42:00Z"
    - type: TrainingExecution
      status: Succeeded
      message: Training complete, output available
      raisedAt: "2025-06-30T11:43:48Z"
  createdAt: "2025-06-30T11:41:54Z"
  lastUpdate: "2025-06-30T11:43:48Z"
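When debugging, the lifecycleEvents entries are often the most useful part of this output. As a rough illustration, the Python sketch below pulls the type, status, and raisedAt fields out of inspect output and computes how long the TrainingExecution phase took. It assumes the output follows the structure of the example above; parse_events and training_duration_seconds are hypothetical helpers, and the sample is an abridged excerpt of that example. A YAML library would be a more robust choice if one is available in your environment.

```python
import re
from datetime import datetime
from typing import Optional

# Matches one lifecycle event: "- type: ...", "status: ...", then the
# first "raisedAt" that follows (the message may span several lines).
EVENT_PATTERN = re.compile(
    r'-\s+type:\s*(?P<type>\S+)\s+'
    r'status:\s*(?P<status>\S+)\s+'
    r'message:.*?'
    r'raisedAt:\s*"(?P<raised_at>[^"]+)"',
    re.DOTALL,
)


def parse_events(inspect_output: str) -> list[dict[str, str]]:
    """Return one dict (type, status, raised_at) per lifecycle event."""
    return [m.groupdict() for m in EVENT_PATTERN.finditer(inspect_output)]


def training_duration_seconds(events: list[dict[str, str]]) -> Optional[float]:
    """Seconds between TrainingExecution InProgress and Succeeded, if both exist."""
    stamps = {
        event["status"]: datetime.strptime(event["raised_at"], "%Y-%m-%dT%H:%M:%SZ")
        for event in events
        if event["type"] == "TrainingExecution"
    }
    if "InProgress" in stamps and "Succeeded" in stamps:
        return (stamps["Succeeded"] - stamps["InProgress"]).total_seconds()
    return None


# Abridged excerpt of the example output on this page.
SAMPLE = '''\
runtime:
  status: succeeded
  lifecycleEvents:
    - type: BuildSubmission
      status: Succeeded
      message: Build request sent to flex-agent
      raisedAt: "2025-06-30T11:41:54Z"
    - type: TrainingExecution
      status: InProgress
      message: Training in progress
      raisedAt: "2025-06-30T11:42:00Z"
    - type: TrainingExecution
      status: Succeeded
      message: Training complete, output available
      raisedAt: "2025-06-30T11:43:48Z"
'''

events = parse_events(SAMPLE)
print(training_duration_seconds(events))  # prints 108.0
```

For the example job, the two TrainingExecution timestamps are 11:42:00 and 11:43:48, which gives a training phase of 108 seconds.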
Infrastructure & Workload Metrics
- The Infrastructure Metrics Dashboard page provides insights into how you can monitor infrastructure resources used by your Workloads.
- The TensorBoard page provides more information on how you can use FlexAI’s hosted TensorBoard to visualize and analyze your Workloads’ performance.