
Monitoring a Training Job's Progress

On the Training Jobs list page, select the gear icon ⚙️ (labeled Configure) in the Actions field to open the “Details” drawer. Then select the Logs tab to view the messages written to standard output (stdout) by the runtime environment build process and by your Training Job’s code.
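Because the Logs tab shows what your code writes to stdout, it can help to print short, consistently formatted progress lines from the training script. The following is a minimal sketch, not FlexAI's prescribed format: the `[train]` prefix and the `format_progress` and `log_progress` helpers are hypothetical names chosen for illustration. Passing `flush=True` asks Python to write each line immediately instead of holding it in a buffer.

```python
import sys


def format_progress(step: int, total_steps: int, loss: float) -> str:
    """Build one progress line (hypothetical format, chosen for this example)."""
    return f"[train] step={step}/{total_steps} loss={loss:.4f}"


def log_progress(step: int, total_steps: int, loss: float) -> None:
    """Write a progress line to stdout and flush it right away."""
    print(format_progress(step, total_steps, loss), file=sys.stdout, flush=True)


if __name__ == "__main__":
    # Placeholder loss values standing in for a real training loop.
    for step, loss in enumerate([0.9, 0.7, 0.5], start=1):
        log_progress(step, 3, loss)
```

A fixed prefix such as `[train]` gives you a stable keyword to look for among the build-process messages.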

Use the Search bar to filter the logs by keyword. This helps you find relevant messages quickly.
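If you have copied log output into a local text file, the same kind of keyword filtering can be approximated offline. This is a sketch under assumptions: it uses a case-insensitive substring match, which may differ from how the Search bar matches, and `filter_logs` is a hypothetical helper, not part of any FlexAI tooling.

```python
from typing import Iterable, List


def filter_logs(lines: Iterable[str], keyword: str) -> List[str]:
    """Return the lines containing `keyword`, ignoring case.

    Assumption: plain substring matching; the Search bar may behave differently.
    """
    needle = keyword.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


if __name__ == "__main__":
    # Made-up sample lines, used only to demonstrate the helper.
    sample = [
        "Installing dependencies\n",
        "[train] step=1/3 loss=0.9000\n",
        "ERROR: checkpoint directory not found\n",
    ]
    for line in filter_logs(sample, "error"):
        print(line)
```

To filter a saved file, pass an open file object as `lines`, for example `filter_logs(open("job.log"), "loss")`, where `job.log` is a placeholder file name.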

You can monitor the infrastructure metrics of your Training Job using the FlexAI Infrastructure Monitor. This will give you insights into the resource usage of your Training Job, such as CPU and memory usage, disk I/O, and network traffic.

Access FlexAI’s Infrastructure Monitor at https://dashboards.flex.ai/. Visit the FlexAI Infrastructure Monitor page to learn more.

You can also use FlexAI’s hosted TensorBoard to visualize the training process of your model. TensorBoard provides a suite of tools for inspecting and understanding your Training Job’s evolution.

Visit https://dashboards.flex.ai/tensorboard and log in using your credentials. Learn more at the FlexAI TensorBoard page.

After a few minutes, your Training Job should complete successfully. The next step of this Quickstart Tutorial guides you through retrieving its output.