Interactive Training Session

The flexai training debug-ssh command allows you to start an Interactive Training Job on a Training Runtime similar to that of regular Training Jobs. This command will allocate the required resources and then set up an Interactive Training Runtime you can connect to via SSH.

Access to this debug Interactive Training Runtime lets you iterate quickly: you can edit scripts, modify files, review logs and outputs, and push your changes to your GitHub repository. In general, it gives you an inside-out look at your Training results as they happen, before you commit to running a Training Job.

Prerequisites

The environment in which the flexai training debug-ssh command is run should either:

Have an ssh-agent running and SSH keys loaded into it, or
Have an SSH key pair available
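
As a quick way to tell which of the two cases applies, the following sketch interprets the exit status of ssh-add -l. The status meanings (0 for keys listed, 1 for a failed command such as an agent with no identities, 2 for an unreachable agent) come from the OpenSSH ssh-add manual; the function name describe_agent_status is made up for this example.

```shell
# Map the exit status of `ssh-add -l` to a human-readable state.
# Per the OpenSSH manual: 0 = keys listed, 1 = command failed (for example,
# the agent holds no identities), 2 = the agent could not be contacted.
describe_agent_status() {
    case "$1" in
        0) echo "agent-with-keys" ;;
        1) echo "agent-without-keys" ;;
        2) echo "no-agent" ;;
        *) echo "unknown" ;;
    esac
}

# Capture the exit status without aborting if the command fails.
agent_rc=0
ssh-add -l >/dev/null 2>&1 || agent_rc=$?
echo "ssh-agent state: $(describe_agent_status "$agent_rc")"
```

If the reported state is anything other than agent-with-keys, either load keys into an agent or fall back to the SSH key pair option.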

Using an ssh-agent

If an ssh-agent is not running and you would like to use one to push your changes from the Interactive Training Runtime to GitHub, run eval $(ssh-agent) in your terminal, then load your keys into it by running ssh-add <path_to_private_key>. You can then confirm that the keys have been loaded using ssh-add -l and start your Interactive Training Job, which will automatically use your ssh-agent.

If SSH key pairs authenticating you to GitHub are loaded into your ssh-agent, you will be able to push your changes to GitHub from the Interactive Training Runtime by enabling the ForwardAgent option in your SSH configuration file (~/.ssh/config):

Host debug-gw.flex.ai
  ForwardAgent yes
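
If you would rather script this change than edit the file by hand, a minimal sketch follows. It appends the block above only when no matching Host line exists yet. The function name ensure_forward_agent is hypothetical, and the check is deliberately simple: it looks for an exact Host debug-gw.flex.ai line and will not detect entries that are indented or that list several host patterns.

```shell
# Append the Host block to an SSH config file unless an entry for the
# gateway host is already present. Takes the config file path as $1.
ensure_forward_agent() {
    config_file=$1
    # Create the file if needed so grep has something to read.
    [ -f "$config_file" ] || : > "$config_file"
    if grep -q '^Host debug-gw\.flex\.ai$' "$config_file"; then
        echo "entry already present in $config_file"
    else
        printf '\nHost debug-gw.flex.ai\n  ForwardAgent yes\n' >> "$config_file"
        echo "entry added to $config_file"
    fi
}

# Example (commented out so nothing is modified by accident):
# ensure_forward_agent "$HOME/.ssh/config"
```

Running the function twice leaves a single entry in the file, so it is safe to include in a setup script.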

To verify that an ssh-agent is running in your local environment and has your private keys loaded, run:

ssh-add -L

You can load more keys into your ssh-agent with:

ssh-add path/to/private/key

Using SSH key pairs

When not using the ssh-agent, you must set the path to the public key you will use through the --authorized-keys flag, e.g.:

flexai training debug-ssh --repository-url https://github.com/flexaihq/nanoGPT --vscode --authorized-keys ~/.ssh/id_ed25519.pub
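
If you are unsure which public key is available, the sketch below looks for one under common OpenSSH default file names and prints the resulting command rather than running it. The function name find_public_key and the search order are assumptions of this example; only the flags shown above are taken from the CLI.

```shell
# Print the first public key found in the given directory, trying common
# OpenSSH default file names in order. Returns 1 if none exists.
find_public_key() {
    key_dir=$1
    for name in id_ed25519.pub id_ecdsa.pub id_rsa.pub; do
        if [ -f "$key_dir/$name" ]; then
            echo "$key_dir/$name"
            return 0
        fi
    done
    return 1
}

if pub_key=$(find_public_key "$HOME/.ssh"); then
    # Print the command instead of running it, so it can be reviewed first.
    echo "flexai training debug-ssh --repository-url https://github.com/flexaihq/nanoGPT --vscode --authorized-keys $pub_key"
else
    echo "no public key found in $HOME/.ssh" >&2
fi
```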

Starting an Interactive Training Job

In its simplest form, starting an Interactive Training Job requires only the debug-ssh subcommand with the --repository-url flag. The following example also passes the optional --vscode flag, described under Attaching through VSCode:

flexai training debug-ssh --repository-url https://github.com/flexaihq/nanoGPT --vscode

Information on the various stages of the Interactive Training Job will be displayed in the terminal. Once the job is ready, you will be provided with the SSH command to connect to the Interactive Training Runtime. Example output:

Interactive training interactive-training-9fe8631b-aa6f-4f92-8ed4-b8a16df810b5 launching...
✅ Looking for an interactive training builder

[Node 0] To connect using ssh:
    ssh -o ForwardAgent=yes -p 44417 flexai@debug-gw.flex.ai
[Node 0] To open VSCode, click the following URL (or open in your browser if not clickable):
    "vscode://vscode-remote/ssh-remote+flexai@debug-gw.flex.ai:44417/workspace?windowId=_blank"
✅ Automatically configuring ~/.ssh/known_hosts
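
If you save this output to a file or pipe it through a script, the SSH command and port can be pulled out with sed. This is a sketch that assumes the output keeps the layout shown above, which may change between CLI versions; the function names are made up for this example.

```shell
# Extract the ssh command from `debug-ssh` output read on stdin.
extract_ssh_command() {
    sed -n 's/^[[:space:]]*\(ssh -o .*\)$/\1/p'
}

# Extract only the port passed to `-p` in that ssh command.
extract_ssh_port() {
    extract_ssh_command | sed -n 's/.* -p \([0-9][0-9]*\) .*/\1/p'
}

# Sample lines copied from the example output above.
sample_output='[Node 0] To connect using ssh:
    ssh -o ForwardAgent=yes -p 44417 flexai@debug-gw.flex.ai
[Node 0] To open VSCode, click the following URL (or open in your browser if not clickable):
    "vscode://vscode-remote/ssh-remote+flexai@debug-gw.flex.ai:44417/workspace?windowId=_blank"'

printf '%s\n' "$sample_output" | extract_ssh_command
printf '%s\n' "$sample_output" | extract_ssh_port
```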

Attaching through VSCode

You can also attach to the Interactive Training Runtime using VSCode. To do so, you will need to install the Remote - SSH extension.

Once installed, you can connect to the Interactive Training Runtime by simply clicking on the VSCode URL provided in the terminal output. This will open a new VSCode window with the SSH connection to the Interactive Training Runtime already established.

The --vscode flag can be used to automatically open a VSCode window into the Interactive Training Runtime, which is useful if the VSCode URL is not clickable in your terminal.

Lifetime of an Interactive Training Job

Session timeout

Interactive Training Jobs are automatically stopped once the session timeout elapses with no active SSH sessions into the Interactive Training Runtime. The timeout can be set with the --session-timeout flag and defaults to 600 seconds (10 minutes).
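
Since the default is expressed in seconds, a longer timeout is easiest to work out from minutes. The sketch below assumes that --session-timeout accepts a plain integer number of seconds, as the default suggests; check the command's help output before relying on it. The helper name minutes_to_seconds is hypothetical, and the command is printed rather than executed.

```shell
# Convert a number of minutes to seconds for use with --session-timeout.
minutes_to_seconds() {
    echo $(( $1 * 60 ))
}

# Example: a 30-minute timeout.
timeout_seconds=$(minutes_to_seconds 30)

# Print the command instead of running it, so it can be reviewed first.
echo "flexai training debug-ssh --repository-url https://github.com/flexaihq/nanoGPT --session-timeout $timeout_seconds"
```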

Stopping a running Interactive Training Job

Interactive Training Jobs can be manually stopped like regular Training Jobs by using the flexai training stop command:

flexai training stop <interactive_training_job_name>

Troubleshooting

The Interactive Training Job fails to start

You can check the Interactive Training session logs, just like a regular Training Job:

flexai training logs <training_job_name>

The same applies to inspecting a Training Session to get more information on its Lifecycle:

flexai training inspect <training_job_name>
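
When reporting a failed start, it can help to capture both outputs in files. The following sketch does that, assuming both commands exit on their own rather than streaming indefinitely. The function name collect_diagnostics, the output file names, and the FLEXAI_BIN variable are inventions of this example and not features of the CLI.

```shell
# FLEXAI_BIN is a variable of this sketch only (not a CLI feature); it lets
# the binary be swapped out, for example for a dry run with `echo`.
FLEXAI_BIN=${FLEXAI_BIN:-flexai}

# Save the logs and inspect output of a job ($1) into a directory ($2).
collect_diagnostics() {
    job_name=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    "$FLEXAI_BIN" training logs "$job_name" > "$out_dir/logs.txt" 2>&1
    "$FLEXAI_BIN" training inspect "$job_name" > "$out_dir/inspect.txt" 2>&1
    echo "diagnostics written to $out_dir"
}

# Example:
# collect_diagnostics <training_job_name> ./debug-ssh-diagnostics
```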

SSH connection issues

When using an ssh-agent

Make sure that the ssh-agent is running and has the correct keys loaded. You can verify this by running ssh-add -L and checking that the keys you expect are listed.

A common issue is that the ssh-agent in your environment is not the same as the one used by the Interactive Training Job. This can happen if you have multiple ssh-agent instances running and the SSH_AUTH_SOCK environment variable points to a different agent than the one configured in your SSH configuration file (~/.ssh/config). When loading keys into your agent with ssh-add /path/to/your-key, or verifying which keys are loaded with ssh-add -L, make sure that SSH_AUTH_SOCK is set to the same socket path as the one used by the Interactive Training Job, which you can find by running:

ssh -G debug-gw.flex.ai | awk '/^identityagent/ {print $2}'
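
The comparison itself can be scripted. The sketch below compares SSH_AUTH_SOCK with the IdentityAgent value reported by the command above. Per the OpenSSH ssh_config manual, an unset IdentityAgent or the literal value SSH_AUTH_SOCK means ssh reads the socket from the environment, and none disables agent use. The function name compare_agent_sockets is hypothetical, and the comparison is a plain string match, so paths written with ~ or environment variables in your configuration may need to be expanded by hand.

```shell
# Compare the agent socket from the environment ($1) with the IdentityAgent
# value reported by `ssh -G` ($2). An empty or literal "SSH_AUTH_SOCK" value
# means ssh falls back to the environment variable.
compare_agent_sockets() {
    env_sock=$1
    cfg_sock=$2
    if [ -z "$cfg_sock" ] || [ "$cfg_sock" = "SSH_AUTH_SOCK" ]; then
        echo "ssh uses SSH_AUTH_SOCK from the environment"
    elif [ "$cfg_sock" = "none" ]; then
        echo "IdentityAgent is set to none: no agent is used"
    elif [ "$cfg_sock" = "$env_sock" ]; then
        echo "match"
    else
        echo "mismatch: environment=$env_sock config=$cfg_sock"
    fi
}

# Read the configured value only if ssh is installed.
configured_sock=""
if command -v ssh >/dev/null 2>&1; then
    configured_sock=$(ssh -G debug-gw.flex.ai 2>/dev/null | awk '/^identityagent/ {print $2}') || true
fi
compare_agent_sockets "${SSH_AUTH_SOCK:-}" "$configured_sock"
```

A mismatch means that keys loaded with ssh-add went to a different agent than the one ssh will use for the gateway host.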