A script for combining the power of tmux with TPU VMs. The script currently handles TPU-v3, TPU-v4 and TPU-v5. The main idea is to use tmux to execute identical commands on multiple VMs.
# Download ttconnect
wget https://raw.githubusercontent.com/peregilk/ttconnect/main/ttconnect
# Make the program executable
chmod a+x ttconnect
# Optionally copy it to a place in your path (like /usr/local/bin/)
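The optional copy step could look like the sketch below. The helper name `install_to_path` and the default target `$HOME/.local/bin` are assumptions made for illustration; neither is part of ttconnect, and any directory on your PATH works equally well.

```shell
# Hypothetical helper: copy a script into a directory that is on PATH.
install_to_path() {
  local src="$1"
  local dest_dir="${2:-$HOME/.local/bin}"

  # Refuse to continue if the source file does not exist.
  [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }

  # Create the target directory if needed and copy with mode 755.
  mkdir -p "$dest_dir" && install -m 755 "$src" "$dest_dir/"
}

# Example (writing to /usr/local/bin usually requires sudo):
#   install_to_path ./ttconnect "$HOME/.local/bin"
```

After this, `ttconnect` can be started from any directory, provided the chosen directory is on your PATH.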
The script is made for handling TPU pods. It will automatically open a tmux window with a tile for each of the TPU VMs, allowing them to be controlled both in parallel and individually.
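To make the idea concrete, here is a rough sketch of how one pane per worker can be set up with tmux. It is not the ttconnect implementation: the function name `tpu_pane_plan`, the explicit worker count and the use of the TPU name as the session name are all assumptions. The function only prints the commands, so it is safe to run without tmux or gcloud installed.

```shell
# Sketch only: print (do not run) the tmux commands for one pane per worker.
# tpu_pane_plan and its arguments are hypothetical, not part of ttconnect.
tpu_pane_plan() {
  local name="$1"
  local zone="${2:-us-central2-b}"
  local workers="${3:-4}"
  local ssh_cmd="gcloud alpha compute tpus tpu-vm ssh $name --zone=$zone"
  local i=1

  # Worker 0 gets the first pane of a new, detached session.
  echo "tmux new-session -d -s $name \"$ssh_cmd --worker=0\""

  while [ "$i" -lt "$workers" ]; do
    # Each further worker gets its own pane; re-tile so later splits still fit.
    echo "tmux split-window -t $name \"$ssh_cmd --worker=$i\""
    echo "tmux select-layout -t $name tiled"
    i=$((i + 1))
  done

  # Mirror keystrokes to every pane, then attach to the session.
  echo "tmux set-window-option -t $name synchronize-panes on"
  echo "tmux attach-session -t $name"
}

# Dry run for a hypothetical four-worker pod called my-tpu.
tpu_pane_plan my-tpu us-central2-b 4
```

Printing the plan instead of executing it keeps the sketch easy to inspect; the real script takes care of these steps for you.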
# Open a connection to an existing TPU VM or set of TPU VMs.
# If no zone is provided, it defaults to us-central2-b
./ttconnect TPU-name [zone]
This command opens connections to all the workers in a tmux window with split panes. A typical workspace for a v4-32 looks like this:
Depending on how many panes are open, it might be beneficial to change the layout mode. You can cycle through the five different layout modes with this command:
C-b <space>
The default setting is synchronized panes: whatever you type in one pane is typed in all the panes. However, if you want to make a change on only one of the TPUs, you can turn off this behaviour with:
C-b: setw synchronize-panes off
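The same option can also be set from a second local terminal, which is handy when you do not want to type tmux commands into panes that hold remote sessions. The sketch below assumes the tmux session is named `my-tpu`; check the real name with `tmux ls`. The helper `sync_panes` and the `TMUX_BIN` override are hypothetical and exist only so the command can be dry-run.

```shell
# Hypothetical helper: set synchronize-panes for a tmux window from the
# local shell. Set TMUX_BIN=echo to print the arguments instead of running tmux.
sync_panes() {
  local target="$1"
  local state="$2"
  case "$state" in
    on|off)
      "${TMUX_BIN:-tmux}" set-window-option -t "$target" synchronize-panes "$state"
      ;;
    *)
      echo "usage: sync_panes <session-or-window> on|off" >&2
      return 2
      ;;
  esac
}

# Dry run: show what would be sent for a session assumed to be named my-tpu.
TMUX_BIN=echo sync_panes my-tpu off
```

Without the `TMUX_BIN=echo` prefix, `sync_panes my-tpu on` switches synchronization back on for that session's current window.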
It might happen that one of the TPUs dies for some reason, and it might not be the one that is in focus. To target specific panes there are a few tricks that I like to use. First, you can always go to another pane using ctrl-b <arrow>. However, in many cases this pane is too small to work in. If you have multiple VMs running, the first step is to switch to the main-horizontal layout (see above). After you have done this, use the following command to see the id of each of the panes:
C-b q
When you know the id of the target pane, use the command below with N set to that id. It swaps the target pane with the current one:
C-b: swap-pane -t N
You can detach from the session with:
C-b d
However, if you really want to zap the entire window, you will have to do:
C-b: kill-window
You can then use ttconnect to connect to the same pod again with a fresh login.
In rare cases, some scripts crash. If you don't want to recreate the TPUs/VMs, this command is really useful:
gcloud alpha compute tpus tpu-vm ssh MyName --project=MyProject-11111 --zone=MyZone --worker=all --command="sudo pkill -9 python"
In some very rare cases, I have experienced stuck programs that still prevent the training scripts from restarting. This is my last trick:
gcloud alpha compute tpus tpu-vm ssh MyName --project=MyProject-11111 --zone=MyZone --worker=all --command="ps ax | grep python | grep -v grep | awk '{print \$1}' | xargs -r sudo kill -9"
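If you use these recovery commands often, they can be wrapped in small shell functions. This is only a sketch: the function names `tpu_run_all` and `tpu_kill_python` and the `GCLOUD_BIN` override are made up for illustration, while the gcloud arguments are the ones from the first command above.

```shell
# Hypothetical helper: run one command on every worker of a TPU VM or pod.
# Set GCLOUD_BIN=echo to print the arguments instead of calling gcloud.
tpu_run_all() {
  local name="$1"
  local project="$2"
  local zone="$3"
  local cmd="$4"
  "${GCLOUD_BIN:-gcloud}" alpha compute tpus tpu-vm ssh "$name" \
    --project="$project" --zone="$zone" --worker=all --command="$cmd"
}

# First resort from the text above: kill all Python processes on all workers.
tpu_kill_python() {
  tpu_run_all "$1" "$2" "$3" "sudo pkill -9 python"
}

# Dry run with the placeholder names used above.
GCLOUD_BIN=echo tpu_kill_python MyName MyProject-11111 MyZone
```

Passing the command as a single quoted argument keeps the quoting in one place, so the longer `ps ax | ...` pipeline can be sent through `tpu_run_all` in the same way.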
For more advanced use, please refer to the tmux documentation.
This is really just a tmux tip, but it seems like a lot of tmux users are simply not aware of its most useful feature. List all sessions:
C-b s
Feel free to modify the script and to add features. If you come up with improvements, I will be glad to add them to the script. Please send any comments to [email protected].