DeepSpeed Multi node
Romain Beaumont edited this page May 26, 2021 · 1 revision
Running DeepSpeed across multiple GPU nodes lets you increase the batch size by a factor of N, where N is the number of machines, and throughput (samples/s) increases almost linearly.
Detailed information is available at https://www.deepspeed.ai/getting-started/
For DALLE-pytorch in particular, add a --hostfile=deepspeed_host argument right after deepspeed on the command line. The deepspeed_host file should look like this:
my_machine1 slots=2
my_machine2 slots=2
The slots value is the number of GPUs available on that machine.
my_machine1 and my_machine2 must be machines you can connect to without a password, for example with ssh my_machine1.
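To sanity-check a hostfile before launching, the format above can be read with a few lines of Python. This is a minimal sketch, not DeepSpeed's own parser: the function name `parse_hostfile` and the per-GPU batch size of 8 are assumptions made for illustration only.

```python
from typing import Dict


def parse_hostfile(text: str) -> Dict[str, int]:
    """Parse 'hostname slots=N' lines into a {hostname: N} mapping."""
    hosts: Dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        name, slots = line.split()
        key, _, value = slots.partition("=")
        if key != "slots":
            raise ValueError(f"expected 'slots=N', got {slots!r}")
        hosts[name] = int(value)
    return hosts


# Same content as the deepspeed_host example above.
example = """my_machine1 slots=2
my_machine2 slots=2
"""

hosts = parse_hostfile(example)
total_gpus = sum(hosts.values())

# Hypothetical per-GPU batch size, used only to show the scaling arithmetic.
per_gpu_batch = 8
global_batch = per_gpu_batch * total_gpus

print(hosts)
print(total_gpus, global_batch)
```

With two machines of two GPUs each, the total is four GPUs, so the global batch is four times the per-GPU batch.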
Define each machine in ~/.ssh/config like this, where the Host value is the name you use in the hostfile:
Host gpu2
User youruser
HostName 1.2.3.4
Port 22
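Before starting a multi-node run, it can help to confirm that every host really accepts a connection without a password prompt. The sketch below is one possible check, assuming the OpenSSH client is installed; the helper names `ssh_check_command` and `can_ssh_without_password` are made up for this example. It relies on the OpenSSH option `BatchMode=yes`, which makes ssh fail instead of asking for a password.

```python
import subprocess
from typing import List


def ssh_check_command(host: str) -> List[str]:
    """Build an ssh command that fails instead of prompting for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def can_ssh_without_password(host: str) -> bool:
    """Return True if 'ssh host true' succeeds non-interactively."""
    try:
        result = subprocess.run(
            ssh_check_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the host did not answer in time.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Host names taken from the hostfile example; replace with your own.
    for host in ["my_machine1", "my_machine2"]:
        status = "ok" if can_ssh_without_password(host) else "FAILED"
        print(host, status)
```

Any host reported as FAILED needs its key or ~/.ssh/config entry fixed before DeepSpeed can reach it.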
If you use a virtual environment, add source .env/bin/activate to the bashrc file of each machine.