HabanaAI · MohitIntel · Mar 26, 2024
diff --git a/utils/README.md b/utils/README.md
@@ -102,3 +102,7 @@ optional arguments:
                         ML Framework to test (default: pytorch)
 
 ```
+
+## Multi Server SSH setup
+
+Refer to the README in the `multi_server` folder for details on how to set up multiple Docker containers for SSH communication in a cluster.
diff --git a/utils/multi_server/README.md b/utils/multi_server/README.md
@@ -0,0 +1,81 @@
+## Enabling SSH Communication Between Docker Containers in a Cluster
+
+This guide outlines a method to enable multiple Docker containers in a cluster to communicate with each other using SSH. 
+This setup is a prerequisite for running a training workload on multiple servers together, which helps reduce overall training times. 
+Following these steps will allow you to quickly set up containers in a semi-automated way with minimal effort.
+
+### Prerequisites
+
+* A common folder accessible by all containers. The scripts assume `/root/common`; change `COMMON_PATH` in both scripts if yours differs.
+* An empty file named `nodes.txt` within the common folder.
+
+### Steps
+
+1. **Prepare the Shared Folder (IMPORTANT):**
+   - Create a shared folder accessible by all Docker containers in the cluster.
+   - Inside this folder, create an empty file named `nodes.txt`.
+
+   **Example:**
+
+   ```
+   /root/common/nodes.txt
+   ```
+
+2. **Populate `nodes.txt` with all the containers' IP addresses:**
+   - To obtain a container's IP address, run the following command inside the container:
+
+     ```bash
+     ip route get 1 | awk '{for (i = 1; i < NF; i++) if ($i == "src") print $(i + 1)}'
+     ```
+
+   - Edit the `nodes.txt` file and add the IP address of each container you intend to use in the cluster, one per line.
+
+   **Example `nodes.txt` content:**
+
+   ```
+   172.17.0.1
+   172.17.0.2
+   172.17.0.3
+   ```
+
+
+
+3. **Collect SSH Keys (run once per container):**
+   - In **each** Docker container, execute the following command to collect SSH keys:
+
+     ```bash
+     bash collect_ssh_keys.sh
+     ```
+
+4. **Distribute SSH Keys (run once per container):**
+   - Proceed with this step only after completing step 3 on **each** container, so that the shared `public_keys` file already contains every container's key.
+   - In **each** Docker container, execute the following command to distribute SSH keys:
+
+     ```bash
+     bash distribute_ssh_keys.sh
+     ```
+
+5. **Verify SSH Communication:**
+   - After running the previous steps, the containers should be able to communicate with each other using SSH.
+   - Use the `lsof -i` command to verify that `sshd` is listening on the selected SSH port (4022 by default):
+
+     ```bash
+     lsof -i
+     ```
+
+   A successful output will look similar to this:
+
+     ```
+     COMMAND PID USER   FD  TYPE DEVICE SIZE/OFF NODE NAME
+     sshd    74 root    3u  IPv4 66262911     0t0  TCP *:4022 (LISTEN)
+     sshd    74 root    4u  IPv6 66262913     0t0  TCP *:4022 (LISTEN)
+     ```
+
+   - If the port is not listening, set the `PORT_ID` variable in both `collect_ssh_keys.sh` and `distribute_ssh_keys.sh` to a different port (e.g., 5022) on every container, then repeat steps 3 and 4.
+
+6. **Run Multi-Server Training:**
+   - Refer to the following guide for instructions on running multi-server training commands on each worker node:
+
+     ```
+     https://github.com/HabanaAI/Model-References/blob/master/PyTorch/generative_models/stable-diffusion/README.md#multi-server-training-examples
+     ```
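Review note on step 5: `lsof -i` only shows that `sshd` is listening locally. A minimal sketch of an end-to-end check follows, assuming the same `PORT_ID` and `COMMON_PATH` defaults as the scripts in this PR and a `root` login. The helper names `list_nodes` and `check_nodes` are hypothetical and not part of the PR.

```bash
#!/bin/bash
# Hypothetical helpers, not part of this PR.
PORT_ID=${PORT_ID:-4022}
COMMON_PATH=${COMMON_PATH:-/root/common}

# Print the usable entries of a nodes file:
# drop '#' comments, strip whitespace, skip blank lines.
list_nodes() {
    sed -e 's/#.*//' -e 's/[[:space:]]//g' -e '/^$/d' "$1"
}

# Try a key-based login on every listed node and report the result.
# Assumes the root user, as the scripts in this PR do.
check_nodes() {
    failed=0
    for node in $(list_nodes "$1"); do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ssh -n -p "$PORT_ID" -o BatchMode=yes -o ConnectTimeout=5 "root@${node}" true; then
            echo "OK   ${node}"
        else
            echo "FAIL ${node}"
            failed=1
        fi
    done
    return "$failed"
}
```

Calling `check_nodes "${COMMON_PATH}/nodes.txt"` inside a container after step 4 reports each node as `OK` or `FAIL` and returns non-zero if any login failed.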
diff --git a/utils/multi_server/collect_ssh_keys.sh b/utils/multi_server/collect_ssh_keys.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+PORT_ID=4022
+COMMON_PATH=/root/common
+sed -i -E "s/^#?Port [0-9]+/Port ${PORT_ID}/" /etc/ssh/sshd_config
+sed -i -E "s/^[# ] *Port [0-9]+/    Port ${PORT_ID}/" /etc/ssh/ssh_config
+sed -i -E "s/^#?PermitRootLogin .*/PermitRootLogin yes/" /etc/ssh/sshd_config
+service ssh restart
+
+# Generate SSH key if it doesn't exist
+if [ ! -f ~/.ssh/id_rsa ]; then
+    mkdir -p ~/.ssh
+    cd ~/.ssh
+    ssh-keygen -t rsa -b 4096 -f id_rsa -q -N ""
+fi
+
+# Append public key and make it accessible to all nodes
+cat ~/.ssh/id_rsa.pub >> "${COMMON_PATH}/public_keys"
+echo "Public key appended to ${COMMON_PATH}/public_keys ..."
+echo "Exiting ..."
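Review note on the port rewrite: a `sed` pattern that matches only the commented-out default (`#Port 22`) stops matching once the line has been rewritten, which matters when the script is re-run with a different `PORT_ID`. A minimal sketch of a pattern that matches both forms is shown below. `set_sshd_port` is a hypothetical helper name, and it writes through a temporary file rather than using `sed -i`.

```bash
#!/bin/bash
# Hypothetical helper, not part of this PR:
# set the Port directive in an sshd_config-style file.
set_sshd_port() {
    file=$1
    port=$2
    tmp=$(mktemp) || return 1
    # Match the commented default ("#Port 22") as well as an
    # already edited line ("Port 4022"), so re-runs still apply.
    sed -E "s/^#?Port [0-9]+/Port ${port}/" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, `set_sshd_port /etc/ssh/sshd_config 5022` followed by `service ssh restart` would switch a container that was previously configured for port 4022.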
diff --git a/utils/multi_server/distribute_ssh_keys.sh b/utils/multi_server/distribute_ssh_keys.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+PORT_ID=4022
+COMMON_PATH=/root/common
+# Append public keys from all nodes to authorized_keys
+cat "${COMMON_PATH}/public_keys" >> ~/.ssh/authorized_keys
+service ssh restart
+while IFS= read -r node; do
+    [ -n "${node}" ] && ssh-keyscan -p "${PORT_ID}" -H "${node}" >> ~/.ssh/known_hosts
+done < "${COMMON_PATH}/nodes.txt"
+#service ssh restart
+echo "Added shared public_keys to authorized_keys and IPs to known_hosts ..."
+echo "Checking Port status by running command 'lsof -i' ..."
+lsof -i
+echo "Exiting ..."
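Review note on key distribution: appending a shared key file with `>>` adds the same keys again each time the step is repeated. A minimal sketch of a merge that keeps each key only once is shown below. `merge_keys` is a hypothetical helper name, and the paths in the usage line are the defaults from this PR.

```bash
#!/bin/bash
# Hypothetical helper, not part of this PR:
# merge the keys in a shared file into an authorized_keys file
# without creating duplicate lines.
merge_keys() {
    shared=$1
    auth=$2
    touch "$auth" || return 1
    tmp=$(mktemp) || return 1
    # Keep the first occurrence of every non-empty line, existing keys first.
    awk 'NF && !seen[$0]++' "$auth" "$shared" > "$tmp" && cat "$tmp" > "$auth"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

A call such as `merge_keys /root/common/public_keys ~/.ssh/authorized_keys` can be repeated safely, because keys already present are left in place and not added again.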
diff --git a/utils/multi_server/nodes.txt b/utils/multi_server/nodes.txt
@@ -0,0 +1,2 @@
+192.168.99.104
+192.168.99.106