diff --git a/utils/README.md b/utils/README.md
index c7285ab..20b6cc8 100644
--- a/utils/README.md
+++ b/utils/README.md
@@ -102,3 +102,7 @@ optional arguments:
 ML Framework to test (default: pytorch)
 ```
 
+
+## Multi Server SSH setup
+
+Refer to the README in the multi_server folder for details on how to setup multiple docker containers for SSH communication in a cluster.
\ No newline at end of file
diff --git a/utils/multi_server/README.md b/utils/multi_server/README.md
new file mode 100644
index 0000000..ac66db9
--- /dev/null
+++ b/utils/multi_server/README.md
@@ -0,0 +1,81 @@
+## Enabling SSH Communication Between Docker Containers in a Cluster
+
+This guide outlines a method to enable multiple Docker containers in a cluster to communicate with each other using SSH.
+This setup is a prerequisite for running a training workload on multiple servers together, which helps reduce overall training times.
+Following these steps will allow you to quickly set up containers in a semi-automated way with minimal effort.
+
+### Prerequisites
+
+* A common folder accessible by all containers.
+* An empty file named `nodes.txt` within the common folder.
+
+### Steps
+
+1. **Prepare the Shared Folder (IMPORTANT):**
+   - Create a shared folder accessible by all Docker containers in the cluster.
+   - Inside this folder, create an empty file named `nodes.txt`.
+
+   **Example:**
+
+   ```
+   /root/common/nodes.txt
+   ```
+
+2. **Populate `nodes.txt` with all the containers' IP addresses:**
+   - To obtain a container's IP address, run the following command inside the container:
+
+     ```bash
+     ip route get 1 | awk '{print $7}'
+     ```
+
+   - Edit the `nodes.txt` file and add the IP address of each container you intend to use in the cluster, one per line.
+
+   **Example `nodes.txt` content:**
+
+   ```
+   172.17.0.1
+   172.17.0.2
+   172.17.0.3
+   ```
+
+
+
+3. **Collect SSH Keys (Run Once per container):**
+   - In **each** Docker container, execute the following command to collect SSH keys:
+
+     ```bash
+     bash collect_ssh_keys.sh
+     ```
+
+4. **Distribute SSH Keys (Run Once per container):**
+   - Proceed with this step only after having completed STEP 3 on **each** container.
+   - In **each** Docker container, execute the following command to distribute SSH keys:
+
+     ```bash
+     bash distribute_ssh_keys.sh
+     ```
+
+5. **Verify SSH Communication:**
+   - After running the previous steps, the containers should be able to communicate with each other using SSH.
+   - Use the `lsof -i` command to verify if the selected SSH port (usually 4022) is listening:
+
+     ```bash
+     lsof -i
+     ```
+
+     A successful output will look similar to this:
+
+     ```
+     COMMAND PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
+     sshd     74 root    3u  IPv4 66262911      0t0  TCP *:4022 (LISTEN)
+     sshd     74 root    4u  IPv6 66262913      0t0  TCP *:4022 (LISTEN)
+     ```
+
+   - If the port is not listening, try changing the `PORT_ID` variable in both `collect_ssh_keys.sh` and `distribute_ssh_keys.sh` to a different port (e.g., 5022) and repeat STEPS 3 and 4.
+
+6. **Run Multi-Server Training:**
+   - Refer to the following guide for instructions on running multi-server training commands on each worker node:
+
+     ```
+     https://github.com/HabanaAI/Model-References/blob/master/PyTorch/generative_models/stable-diffusion/README.md#multi-server-training-examples
+     ```
diff --git a/utils/multi_server/collect_ssh_keys.sh b/utils/multi_server/collect_ssh_keys.sh
new file mode 100755
index 0000000..69b4436
--- /dev/null
+++ b/utils/multi_server/collect_ssh_keys.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+PORT_ID=4022
+COMMON_PATH=/root/common
+sed -i "s/#Port 22/Port ${PORT_ID}/g" /etc/ssh/sshd_config
+sed -i "s/# Port 22/ Port ${PORT_ID}/g" /etc/ssh/ssh_config
+sed -i "s/#PermitRootLogin prohibit-password/PermitRootLogin yes/" /etc/ssh/sshd_config
+service ssh restart
+
+# Generate SSH key if it doesn't exist
+if [ ! -f ~/.ssh/id_rsa ]; then
+    mkdir -p ~/.ssh
+    cd ~/.ssh
+    ssh-keygen -t rsa -b 4096 -f id_rsa -q -N ""
+fi
+
+# Append public key and make it accessible to all nodes
+cat ~/.ssh/id_rsa.pub >> "${COMMON_PATH}/public_keys"
+echo "Public key appended to ${COMMON_PATH}/public_keys ..."
+echo "Exiting ..."
\ No newline at end of file
diff --git a/utils/multi_server/distribute_ssh_keys.sh b/utils/multi_server/distribute_ssh_keys.sh
new file mode 100755
index 0000000..abfd7db
--- /dev/null
+++ b/utils/multi_server/distribute_ssh_keys.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+PORT_ID=4022
+COMMON_PATH=/root/common
+# Append public keys from all nodes to authorized_keys
+cat "${COMMON_PATH}/public_keys" >> ~/.ssh/authorized_keys
+service ssh restart
+while read node; do
+    ssh-keyscan -p ${PORT_ID} -H ${node} >> ~/.ssh/known_hosts
+done < "${COMMON_PATH}/nodes.txt"
+#service ssh restart
+echo "Added shared public_keys to authorized_keys and IPs to known_hosts ..."
+echo "Checking Port status by running command 'lsof -i' ..."
+lsof -i
+echo "Exiting ..."
\ No newline at end of file
diff --git a/utils/multi_server/nodes.txt b/utils/multi_server/nodes.txt
new file mode 100644
index 0000000..4b18832
--- /dev/null
+++ b/utils/multi_server/nodes.txt
@@ -0,0 +1,2 @@
+192.168.99.104
+192.168.99.106
\ No newline at end of file
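
Review note on the `nodes.txt` loop in `distribute_ssh_keys.sh`: a plain `while read node` loop does not run its body for a final line that has no trailing newline, and the sample `nodes.txt` in this patch ends without one. If a node file of that shape is copied into the shared folder, the last IP address would likely be skipped by `ssh-keyscan`. The sketch below shows one possible way to read the file so that the last entry is kept and blank lines are ignored. The function name `list_nodes` is hypothetical and not part of the patch.

```shell
# Hypothetical helper, not part of the patch: print each usable entry of a
# nodes file, one per line.
list_nodes() {
  # The "|| [ -n "$node" ]" clause keeps a final line that has no trailing
  # newline, which a bare "while read" loop would silently drop.
  while IFS= read -r node || [ -n "$node" ]; do
    [ -z "$node" ] && continue   # skip blank lines
    printf '%s\n' "$node"
  done < "$1"
}
```

Assuming this helper were added to the script, the existing loop could consume its output, for example `list_nodes "${COMMON_PATH}/nodes.txt" | while IFS= read -r node; do ...; done`, with the `ssh-keyscan` call left unchanged inside the loop body.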