Multi Server SSH communication #7

Open
wants to merge 1 commit into
base: main
4 changes: 4 additions & 0 deletions utils/README.md
@@ -102,3 +102,7 @@ optional arguments:
ML Framework to test (default: pytorch)

```

## Multi Server SSH setup

Refer to the README in the `multi_server` folder for details on how to set up multiple Docker containers for SSH communication in a cluster.
81 changes: 81 additions & 0 deletions utils/multi_server/README.md
@@ -0,0 +1,81 @@
## Enabling SSH Communication Between Docker Containers in a Cluster

This guide describes how to enable multiple Docker containers in a cluster to communicate with each other using SSH.
This setup is a prerequisite for running a training workload across multiple servers, which can reduce overall training time.
The process is semi-automated: two helper scripts, `collect_ssh_keys.sh` and `distribute_ssh_keys.sh`, handle the SSH configuration and key exchange.

### Prerequisites

* A common folder accessible by all containers.
* An empty file named `nodes.txt` within the common folder.

### Steps

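1. **Prepare the Shared Folder (IMPORTANT):**
- Create a shared folder accessible by all Docker containers in the cluster. Both scripts default to `/root/common` (the `COMMON_PATH` variable); if you use a different location, update `COMMON_PATH` in both scripts.
- Inside this folder, create an empty file named `nodes.txt`.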

**Example:**

```
/root/common/nodes.txt
```
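
A minimal sketch of this preparation step, assuming the default location `/root/common`. The function name `init_common_dir` is hypothetical and not part of this PR, and how the folder is actually shared between containers (for example, a bind-mounted network file system) depends on your environment.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): create the shared folder and an
# empty nodes.txt if they do not exist yet. Existing content is left untouched.
init_common_dir() {
    local common_path="${1:-/root/common}"
    mkdir -p "${common_path}" || return 1
    # touch creates the file only if it is missing; it never truncates it.
    touch "${common_path}/nodes.txt"
}
```

Calling `init_common_dir /root/common` once, from any container that sees the shared folder, should be enough.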

2. **Populate `nodes.txt` with all the containers' IP addresses:**
- To obtain a container's IP address, run the following command inside the container:

```bash
ip route get 1 | awk '{for (i = 1; i < NF; i++) if ($i == "src") {print $(i + 1); exit}}'
```

- Edit the `nodes.txt` file and add the IP address of each container you intend to use in the cluster, one per line.

**Example `nodes.txt` content:**

```
172.17.0.1
172.17.0.2
172.17.0.3
```
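
Because the later steps read `nodes.txt` line by line, a typo in this file only surfaces when a key scan fails. The sketch below is one way to sanity-check the file first; `validate_nodes_file` is a hypothetical helper, not part of this PR, and it assumes the file contains only IPv4 addresses as in the example above.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): check that every non-empty line of
# a nodes file is a dotted-quad IPv4 address and that no address is repeated.
# Prints each problem to stderr and returns non-zero if any was found.
validate_nodes_file() {
    local nodes_file="$1" line octet status=0
    local ipv4_re='^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
    local -a octets
    local seen=" "

    # The "|| [ -n ... ]" part also processes a last line without a newline.
    while IFS= read -r line || [ -n "${line}" ]; do
        if [ -z "${line}" ]; then
            continue
        fi
        if [[ ! "${line}" =~ ${ipv4_re} ]]; then
            echo "not an IPv4 address: ${line}" >&2
            status=1
            continue
        fi
        IFS=. read -r -a octets <<< "${line}"
        for octet in "${octets[@]}"; do
            # 10# forces base 10 so that a leading zero is not read as octal.
            if (( 10#${octet} > 255 )); then
                echo "octet out of range: ${line}" >&2
                status=1
            fi
        done
        if [[ "${seen}" == *" ${line} "* ]]; then
            echo "duplicate entry: ${line}" >&2
            status=1
        fi
        seen+="${line} "
    done < "${nodes_file}"

    return "${status}"
}
```

For example, `validate_nodes_file /root/common/nodes.txt && echo "nodes.txt looks fine"`.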



3. **Collect SSH Keys (Run Once per container):**
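- In **each** Docker container, execute the following command. It sets the SSH port to `PORT_ID`, generates an RSA key pair if none exists, and appends the container's public key to `public_keys` in the shared folder: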

```bash
bash collect_ssh_keys.sh
```

4. **Distribute SSH Keys (Run Once per container):**
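- Proceed only after Step 3 has completed on **each** container, because this step reads the keys that Step 3 collected.
- In **each** Docker container, execute the following command. It appends all collected public keys to `~/.ssh/authorized_keys` and adds the host key of every node in `nodes.txt` to `~/.ssh/known_hosts`: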

```bash
bash distribute_ssh_keys.sh
```
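
Since Step 4 depends on Step 3 having finished everywhere, a rough pre-check can be run before the distribution command. The sketch below compares the number of distinct keys in `public_keys` with the number of distinct entries in `nodes.txt`; `all_keys_collected` is a hypothetical helper, not part of this PR, and it cannot tell which container a key came from.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): succeed only when the shared
# public_keys file holds at least one distinct key per distinct node listed in
# nodes.txt, i.e. when Step 3 has apparently finished on every container.
all_keys_collected() {
    local common_path="${1:-/root/common}"
    local node_count key_count

    # sort -u drops repeated lines; grep -c . counts the non-empty ones.
    # "|| true" keeps a count of 0 from being treated as a command failure.
    node_count="$(sort -u "${common_path}/nodes.txt" 2>/dev/null | grep -c . || true)"
    key_count="$(sort -u "${common_path}/public_keys" 2>/dev/null | grep -c . || true)"

    echo "nodes: ${node_count}, distinct public keys: ${key_count}"
    [ "${node_count}" -gt 0 ] && [ "${key_count}" -ge "${node_count}" ]
}
```

For example, `all_keys_collected /root/common && bash distribute_ssh_keys.sh` runs the distribution only when the counts match.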

5. **Verify SSH Communication:**
- After running the previous steps, the containers should be able to communicate with each other using SSH.
- Use the `lsof -i` command to verify that `sshd` is listening on the selected SSH port (4022 by default):

```bash
lsof -i
```

A successful output will look similar to this:

```
COMMAND PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
sshd     74 root    3u  IPv4 66262911      0t0  TCP *:4022 (LISTEN)
sshd     74 root    4u  IPv6 66262913      0t0  TCP *:4022 (LISTEN)
```

- If the port is not listening, try changing the `PORT_ID` variable in both `collect_ssh_keys.sh` and `distribute_ssh_keys.sh` to a different port (e.g., 5022) and repeat STEPS 3 and 4.
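
A listening port shows that `sshd` is up, but not that key-based login works between containers. The sketch below attempts a non-interactive login to every node in `nodes.txt`. `check_ssh_nodes` is a hypothetical helper, not part of this PR; it assumes logins as `root` (the scripts enable `PermitRootLogin`) and the default port 4022.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): try a non-interactive SSH login to
# every node in a nodes file and report which ones are reachable.
# Returns non-zero if at least one node could not be reached.
check_ssh_nodes() {
    local nodes_file="${1:-/root/common/nodes.txt}"
    local port="${2:-4022}"
    local node failed=0

    while IFS= read -r node || [ -n "${node}" ]; do
        if [ -z "${node}" ]; then
            continue
        fi
        # -n keeps ssh from consuming the rest of nodes.txt on stdin;
        # BatchMode makes ssh fail instead of prompting for a password.
        if ssh -n -p "${port}" -o BatchMode=yes -o ConnectTimeout=5 \
            "root@${node}" true; then
            echo "OK   ${node}"
        else
            echo "FAIL ${node}"
            failed=1
        fi
    done < "${nodes_file}"

    return "${failed}"
}
```

Running `check_ssh_nodes /root/common/nodes.txt 4022` inside each container should print one `OK` line per node once Steps 3 and 4 have completed everywhere.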

6. **Run Multi-Server Training:**
- Refer to the following guide for instructions on running multi-server training commands on each worker node:

```
https://github.com/HabanaAI/Model-References/blob/master/PyTorch/generative_models/stable-diffusion/README.md#multi-server-training-examples
```
19 changes: 19 additions & 0 deletions utils/multi_server/collect_ssh_keys.sh
@@ -0,0 +1,19 @@
#!/bin/bash
PORT_ID=4022
COMMON_PATH=/root/common
# Set the SSH server and client port; an already-set port is matched too, so reruns with a new PORT_ID take effect
sed -i -E "s/^#?Port [0-9]+/Port ${PORT_ID}/" /etc/ssh/sshd_config
sed -i -E "s/^#?([[:space:]]*)Port [0-9]+/\1Port ${PORT_ID}/" /etc/ssh/ssh_config
sed -i "s/#PermitRootLogin prohibit-password/PermitRootLogin yes/" /etc/ssh/sshd_config
service ssh restart

# Generate SSH key if it doesn't exist
if [ ! -f ~/.ssh/id_rsa ]; then
    mkdir -p ~/.ssh
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -q -N ""
fi

# Append public key and make it accessible to all nodes
cat ~/.ssh/id_rsa.pub >> "${COMMON_PATH}/public_keys"
echo "Public key appended to ${COMMON_PATH}/public_keys ..."
echo "Exiting ..."
14 changes: 14 additions & 0 deletions utils/multi_server/distribute_ssh_keys.sh
@@ -0,0 +1,14 @@
#!/bin/bash
PORT_ID=4022
COMMON_PATH=/root/common
# Append public keys from all nodes to authorized_keys
cat "${COMMON_PATH}/public_keys" >> ~/.ssh/authorized_keys
service ssh restart
# Record each node's host key; the "|| [ -n ... ]" test also handles a last line without a trailing newline
while IFS= read -r node || [ -n "${node}" ]; do
    [ -n "${node}" ] && ssh-keyscan -p "${PORT_ID}" -H "${node}" >> ~/.ssh/known_hosts
done < "${COMMON_PATH}/nodes.txt"
echo "Added shared public_keys to authorized_keys and IPs to known_hosts ..."
echo "Checking Port status by running command 'lsof -i' ..."
lsof -i
echo "Exiting ..."
2 changes: 2 additions & 0 deletions utils/multi_server/nodes.txt
@@ -0,0 +1,2 @@
192.168.99.104
192.168.99.106