Enhance Multi-Node NCCL Testing with Torch C10D Gloo Framework #243

Open
wants to merge 2 commits into master

Conversation

@hexinw commented Jul 29, 2024

This patch introduces support for running multi-process, multi-node NCCL tests using the Torch c10d Gloo distributed framework.

Previously, running multi-node NCCL tests required MPI, which relies on SSH or Kubexec (in Kubernetes) to access worker nodes. This setup posed deployment and security challenges due to the need for maintaining SSH keys or Kubexec RBAC policies.

With the introduction of C10D Gloo, worker nodes now communicate with the master node over TCP transport. This simplifies the process, making it similar to running multi-node PyTorch training jobs. Users only need to set the following environment variables to start the test:

  • MASTER_ADDR
  • RANK
  • WORLD_SIZE

Dependencies

PyTorch C++ APIs and libraries are required. Download LibTorch with the following commands:

```
cd /tmp/
wget https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip
unzip libtorch-shared-with-deps-latest.zip
sudo mv libtorch /usr/local/
```

Build instructions

To build the NCCL test binaries supporting both MPI and C10D Gloo, use:

```
MPI=1 GLOO=1 make
```

Usage

Run a Single 8-GPU Node NCCL Test:

1. Set environment variables:

   ```
   export NCCL_TOPO_FILE=<topo_file_location>
   export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
   ```

2. Execute the test:

   ```
   #!/bin/bash

   for i in {0..7}; do
     MASTER_ADDR=localhost RANK=$i WORLD_SIZE=8 ./all_reduce_perf -b1G -e2G -f2 -t1 -g1 &
   done

   wait
   ```

Run a Two-Node NCCL Test:

Node 1:

1. Set environment variables:

   ```
   export NCCL_TOPO_FILE=<topo_file_location>
   export MASTER_ADDR=<master_node_ip_address>
   export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
   ```

2. Execute the test:

   ```
   RANK=0 WORLD_SIZE=2 /tmp/all_reduce_perf -b1G -e2G -f2 -t1 -g8
   ```

Node 2:

1. Set environment variables:

   ```
   export NCCL_TOPO_FILE=<topo_file_location>
   export MASTER_ADDR=<master_node_ip_address>
   export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
   ```

2. Execute the test:

   ```
   RANK=1 WORLD_SIZE=2 /tmp/all_reduce_perf -b1G -e2G -f2 -t1 -g8
   ```

```
-NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11
-CXXFLAGS := -std=c++11
+NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++17
+CXXFLAGS := -std=c++17
```
Collaborator

I don't think we can force all users to move to c++17 just for this feature.

Author

Agreed. I can make the C++17 requirement conditional so it only applies when building with GLOO=1.
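
For illustration, a minimal sketch of what that could look like in the Makefile, assuming the same GLOO=1 switch used at build time; the CXXSTD helper variable is illustrative, not the final patch:

```
# Hypothetical sketch: keep C++11 as the default and require C++17 only
# when Gloo support is enabled with GLOO=1.
ifeq ($(GLOO), 1)
CXXSTD := -std=c++17
else
CXXSTD := -std=c++11
endif
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) $(CXXSTD)
CXXFLAGS := $(CXXSTD)
```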

```
#ifdef MPI_SUPPORT
MPI_Barrier(MPI_COMM_WORLD);
#endif
if (!use_c10d_gloo) {
```
Collaborator

I don't understand why we need a boolean and these new if statements. We normally build separate binaries for single-node, and then MPI=1 builds for multi-node. I expected we'd have to build standalone, MPI=1, and GLOO=1 binaries.

Author

This boolean enforces that only one transport is picked at run time if a user ever builds a single binary with both MPI=1 and GLOO=1.
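
For context, a minimal sketch of that run-time selection, assuming the choice is keyed off the torchrun-style environment variables; glooBarrier is a hypothetical stand-in for the ProcessGroupGloo barrier call, not the actual patch:

```
// Hypothetical sketch: pick exactly one transport at run time when a single
// binary is built with both MPI=1 and GLOO=1.
#include <cstdlib>
#ifdef MPI_SUPPORT
#include <mpi.h>
#endif

// Stand-in for the c10d::ProcessGroupGloo-based barrier used by the patch.
void glooBarrier();

static bool use_c10d_gloo = false;

void initTransport() {
  // Assumption: Gloo is requested when MASTER_ADDR/RANK/WORLD_SIZE are set.
  use_c10d_gloo = std::getenv("MASTER_ADDR") && std::getenv("RANK") &&
                  std::getenv("WORLD_SIZE");
}

void barrierAllRanks() {
  if (use_c10d_gloo) {
    glooBarrier();                 // Gloo path
    return;
  }
#ifdef MPI_SUPPORT
  MPI_Barrier(MPI_COMM_WORLD);     // MPI path, only if compiled with MPI=1
#endif
}
```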

src/common.cu Outdated
```
auto options = c10d::ProcessGroupGloo::Options::create();
// Create Gloo device that binds to any interface.
::gloo::transport::tcp::attr tcp_attr;
tcp_attr.iface = "eth0";
```
Member

Why is the interface name hardcoded?

Author

Good catch. I will make it configurable via an environment variable. Thanks.

Use "GLOO_INTERFACE" env to specify the network interface.