Enhance Multi-Node NCCL Testing with Torch C10D Gloo Framework #243

Open
wants to merge 2 commits into master

Conversation

@hexinw commented Jul 29, 2024

This patch introduces support for running multi-process, multi-node NCCL tests using the Torch c10d Gloo distributed framework.

Previously, running multi-node NCCL tests required MPI, which relies on SSH or Kubexec (in Kubernetes) to access worker nodes. This setup posed deployment and security challenges due to the need for maintaining SSH keys or Kubexec RBAC policies.

With the introduction of C10D Gloo, worker nodes now communicate with the master node over TCP transport. This simplifies the process, making it similar to running multi-node PyTorch training jobs. Users only need to set the following environment variables to start the test:

  • MASTER_ADDR
  • RANK
  • WORLD_SIZE

Dependencies

PyTorch C++ APIs and libraries are required. Download LibTorch with the following commands:

```
cd /tmp/
wget https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip
unzip libtorch-shared-with-deps-latest.zip
sudo mv libtorch /usr/local/
```

Build instructions

To build the NCCL test binaries supporting both MPI and C10D Gloo, use:

```
MPI=1 GLOO=1 make
```

Usage

Run a Single 8-GPU Node NCCL Test:

1. Set environment variables:

   ```
   export NCCL_TOPO_FILE=<topo_file_location>
   export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
   ```

2. Execute the test:

   ```
   #!/bin/bash

   for i in {0..7}; do
     MASTER_ADDR=localhost RANK=$i WORLD_SIZE=8 ./all_reduce_perf -b1G -e2G -f2 -t1 -g1 &
   done

   wait
   ```

Run a Two-Node NCCL Test:

Node 1:

1. Set environment variables:

   ```
   export NCCL_TOPO_FILE=<topo_file_location>
   export MASTER_ADDR=<master_node_ip_address>
   export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
   ```

2. Execute the test:

   ```
   RANK=0 WORLD_SIZE=2 /tmp/all_reduce_perf -b1G -e2G -f2 -t1 -g8
   ```

Node 2:

1. Set environment variables:

   ```
   export NCCL_TOPO_FILE=<topo_file_location>
   export MASTER_ADDR=<master_node_ip_address>
   export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
   ```

2. Execute the test:

   ```
   RANK=1 WORLD_SIZE=2 /tmp/all_reduce_perf -b1G -e2G -f2 -t1 -g8
   ```

```
-NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11
-CXXFLAGS := -std=c++11
+NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++17
+CXXFLAGS := -std=c++17
```
Collaborator

I don't think we can force all users to move to c++17 just for this feature.

Author

Agreed. I can make the C++17 requirement conditional so it only applies when building with GLOO=1.
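
For illustration, a minimal sketch of what that could look like in the Makefile, assuming the same GLOO=1 switch used at build time; the CXXSTD helper variable is illustrative, not the final patch:

```
# Hypothetical sketch: keep C++11 as the default and require C++17 only
# when Gloo support is enabled with GLOO=1.
ifeq ($(GLOO), 1)
CXXSTD := -std=c++17
else
CXXSTD := -std=c++11
endif
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) $(CXXSTD)
CXXFLAGS := $(CXXSTD)
```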

```
#ifdef MPI_SUPPORT
MPI_Barrier(MPI_COMM_WORLD);
#endif
if (!use_c10d_gloo) {
```
Collaborator

I don't understand why we need a boolean and these new if statements. We normally build separate binaries for single-node, and then MPI=1 builds for multi-node. I expected we'd have to build standalone, MPI=1, and GLOO=1 binaries.

Author

This boolean enforces that only one transport is picked at run time if a user ever builds a single binary with both MPI=1 and GLOO=1.
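
For context, a minimal sketch of that run-time selection, assuming the choice is keyed off the torchrun-style environment variables; glooBarrier is a hypothetical stand-in for the ProcessGroupGloo barrier call, not the actual patch:

```
// Hypothetical sketch: pick exactly one transport at run time when a single
// binary is built with both MPI=1 and GLOO=1.
#include <cstdlib>
#ifdef MPI_SUPPORT
#include <mpi.h>
#endif

// Stand-in for the c10d::ProcessGroupGloo-based barrier used by the patch.
void glooBarrier();

static bool use_c10d_gloo = false;

void initTransport() {
  // Assumption: Gloo is requested when MASTER_ADDR/RANK/WORLD_SIZE are set.
  use_c10d_gloo = std::getenv("MASTER_ADDR") && std::getenv("RANK") &&
                  std::getenv("WORLD_SIZE");
}

void barrierAllRanks() {
  if (use_c10d_gloo) {
    glooBarrier();                 // Gloo path
    return;
  }
#ifdef MPI_SUPPORT
  MPI_Barrier(MPI_COMM_WORLD);     // MPI path, only if compiled with MPI=1
#endif
}
```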

src/common.cu Outdated
```
auto options = c10d::ProcessGroupGloo::Options::create();
// Create Gloo device that binds to any interface.
::gloo::transport::tcp::attr tcp_attr;
tcp_attr.iface = "eth0";
```
Member

Why is the interface name hardcoded?

Author

Good catch. I will make it configurable via an environment variable. Thanks.

Use "GLOO_INTERFACE" env to specify the network interface.