# BEGIN_ICS_COPYRIGHT8 ****************************************
#
# Copyright (c) 2021, Cornelis Networks
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of Cornelis Networks nor the names of its contributors
# may be used to endorse or promote products derived from this software
# without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# END_ICS_COPYRIGHT8 ****************************************
#[ICS VERSION STRING: unknown]
README Contents:
- DEPENDENCIES
- BUILDING
- MAKE VARIABLES
- RUNNING
- PSM2-NCCL PLUGIN GENERAL USAGE NOTES
  - GPUDirect Support
  - PSM2-NCCL and PSM2 Multi-Endpoint
  - PSM2 Endpoint Counts
  - HFI Selection
  - Plugin Logging
DEPENDENCIES
============
* libpsm2 w/NCCL support, PSM2 v12.0/PSM2_11.2.229 or newer
- libpsm2 should be built with CUDA support. See opa-psm2/README for how to build with CUDA support
* hfi1 w/NVIDIA support
* CUDA 12.3 or later installed on GPU nodes
* NCCL 2.20.5 or later installed on GPU nodes
BUILDING
========
Change directory (cd) to the psm2-nccl clone and run "make". This will build `${BUILDDIR}/libnccl-net.so`.
MAKE VARIABLES
==============
Here are some important/useful variables for controlling the build behavior:
- `BUILDDIR` [default=`.`]: controls where the generated libnccl-net.so is placed.
- `PSM2_INCLUDE` [default=`/usr/include`]: directory containing the PSM2 header files.
- `PSM2_LIB` [default=`/usr/lib64`]: directory containing the PSM2 dynamic library.
- `DEBUG` [default=`0`]: when `1`, build the PSM2-NCCL plugin with debug info and debug-level optimization. Otherwise, build without debug info and at a higher optimization level.
See Makefile for other directory and build variables.
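As an illustration of combining these variables, the sketch below assembles a debug-build command line and prints it rather than running it. The install prefix `/opt/psm2` and the helper name `psm2_nccl_make_cmd` are hypothetical, not something this README prescribes; substitute the location of your CUDA-enabled libpsm2.

```shell
# Hypothetical install prefix for a CUDA-enabled libpsm2 (an assumption).
PSM2_PREFIX="${PSM2_PREFIX:-/opt/psm2}"

# Print (do not run) the make command for a debug build placed in ./build.
psm2_nccl_make_cmd() {
    echo "make BUILDDIR=build" \
         "PSM2_INCLUDE=${PSM2_PREFIX}/include" \
         "PSM2_LIB=${PSM2_PREFIX}/lib64" \
         "DEBUG=1"
}

psm2_nccl_make_cmd
```

Run the printed command from the psm2-nccl clone. Whether the Makefile creates `BUILDDIR` when it is missing is not stated here, so create the directory first if in doubt.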
RUNNING
=======
Add the directory containing psm2-nccl's libnccl-net.so to LD_LIBRARY_PATH and ensure LD_LIBRARY_PATH is set in the environment of every rank in the job.
Run your NCCL application using Open MPI's mpirun, for example:
mpirun -host <host1>:2,<host2>:2 -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -x PSM2_MULTI_EP=1 -x LD_LIBRARY_PATH -x NCCL_NET_GDR_LEVEL=5 build/all_reduce_perf
PSM2-NCCL only supports Open MPI at present. Some parts of PSM2 depend on the MPI_LOCALNRANKS and MPI_LOCALRANKID environment variables and will perform sub-optimally if those variables are not set.
You may use CUDA-aware or CUDA-naive Open MPI, but per the NCCL User Guide (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi), using NCCL and CUDA-aware MPI operations concurrently may cause deadlocks.
Neither PSM2 nor PSM2-NCCL supports more than one GPU per host process.
PSM2-NCCL PLUGIN GENERAL USAGE NOTES
====================================
GPUDirect Support
-----------------
The PSM2-NCCL plugin has GPUDirect support enabled by default. This default should not be changed.
Restrictions:
* NCCL_NET_GDR_READ=0 must be set in the job environment to disable GPUDirect on the send side. Not doing so may result in deadlocks.
* NCCL_PROTO=^LL128 must be set in the job environment to prevent NCCL from using LL128 with the PSM2-NCCL plugin. Not doing so may result in silent data corruption.
* NCCL_CUMEM_ENABLE=0 must be set in the job environment for NCCL 2.19 and newer. Not doing so may result in an abort in libpsm2.
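Because omitting any of these settings can fail silently or only at run time, a pre-flight check in the launch script may help. The function below is a minimal sketch and not part of PSM2-NCCL; the name `check_psm2_nccl_env` is hypothetical. As a simplification, it accepts only the exact values listed above.

```shell
# Pre-flight check for the GPUDirect restrictions listed above.
# Returns 0 if all three variables hold the required values, 1 otherwise.
check_psm2_nccl_env() {
    rc=0
    if [ "${NCCL_NET_GDR_READ:-}" != "0" ]; then
        echo "NCCL_NET_GDR_READ must be 0 (risk: deadlocks)" >&2
        rc=1
    fi
    # Simplification: other NCCL_PROTO values that exclude LL128 would also
    # be safe, but this sketch only recognizes the value from this README.
    if [ "${NCCL_PROTO:-}" != "^LL128" ]; then
        echo "NCCL_PROTO must be ^LL128 (risk: silent data corruption)" >&2
        rc=1
    fi
    if [ "${NCCL_CUMEM_ENABLE:-}" != "0" ]; then
        echo "NCCL_CUMEM_ENABLE must be 0 (risk: abort in libpsm2)" >&2
        rc=1
    fi
    return "$rc"
}
```

With Open MPI, the corresponding mpirun flags are `-x NCCL_NET_GDR_READ=0 -x NCCL_PROTO=^LL128 -x NCCL_CUMEM_ENABLE=0`. Calling `check_psm2_nccl_env || exit 1` before mpirun only validates the launching shell's environment, so the variables still need to be forwarded to every rank.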
PSM2-NCCL and PSM2 Multi-Endpoint
----------------------------------
Each NCCL rank (GPU) in a NCCL job typically requires one send and one receive NCCL communicator object per remote (i.e., not on same host) NCCL peer-rank.
PSM2-NCCL uses one PSM2 endpoint for all NCCL send and receive communicator objects associated with a host rank. This is called PSM2-NCCL shared-EP mode. This behavior can be changed with the PSM2_NCCL_SHARED_EP environment variable.
PSM2 by default does not allow more than one PSM2 endpoint open at a time within a process. Attempts to open additional endpoints will fail on psm2_ep_open() with the error 'PSM2_TOO_MANY_ENDPOINTS'. This can happen when running a NCCL job with PSM2-NCCL in Open MPI where Open MPI also uses PSM2 directly (PSM2 MTL) or indirectly (via OFI MTL).
This problem can be solved by enabling PSM2 multi-endpoint with PSM2_MULTI_EP=1. Enabling PSM2 multi-endpoint disables PSM2 context-sharing.
PSM2 Endpoint Counts
--------------------
The number of PSM2 endpoints on a node is limited by the number of hfi1 contexts. Each PSM2 endpoint requires an hfi1 context. hfi1 will typically have as many contexts as there are physical CPU cores on the system. The number of hfi1 contexts can be increased up to a limit of 160 using the 'num_user_contexts' kernel module parameter.
PSM2 by default uses a feature called context-sharing that allows up to eight PSM2 endpoints to share one hfi1 context. Enabling PSM2 multi-endpoint disables PSM2 context-sharing.
For a NCCL job that uses PSM2-NCCL in shared-EP mode (the default) and is run through Open MPI with PSM2, each host rank requires two PSM2 endpoints: one for Open MPI and one for PSM2-NCCL. Consequently, PSM2_MULTI_EP=1 must be set in the job environment. Recall that PSM2-NCCL can only handle one GPU per host process, so M host processes are required for M NCCL ranks (GPUs) on a node in a job. Because multi-endpoint disables context-sharing, each endpoint needs its own hfi1 context. Assuming the hfi1 default of num_user_contexts=<num physical cores>, a system with 36 physical CPU cores could host 18 (PSM2-NCCL, OMPI) ranks => 18 NCCL ranks (GPUs).
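The rank limit above is the number of hfi1 contexts divided by the number of PSM2 endpoints each host rank opens. The sketch below works through that arithmetic; the function name is hypothetical, and it assumes multi-endpoint is enabled (so no context-sharing) and that nothing else on the node consumes hfi1 contexts.

```shell
# Maximum host ranks per node = hfi1 contexts / PSM2 endpoints per rank.
# Integer division: a leftover context cannot host a partial rank.
max_ranks_per_node() {
    contexts=$1
    eps_per_rank=$2
    echo $(( contexts / eps_per_rank ))
}

max_ranks_per_node 36 2     # 36-core default: prints 18
max_ranks_per_node 160 2    # num_user_contexts raised to the 160 limit: prints 80
```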
HFI Selection
-------------
NCCL may sometimes not select the correct HFI in multi-HFI environments. For example, NCCL may select an HFI across NUMA or socket boundaries from the GPU if there are no HFIs and GPUs within acceptable distances according to NCCL (NCCL_NET_GDR_LEVEL). In order to force NCCL to use the HFI defined by HFI_UNIT, set PSM2_NCCL_USE_NCCL_DEV=0 and set HFI_UNIT to select the HFI closest to the GPU. The HFI selected by NCCL can be determined with NCCL_DEBUG logging (see the "Plugin Logging" section of this README).
PSM2_NCCL_USE_NCCL_DEV=0 may need to be used with NCCL_NET_GDR_LEVEL=5 if the system topology does not have the HFI and GPU behind the same NUMA node or PCIe switch. NCCL_NET_GDR_LEVEL determines the maximum (GPU,HFI) distance for which NCCL will use DMA, i.e. pass a device pointer to PSM2-NCCL instead of using a bounce buffer.
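One way to set HFI_UNIT per rank is a small wrapper that each rank runs before the application. The sketch below assumes a hypothetical topology with two HFIs (units 0 and 1), where the first half of the local ranks sits closest to unit 0 and the rest closest to unit 1; adjust the mapping to your system. The function names are illustrative. `OMPI_COMM_WORLD_LOCAL_RANK` and `OMPI_COMM_WORLD_LOCAL_SIZE` are set by Open MPI for each launched process.

```shell
# Hypothetical mapping for a two-HFI node: first half of the local ranks
# uses HFI 0, the remainder uses HFI 1. Replace with your real topology.
pick_hfi_unit() {
    local_rank=$1
    local_size=$2
    half=$(( (local_size + 1) / 2 ))
    if [ "$local_rank" -lt "$half" ]; then
        echo 0
    else
        echo 1
    fi
}

# Export the selection for this rank, then run the given command.
run_with_hfi() {
    PSM2_NCCL_USE_NCCL_DEV=0
    HFI_UNIT=$(pick_hfi_unit "${OMPI_COMM_WORLD_LOCAL_RANK:-0}" \
                             "${OMPI_COMM_WORLD_LOCAL_SIZE:-1}")
    export PSM2_NCCL_USE_NCCL_DEV HFI_UNIT
    "$@"
}
```

Saved in a wrapper script (for example a hypothetical `hfi_wrap.sh` whose last line is `run_with_hfi "$@"`), this can be placed between the mpirun options and the application, as in `mpirun ... ./hfi_wrap.sh build/all_reduce_perf`.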
Plugin Logging
--------------
PSM2-NCCL uses NCCL's logging system for debug output. All PSM2-NCCL output is logged under the NCCL 'NET' logging subsystem. To get PSM2-NCCL debug output, add '-x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=NET -x PSM2_NCCL_LOG_LEVEL=2' to your mpirun line.
See src/include/psm2_nccl_debug.h for PSM2-NCCL log levels.
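For illustration, the sketch below prints the mpirun line from the RUNNING section with the three logging options appended. The host names `node01` and `node02` and the helper name are placeholders, and the command is printed rather than executed.

```shell
# Placeholder hosts (an assumption); replace with your own host names.
HOSTS="${HOSTS:-node01:2,node02:2}"

# Print the RUNNING example with PSM2-NCCL debug logging enabled.
psm2_nccl_debug_cmd() {
    echo "mpirun -host ${HOSTS}" \
         "-x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -x PSM2_MULTI_EP=1" \
         "-x LD_LIBRARY_PATH -x NCCL_NET_GDR_LEVEL=5" \
         "-x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=NET -x PSM2_NCCL_LOG_LEVEL=2" \
         "build/all_reduce_perf"
}

psm2_nccl_debug_cmd
```

This mirrors the RUNNING example only; any other variables your job requires, such as those under "GPUDirect Support", still need to be added with further `-x` options.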