Merge pull request #4 from mendygral/feature/dl-congestor
adding in a DL training congestor based off the Rice Data Science 201…
Showing 6 changed files with 239 additions and 130 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,7 +2,7 @@ | |
|
||
This benchmark suite consists of two applications: | ||
|
||
network_test: Full system network tests in random and natural ring, alltoall | ||
network_test: Full system network tests in random and natural ring, alltoall | ||
and allreduce | ||
|
||
network_load_test: Select full system network tests run with four congestors to | ||
|
@@ -45,15 +45,15 @@ or | |
|
||
aprun -n 2304 -N 36 ./network_load_test | ||
|
||
Each application has no arguments. | ||
Each application has no arguments. | ||
|
||
# Benchmarking Practices | ||
|
||
GPCNeT applications should be run at full system scale, in particular | ||
network_load_test. network_test can be run at any scale above 2 nodes to | ||
measure the capability of a network for complex communication patterns. | ||
|
||
network_load_test should not be run at much less than full system scale | ||
network_load_test should not be run at much less than full system scale | ||
(i.e., run on at least 95% of system nodes). The results | ||
will likely not be representative if the network has significant head room. Additionally, | ||
the spirit of this benchmark is that it is run with default network and MPI configuration | ||
|
@@ -62,8 +62,8 @@ network and MPI configuration is used, the baseline performance for communicatio | |
inspected with network_test prior to measuring congestion impacts with network_load_test. | ||
|
||
The primary tuning parameter users can use is the number of processes per NIC (PPN). | ||
We refer to process per NIC (rather than process per node) because modern nodes span a wide range of capabilities. | ||
Consider a dual socket node with 2 NICS and 6 GPUs vs a single socket single NIC node. | ||
We refer to process per NIC (rather than process per node) because modern nodes span a wide range of capabilities. | ||
Consider a dual socket node with 2 NICS and 6 GPUs vs a single socket single NIC node. | ||
The number of NICs is a reasonable proxy for expected communication capability. | ||
The higher the PPN the more the benchmark will push the network. For the network_test, | ||
higher PPN will push bandwidth per NIC (note the benchmark reports bandwidth per rank) | ||
|
@@ -95,7 +95,7 @@ congestors can be lessened by reducing the number of processes per NIC or | |
modifying the message sizes of congestors. | ||
|
||
Tuning of message sizes and loop counts is done with the defs at the beginning of | ||
network_test.c or network_load_test.c. For example, to modify the message size of | ||
network_test.c or network_load_test.c. For example, to modify the message size of | ||
the one-sided incast, look for this line in network_load_test.c: | ||
|
||
#define INCAST_MSG_COUNT 512 | ||
|
@@ -109,3 +109,9 @@ Please contact any of the following people if you have any questions. | |
* Taylor Groves ([email protected]) | ||
* Sudheer Chunduri ([email protected]) | ||
* Pete Mendygral ([email protected]) | ||
|
||
# ChangeLog # | ||
|
||
4/3/2020: Adding in a new congestor type that mimics the allreduce operations for DL training. It is based on | ||
the paper presented at the 2019 Rice Data Science Conference:
https://2019datascienceconference.sched.com/event/UYuZ/sharing-resources-in-the-age-of-deep-learning |
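
The sizing guidance in the README text above (run network_load_test on at least 95% of system nodes, and tune with processes per NIC) can be sketched as a few helper functions. This is only an illustration and is not part of GPCNeT: the function names are hypothetical, the rank count assumes every NIC on every node hosts the same number of processes, and the per-NIC bandwidth figure is a rough estimate that simply multiplies the per-rank bandwidth the benchmark reports by the PPN.

```c
/* Hypothetical helpers for planning a GPCNeT run; not part of the benchmark. */

/* Smallest node count that still covers at least 95% of the system,
 * computed as the integer ceiling of 0.95 * total_nodes. */
long min_nodes_for_load_test(long total_nodes)
{
    return (total_nodes * 95 + 99) / 100;
}

/* Total MPI ranks to launch, assuming each NIC on each node hosts
 * the same number of processes (ppn = processes per NIC). */
long total_ranks(long nodes, long nics_per_node, long ppn)
{
    return nodes * nics_per_node * ppn;
}

/* Rough per-NIC bandwidth estimate: the benchmark reports bandwidth
 * per rank, so scale by the number of ranks sharing one NIC.
 * Assumption: ranks on a NIC achieve similar bandwidth. */
double per_nic_bandwidth(double per_rank_bw, long ppn)
{
    return per_rank_bw * (double)ppn;
}
```

For instance, assuming one NIC per node, `total_ranks(64, 1, 36)` gives the 2304 ranks used in the `aprun -n 2304 -N 36` example, and `min_nodes_for_load_test(64)` indicates that a 64-node system should contribute at least 61 nodes to a network_load_test run.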