Skip to content

ML4HPC/scalable_DNN

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

scalable_DNN

Scalable DNN

Both LSGD.py and CSGD.py are variation based on https://github.com/pytorch/examples/blob/master/imagenet/main.py. Hence, both needs torchvison to run.

CSGD.py is a conventional synchronous distributed SGD implementation for baseline test. LSGD.py is the Layered SGD implmentation.

Usage: When one node has four GPUs and 64 nodes (256 worker) are used in the running:

mpirun -np 320 python LSGD.py -a resnet50 --epoch 100 --batch-size 64 --gpu-num 4 --lr 6.4 [ImageNet path] mpirun -np 256 python CSGD.py -a resnet50 --epoch 100 --batch-size 64 --lr 6.4 [ImageNet path]

Since LSGD need one additional CPU communicator, it needs one additional MPI process per node. Since the number of node is 64, it should be 256 + 64 => 320. When you use 512 node having 8 GPUs, then the number of MPI process you need is 9*512 => 4608.

Releases

No releases published

Packages

No packages published

Languages