From b0e062b34172a38c9bb94d3302bcb5f18918414d Mon Sep 17 00:00:00 2001
From: Xiaoming Zhao
Date: Tue, 3 Dec 2024 01:06:00 -0800
Subject: [PATCH] Fixed Issue for `torchrun` command for `train_cifar10_ddp.py` (#149)

* Fixed global_step in train_cifar10_ddp.py

* fixed torchrun command for train_cifar10_ddp.py

* Update train_cifar10_ddp.py
---
 examples/images/cifar10/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/examples/images/cifar10/README.md b/examples/images/cifar10/README.md
index 4d1e3ae..edc5a53 100644
--- a/examples/images/cifar10/README.md
+++ b/examples/images/cifar10/README.md
@@ -26,10 +26,10 @@ python3 train_cifar10.py --model "icfm" --lr 2e-4 --ema_decay 0.9999 --batch_siz
 python3 train_cifar10.py --model "fm" --lr 2e-4 --ema_decay 0.9999 --batch_size 128 --total_steps 400001 --save_step 20000
 ```
 
-Note that you can train all our methods in parallel using multiple GPUs and DistributedDataParallel. You can do this by providing the number of GPUs, setting the parallel flag to True and providing the master address and port in the command line. As an example:
+Note that you can train all our methods in parallel using multiple GPUs and DistributedDataParallel. You can do this by providing the number of GPUs, setting the parallel flag to True, and providing the master address and port in the command line. Please refer to [the official documentation for usage details](https://pytorch.org/docs/stable/elastic/run.html#usage). As an example:
 
 ```bash
-torchrun --nproc_per_node=NUM_GPUS_YOU_HAVE train_cifar10_ddp.py --model "otcfm" --lr 2e-4 --ema_decay 0.9999 --batch_size 128 --total_steps 400001 --save_step 20000 --parallel True --master_addr "MASTER_ADDR" --master_port "MASTER_PORT"
+torchrun --standalone --nnodes=1 --nproc_per_node=NUM_GPUS_YOU_HAVE train_cifar10_ddp.py --model "otcfm" --lr 2e-4 --ema_decay 0.9999 --batch_size 128 --total_steps 400001 --save_step 20000 --parallel True --master_addr "MASTER_ADDR" --master_port "MASTER_PORT"
 ```
 
 To compute the FID from the OT-CFM model at end of training, run:
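
For reference only (not part of the patch): the `--standalone --nnodes=1` form added above covers the single-machine case. Below is a hedged sketch of how the same run could be launched across two machines using torchrun's documented rendezvous flags (`--nnodes`, `--node_rank`, `--master_addr`, `--master_port`, see the linked usage page) instead of `--standalone`. The arguments after `train_cifar10_ddp.py` are copied unchanged from the command in the patch; whether the script's own `--master_addr`/`--master_port` flags are still needed under torchrun is an assumption.

```bash
# Illustrative two-node launch (assumption, not part of this patch).
# Run the same command on both machines, changing only --node_rank
# (0 on the node whose address is MASTER_ADDR, 1 on the other).
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=NUM_GPUS_YOU_HAVE \
  --master_addr="MASTER_ADDR" --master_port="MASTER_PORT" \
  train_cifar10_ddp.py --model "otcfm" --lr 2e-4 --ema_decay 0.9999 \
  --batch_size 128 --total_steps 400001 --save_step 20000 --parallel True \
  --master_addr "MASTER_ADDR" --master_port "MASTER_PORT"
```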