-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error on Multi GPU Training #18
Comments
This issue might be setup related (see link1 and link2). I suggest to go through debugging steps indicated by the Pytorch docs: export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=INFO and another run with export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL |
Hi @magehrig thanks for the reply. I have tried the debugging mentioned in the link you mentioned but nothing seems to work. The output with
The another run with export
|
Are you using NCCL or GLOO? |
Hi @mauk95 |
@magehrig I am using NCCL. Yes I tried GLOO as well but no success yet. |
Hi @Hatins I tried your suggestion, even set the BATCH_SIZE_PER_GPU=1 but same error. I am not sure what is the issue here. |
Sorry @mauk95, but this is really hard to debug since I cannot reproduce this. Have you successfully run other projects in Pytorch DDP mode on the same machine/cluster? If yes, you probably have to break the code down to a minimal working example and add complexity step by step to figure out where it breaks. |
Hi, I am getting the following error on running multi-gpu training on gen4 dataset using the command provided in the README instructions:
�[34m�[1mwandb�[39m�[22m: logging graph, to disable use
wandb.watch(log_graph=False)Using 16bit native Automatic Mixed Precision (AMP) Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default
ModelSummarycallback. GPU available: True (cuda), used: True TPU available: False, using: 0 TPU cores IPU available: False, using: 0 IPUs HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0)was configured so 100% of the batches per epoch will be used..
Trainer(limit_val_batches=1.0)was configured so 100% of the batches will be used.. Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2 [2023-07-14 17:56:00,361][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0 [2023-07-14 17:56:10,371][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:56:20,380][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:56:30,384][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:56:40,389][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:56:50,395][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:57:00,404][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:57:10,413][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:57:20,416][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:57:30,422][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:57:40,427][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:57:50,430][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:58:00,433][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:58:10,440][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:58:20,442][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:58:30,445][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:58:40,447][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:58:50,450][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:59:00,459][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:59:10,461][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:59:20,472][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:59:30,474][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:59:40,478][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 17:59:50,481][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:00:00,487][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:00:10,493][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:00:20,503][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:00:30,512][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:00:40,521][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:00:50,529][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:01:00,533][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:01:10,536][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:01:20,541][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:01:30,542][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:01:40,546][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:01:50,548][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:02:00,554][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:02:10,558][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:02:20,562][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:02:30,567][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:02:40,569][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:02:50,573][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:03:00,577][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:03:10,583][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:03:20,588][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:03:30,615][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:03:40,617][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:03:50,627][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:04:00,635][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:04:10,641][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:04:20,646][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:04:30,649][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:04:40,660][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:04:50,661][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:05:00,667][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:05:10,671][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:05:20,682][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:05:30,685][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:05:40,690][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:05:50,696][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:06:00,701][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:06:10,707][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:06:20,711][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:06:30,715][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:06:40,723][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:06:50,725][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:07:00,726][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:07:10,735][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:07:20,736][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:07:30,741][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:07:40,750][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:07:50,752][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:08:00,754][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:08:10,764][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:08:20,771][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:08:30,772][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:08:40,777][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:08:50,780][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:09:00,789][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:09:10,799][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:09:20,803][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:09:30,811][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:09:40,813][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:09:50,816][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:10:00,827][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:10:10,828][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:10:20,836][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:10:30,837][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:10:40,840][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:10:50,841][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:11:00,845][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:11:10,853][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:11:20,856][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:11:30,860][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:11:40,864][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:11:50,869][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:12:00,875][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:12:10,878][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:12:20,889][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:12:30,898][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:12:40,905][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:12:50,907][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:13:00,911][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:13:10,918][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:13:20,923][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:13:30,932][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:13:40,939][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:13:50,949][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:14:00,956][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:14:10,964][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:14:20,972][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:14:30,979][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:14:40,984][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:14:50,988][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:15:00,994][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:15:10,998][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:15:21,005][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:15:31,010][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:15:41,020][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:15:51,024][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:16:01,029][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:16:11,035][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:16:21,040][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:16:31,044][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:16:41,051][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:16:51,054][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:17:01,059][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:17:11,063][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:17:21,067][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:17:31,077][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:17:41,080][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:17:51,086][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:18:01,087][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:18:11,092][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:18:21,096][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:18:31,102][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:18:41,107][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:18:51,110][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:19:01,113][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:19:11,117][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:19:21,123][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:19:31,128][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:19:41,133][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:19:51,140][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:20:01,144][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:20:11,150][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:20:21,154][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:20:31,160][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:20:41,164][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:20:51,166][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:21:01,169][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:21:11,176][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:21:21,179][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:21:31,182][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:21:41,193][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:21:51,201][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:22:01,210][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:22:11,220][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:22:21,224][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:22:31,230][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:22:41,239][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:22:51,250][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:01,256][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:11,260][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:21,263][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:31,265][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:41,275][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:51,280][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:01,283][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:11,291][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:21,300][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:31,309][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:41,318][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:51,328][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:01,334][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:11,340][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:21,346][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:31,349][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:41,359][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:51,360][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) Error executing job with overrides: ['model=rnndet', 'dataset=gen4', 'dataset.path=/netscratch/mukhan/thesis/Data/gen4/', 'wandb.project_name=RVT', 'wandb.group_name=1mpx', '+experiment/gen4=base.yaml', 'hardware.gpus=[0,1]', 'batch_size.train=12', 'batch_size.eval=12', 'hardware.num_workers.train=6', 'hardware.num_workers.eval=2'] Traceback (most recent call last): File "/netscratch/mukhan/RVT/train.py", line 138, in main trainer.fit(model=module, ckpt_path=ckpt_path, datamodule=data_module) File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit call._call_and_handle_interrupt( File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt return trainer_fn(*args, **kwargs) File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl self._run(model, ckpt_path=self.ckpt_path) File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run self.strategy.setup_environment() File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment self.setup_distributed() File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 204, in setup_distributed _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout) File "/opt/conda/envs/rvt/lib/python3.9/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs) File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 920, in init_process_group _store_based_barrier(rank, store, timeout) File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 459, in _store_based_barrier raise RuntimeError( RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
I have ran the job on SLURM on 2 V100-32GB GPUS with --cpus-per-task=6. Please let me know what is the issue, thanks.
The text was updated successfully, but these errors were encountered: