Error on Multi GPU Training #18

Open
mauk95 opened this issue Jul 14, 2023 · 7 comments

mauk95 commented Jul 14, 2023

Hi, I am getting the following error when running multi-GPU training on the gen4 dataset using the command provided in the README instructions:

wandb: logging graph, to disable use wandb.watch(log_graph=False)
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[2023-07-14 17:56:00,361][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-07-14 17:56:10,371][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-14 17:56:20,380][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[... the same "Waiting in store based barrier" line for rank 0 repeats about every 10 seconds, with only the timestamp changing, from 17:56:30 through 18:22:11 ...]
[2023-07-14 18:22:21,224][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-14 18:22:31,230][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key:
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:22:41,239][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:22:51,250][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:01,256][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:11,260][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:21,263][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:31,265][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:41,275][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:51,280][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:01,283][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:11,291][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:21,300][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:31,309][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:41,318][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:51,328][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:01,334][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:11,340][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:21,346][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:31,349][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-14 18:25:41,359][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-14 18:25:51,360][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
Error executing job with overrides: ['model=rnndet', 'dataset=gen4', 'dataset.path=/netscratch/mukhan/thesis/Data/gen4/', 'wandb.project_name=RVT', 'wandb.group_name=1mpx', '+experiment/gen4=base.yaml', 'hardware.gpus=[0,1]', 'batch_size.train=12', 'batch_size.eval=12', 'hardware.num_workers.train=6', 'hardware.num_workers.eval=2']
Traceback (most recent call last):
  File "/netscratch/mukhan/RVT/train.py", line 138, in main
    trainer.fit(model=module, ckpt_path=ckpt_path, datamodule=data_module)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run
    self.strategy.setup_environment()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment
    self.setup_distributed()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 204, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 920, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 459, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
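
The key detail in the log above is `world_size=2, worker_count=1`: rank 0 expects two processes, but only one ever registered with the store before the 30-minute timeout. A minimal sketch for pulling that number out of a long log, assuming the message format shown here (the helper name `missing_workers` is hypothetical, not part of PyTorch):

```python
import re

# Matches the status suffix printed by the store-based barrier, both in the
# periodic "Waiting in store based barrier" lines and in the final RuntimeError.
BARRIER_RE = re.compile(
    r"rank: (?P<rank>\d+), (?:for )?key: (?P<key>\S+) "
    r"\(world_size=(?P<world_size>\d+), worker_count=(?P<worker_count>\d+), "
    r"timeout=(?P<timeout>[\d:]+)\)"
)


def missing_workers(log_text):
    """Return how many ranks never joined the barrier, based on the last
    barrier message in the log, or None if no such message is found."""
    matches = list(BARRIER_RE.finditer(log_text))
    if not matches:
        return None
    last = matches[-1]
    return int(last["world_size"]) - int(last["worker_count"])


sample = (
    "RuntimeError: Timed out initializing process group in store based barrier "
    "on rank: 0, for key: store_based_barrier_key:1 "
    "(world_size=2, worker_count=1, timeout=0:30:00)"
)
print(missing_workers(sample))
```

A non-zero result means at least one rank never started or never reached the rendezvous, which points at the launch setup rather than at the training code.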

I ran the job on SLURM with 2 V100-32GB GPUs and --cpus-per-task=6. Please let me know what the issue is. Thanks.
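
One thing that may be worth ruling out on SLURM is a mismatch between the number of tasks in the allocation and the number of GPUs passed via `hardware.gpus=[0,1]`. This is only a possible cause, not something the log confirms. The sketch below assumes a single-node job and reads `SLURM_NTASKS`, which SLURM sets inside a job; the helper name `check_slurm_tasks` is hypothetical.

```python
import os


def check_slurm_tasks(requested_gpus, env=None):
    """Compare the SLURM task count with the number of GPUs requested for
    training (single-node assumption) and return a short diagnosis string."""
    env = os.environ if env is None else env
    ntasks = env.get("SLURM_NTASKS")
    if ntasks is None:
        return "SLURM_NTASKS is not set: not running inside a SLURM allocation"
    if int(ntasks) != requested_gpus:
        return f"mismatch: SLURM_NTASKS={ntasks} but {requested_gpus} GPUs were requested"
    return f"ok: SLURM_NTASKS={ntasks} matches {requested_gpus} GPUs"


# Hypothetical environment for illustration: one task, two GPUs requested.
print(check_slurm_tasks(2, {"SLURM_NTASKS": "1"}))
```

Calling `check_slurm_tasks(2)` without the second argument inside the actual job reads the real environment and reports the same diagnosis for the allocation in use.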

@magehrig
Contributor

This issue might be setup-related (see link1 and link2).

I suggest going through the debugging steps described in the PyTorch docs.
Please show the output of running your command with

export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=INFO

and another run with

export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
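
If it is easier to script the two runs than to export the variables by hand, the same settings can be built in Python and passed to the training command. This is a minimal sketch: the helper name `debug_env` is hypothetical, and the launch itself is left as a comment because the exact command depends on your setup.

```python
import os


def debug_env(distributed_debug, base_env=None):
    """Return a copy of the environment with the two PyTorch debug
    variables from the comment above set."""
    if distributed_debug not in ("OFF", "INFO", "DETAIL"):
        raise ValueError(f"unsupported TORCH_DISTRIBUTED_DEBUG value: {distributed_debug}")
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CPP_LOG_LEVEL"] = "INFO"
    env["TORCH_DISTRIBUTED_DEBUG"] = distributed_debug
    return env


for level in ("INFO", "DETAIL"):
    env = debug_env(level)
    print(level, env["TORCH_CPP_LOG_LEVEL"], env["TORCH_DISTRIBUTED_DEBUG"])
    # To launch a run with these settings, pass `env` to
    # subprocess.run([...your training command...], env=env).
```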

@mauk95
Author

mauk95 commented Jul 16, 2023

Hi @magehrig, thanks for the reply. I tried the debugging steps from the links you mentioned, but nothing seems to work.

The output with export TORCH_DISTRIBUTED_DEBUG=INFO is as follows:

wandb: logging graph, to disable use wandb.watch(log_graph=False)
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used..
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[2023-07-16 12:52:05,230][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-07-16 12:52:15,236][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-16 12:52:25,246][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-16 12:52:35,254][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-16 12:52:45,264][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-16 12:52:55,270][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-16 12:53:05,277][torch.distributed.distributed_c10d][INFO] - 
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:53:15,282][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:53:25,293][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:53:35,302][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:53:45,307][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:53:55,311][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:05,314][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:15,319][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:25,324][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:35,334][torch.distributed.distributed_c10d][INFO] - Waiting in 
store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:45,345][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:55,351][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:05,353][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:15,356][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:25,367][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:35,368][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:45,372][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:55,377][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:05,382][torch.distributed.distributed_c10d][INFO] - Waiting in store 
based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:15,391][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:25,396][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:35,403][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:45,405][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:55,406][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:05,407][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:15,415][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:25,422][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:35,433][torch.distributed.distributed_c10d][INFO] - Waiting in store based 
barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:45,441][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:55,451][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:05,459][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:15,462][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:25,467][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:35,477][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:45,483][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:55,485][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:05,487][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to 
initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:15,497][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:25,508][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:35,512][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:45,516][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:55,518][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:05,522][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:15,526][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:25,530][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:35,541][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize 
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:45,550][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:55,553][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:05,558][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:15,568][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:25,575][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:35,578][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:45,581][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:55,592][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:05,601][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process 
group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:15,603][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:25,606][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:35,611][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:45,621][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:55,626][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:05,632][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:15,637][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:25,649][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:35,653][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for 
rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:45,657][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:55,665][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:05,666][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:15,668][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:25,670][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:35,675][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:45,685][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:55,688][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:05,692][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, 
key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:15,696][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:25,703][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:35,704][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:45,713][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:55,717][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:05,723][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:15,725][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:25,728][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:35,733][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:45,739][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:55,748][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:05,751][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:15,763][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:25,772][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:35,778][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:45,789][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:55,793][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:05,797][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:15,801][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:25,809][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:35,811][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:45,819][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:55,823][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:05,830][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:15,832][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:25,839][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:35,850][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:45,854][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:55,855][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:05,864][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:15,869][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:25,874][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:35,875][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:45,878][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:55,886][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:05,896][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:15,899][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:25,908][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:35,910][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:45,919][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:55,929][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:05,934][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:15,938][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:25,945][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:35,950][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:45,955][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:55,960][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:05,963][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:15,966][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:25,971][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:35,979][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:45,986][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:55,996][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:06,004][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:16,012][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:26,021][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:36,026][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:46,034][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:56,037][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:06,045][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:16,049][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:26,059][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:36,069][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:46,075][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:56,082][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:06,084][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:16,092][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:26,103][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:36,112][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:46,119][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:56,122][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:06,127][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:16,137][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:26,140][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:36,147][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:46,156][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:56,163][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:06,170][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:16,175][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:26,186][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:36,192][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:46,197][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:56,200][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:06,210][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:16,214][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:26,224][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:36,233][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:46,238][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:56,243][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:06,253][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:16,255][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:26,266][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:36,272][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:46,279][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:56,289][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:21:06,296][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:21:16,300][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:21:26,308][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:21:36,316][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-16 13:21:46,321][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-16 13:21:56,326][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
Error executing job with overrides: ['model=rnndet', 'dataset=gen4', 'dataset.path=/netscratch/mukhan/thesis/Data/gen4/', 'wandb.project_name=RVT', 'wandb.group_name=1mpx', '+experiment/gen4=base.yaml', 'hardware.gpus=[0,1]', 'batch_size.train=12', 'batch_size.eval=12', 'hardware.num_workers.train=6', 'hardware.num_workers.eval=2']
Traceback (most recent call last):
  File "/netscratch/mukhan/RVT/train.py", line 143, in main
    benchmark=config.reproduce.benchmark,
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run
    self.strategy.setup_environment()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment
    self.setup_distributed()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 204, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 920, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 459, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Another run with export TORCH_DISTRIBUTED_DEBUG=DETAIL gives the following output:

wandb: logging graph, to disable use wandb.watch(log_graph=False)
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used..
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Error executing job with overrides: ['model=rnndet', 'dataset=gen4', 'dataset.path=/netscratch/mukhan/thesis/Data/gen4/', 'wandb.project_name=RVT', 'wandb.group_name=1mpx', '+experiment/gen4=base.yaml', 'hardware.gpus=[0,1]', 'batch_size.train=12', 'batch_size.eval=12', 'hardware.num_workers.train=6', 'hardware.num_workers.eval=2']
Traceback (most recent call last):
  File "/netscratch/mukhan/RVT/train.py", line 143, in main
    benchmark=config.reproduce.benchmark,
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run
    self.strategy.setup_environment()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment
    self.setup_distributed()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 204, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1064, in _new_process_group_helper
    backend_class = _create_process_group_wrapper(
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3400, in _create_process_group_wrapper
    helper_pg = ProcessGroupGloo(store, rank, world_size, timeout=timeout)
RuntimeError: Socket Timeout
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
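
The first log reports world_size=2 with worker_count=1, which suggests that only one of the two expected processes reached the rendezvous. One way to narrow this down might be to print the rendezvous-related environment from every process and compare the values. The following is only a sketch: the helper names rendezvous_env and format_report are hypothetical, and the list of variables is an assumption about what is relevant here, not something taken from the RVT code.

```python
import os

# Environment variables commonly consulted during process-group rendezvous.
# Assumption: this selection is illustrative, not taken from the RVT code.
RENDEZVOUS_VARS = (
    "MASTER_ADDR",
    "MASTER_PORT",
    "WORLD_SIZE",
    "NODE_RANK",
    "LOCAL_RANK",
    "CUDA_VISIBLE_DEVICES",
    "NCCL_SOCKET_IFNAME",
    "GLOO_SOCKET_IFNAME",
)


def rendezvous_env(environ=None):
    """Return the selected variables, using '<unset>' for missing ones."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, "<unset>") for name in RENDEZVOUS_VARS}


def format_report(values, pid=None):
    """Format one 'NAME=value' line per variable, prefixed with the process id."""
    pid = os.getpid() if pid is None else pid
    return "\n".join(f"[pid {pid}] {name}={value}" for name, value in values.items())


if __name__ == "__main__":
    print(format_report(rendezvous_env()))
```

Calling print(format_report(rendezvous_env())) near the top of the training script would show whether a second process starts at all, and whether both processes agree on MASTER_ADDR and MASTER_PORT.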

@magehrig
Contributor

Are you using NCCL or GLOO?
Have you tried both?

@Hatins

Hatins commented Jul 18, 2023

Hi @mauk95
I ran into the same problem when using multi-GPU training (though not in RVT). In my case it was caused by insufficient GPU memory, which in turn affected the communication between the GPUs. You could try decreasing the batch_size and running again.

@mauk95
Author

mauk95 commented Jul 18, 2023

Are you using NCCL or GLOO? Have you tried both?

@magehrig I am using NCCL. Yes, I tried GLOO as well, but no success yet.

@mauk95
Author

mauk95 commented Jul 18, 2023

Hi @mauk95 I ran into the same problem when using multi-GPU training (though not in RVT). In my case it was caused by insufficient GPU memory, which in turn affected the communication between the GPUs. You could try decreasing the batch_size and running again.

Hi @Hatins, I tried your suggestion and even set BATCH_SIZE_PER_GPU=1, but I get the same error. I am not sure what the issue is here.

@magehrig
Contributor

magehrig commented Jul 18, 2023

Sorry @mauk95, but this is really hard to debug since I cannot reproduce this. Have you successfully run other projects in PyTorch DDP mode on the same machine/cluster? If yes, you probably have to break the code down to a minimal working example and add complexity step by step to figure out where it breaks.
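
A possible starting point for such a minimal example is sketched below. It is independent of RVT and PyTorch Lightning and only checks that two local processes can form a process group and complete one all_reduce. The function names pick_free_port, worker and main are hypothetical, and the 60-second timeout and the localhost rendezvous address are assumptions chosen for a single-machine test.

```python
import datetime
import os
import socket


def pick_free_port():
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def worker(rank, world_size, port, backend):
    """Join the process group, run one all_reduce, and report the result."""
    # Imported here so that each spawned process loads torch on its own.
    import torch
    import torch.distributed as dist

    # Assumption: single-machine test, so rendezvous happens on localhost.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)

    if backend == "nccl":
        # NCCL needs one GPU per rank.
        torch.cuda.set_device(rank)
        device = torch.device("cuda", rank)
    else:
        device = torch.device("cpu")

    dist.init_process_group(
        backend,
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(seconds=60),  # fail fast instead of after 30 minutes
    )
    value = torch.ones(1, device=device)
    dist.all_reduce(value)  # default reduction is SUM, so the result should equal world_size
    print(f"rank {rank}: all_reduce gave {value.item()}, expected {float(world_size)}")
    dist.destroy_process_group()


def main(world_size=2, backend="gloo"):
    """Spawn world_size local processes that all run worker()."""
    import torch.multiprocessing as mp

    port = pick_free_port()
    mp.spawn(worker, args=(world_size, port, backend), nprocs=world_size, join=True)
```

Saved as a script with main() called under an if __name__ == "__main__": guard, this can be run first with backend="gloo" and then with backend="nccl". If even this hangs or times out, the problem is more likely in the machine or cluster setup than in the training code.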
