
[BUG]: PyTorch single-machine multi-GPU problem: ERROR: torch.distributed.elastic.multiprocessing.api:failed #3215

Closed
rabeisabigfool opened this issue Mar 23, 2023 · 30 comments
Labels
bug Something isn't working

Comments

@rabeisabigfool

rabeisabigfool commented Mar 23, 2023

🐛 Describe the bug

File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 186, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The client socket has failed to connect to any network address of (i-0b9e876c, 57748). The IPv6 network addresses of (i-0b9e876c, 57748) cannot be retrieved (gai error: -2 - Name or service not known). The IPv4 network addresses of (i-0b9e876c, 57748) cannot be retrieved (gai error: -2 - Name or service not known).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51863) of binary: /home/whong/anaconda3/envs/chatgpt/bin/python
Traceback (most recent call last):
File "/home/whong/anaconda3/envs/chatgpt/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./examples/train_reward_model.py FAILED

Failures:
[1]:
time : 2023-03-23_15:36:49
host : i-0B9E876C
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 51864)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-03-23_15:36:49
host : i-0B9E876C
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 51863)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Environment

No response

@rabeisabigfool rabeisabigfool added the bug Something isn't working label Mar 23, 2023
@JThh
Contributor

JThh commented Mar 23, 2023

The port might have been occupied. Can you try running with a different port number?
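For reference, a minimal sketch of pinning the rendezvous endpoint yourself (assumptions: a single machine, the NCCL backend, and an arbitrary free port 29501; with torchrun the equivalent flags are --master_addr and --master_port). Pointing MASTER_ADDR at 127.0.0.1 also sidesteps the hostname-resolution failure for i-0b9e876c shown in the traceback above:

```python
# Sketch only: assumes RANK/WORLD_SIZE are exported by the launcher
# (torchrun sets them automatically for every worker process).
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # avoid resolving the machine's hostname
os.environ.setdefault("MASTER_PORT", "29501")      # any unused port

dist.init_process_group(backend="nccl", init_method="env://")
```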

@rabeisabigfool
Author

The port might have been occupied. Can you try running with a different port number?

OK, I'll give it a try. Thank you.

@rabeisabigfool
Author

The port might have been occupied. Can you try running with a different port number?

Sorry, what do you mean by an occupied port here?

@JThh
Contributor

JThh commented Mar 24, 2023

The port number on which you launch the processes, i.e. the master/rendezvous port.

@rabeisabigfool
Author

rabeisabigfool commented Mar 24, 2023

The port number for which you launch the processes.

I checked and found that none of the ports used by the four GPUs are occupied. Why does this happen? Even after changing the port, the same error is still reported.

@JThh
Contributor

JThh commented Mar 27, 2023

When running inside a Docker environment, can you append --network=host to your docker run command so the container shares the host network?

@scarydemon2

same problem

@cauyxy

cauyxy commented Mar 31, 2023

+1

@akk-123

akk-123 commented Apr 2, 2023

same problem @JThh


@Honee-W

Honee-W commented Apr 3, 2023

Same problem. Single-node, single-process training works fine, but when nproc_per_node > 1, I get the same error.

@Honee-W

Honee-W commented Apr 6, 2023

world_size = int(os.environ["WORLD_SIZE"])
mp.spawn(main_worker, args=(world_size, args), nprocs=world_size)

This is my main function to start distributed training. When calling spawn, it passes a process index as the first argument (before args) to the target function, here main_worker, which should therefore be defined like this:

def main_worker(i, world_size, args):

Then set the device inside main_worker and move the model to it:

torch.cuda.set_device(i)
model = model.to(i)

I solved this problem by doing so; I hope it helps.
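To put the pieces together, here is a minimal, self-contained single-node sketch of the pattern described above (assumptions: NCCL backend, an arbitrary free port 29501, and a toy nn.Linear standing in for the real model; this is an illustration, not the repository's training script):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(i, world_size, args):
    # Rendezvous over localhost so no hostname lookup is required.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("nccl", rank=i, world_size=world_size)

    torch.cuda.set_device(i)                      # pin this process to GPU i
    model = torch.nn.Linear(10, 10).to(i)         # stand-in for your real model
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[i])

    # ... training loop goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    args = None                                   # placeholder for parsed CLI args
    world_size = torch.cuda.device_count()        # one process per visible GPU
    mp.spawn(main_worker, args=(world_size, args), nprocs=world_size)
```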

@Youly172

+1

@JThh
Contributor

JThh commented Apr 14, 2023

Thanks @Honee-W for sharing. I understand the issue better now.

model = model.to(torch.cuda.current_device()) would suffice. Would this be useful for you, @Youly172?

@Youly172

Youly172 commented Apr 14, 2023 via email

@Issues-translate-bot

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


The mail has been received~ Li Qiaoyan

@ifromeast

+1

1 similar comment
@Ozawa333

Ozawa333 commented May 8, 2023

+1

@Youly172

Youly172 commented May 8, 2023 via email

@Issues-translate-bot

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


The mail has been received~ Li Qiaoyan

@Youly172

Youly172 commented Jun 20, 2023 via email

@Issues-translate-bot

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


The mail has been received~ Li Qiaoyan

@ALLISWELL8

I ran into this problem too. How was it solved?


@Youly172

Youly172 commented Oct 13, 2023 via email

@Issues-translate-bot

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


The email has been received~Li Qiaoyan

@Yizhichaoai

(DGM4) root@autodl-container-602546be92-9be6991b:~/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main# sh train.sh
tcp://127.0.0.1:10031, ws:4, rank:0
tcp://127.0.0.1:10031, ws:4, rank:1
tcp://127.0.0.1:10031, ws:4, rank:2
tcp://127.0.0.1:10031, ws:4, rank:3


Namespace(checkpoint='ALBEF_4M.pth', config='configs/train.yaml', device='cuda', dist_backend='nccl', dist_url='tcp://127.0.0.1:10031', distributed=True, gpu=0, launcher='pytorch', log=True, log_num='20240729_231251', model_save_epoch=100, ngpus_per_node=4, output_dir='results', rank=0, resume=False, seed=777, text_encoder='bert-base-uncased', token_momentum=True, world_size=4)


{'train_file': ['/root/autodl-tmp/my_projects/DGM4/datasets/DGM4/metadata/train.json'], 'val_file': ['/root/autodl-tmp/my_projects/DGM4/datasets/DGM4/metadata/val.json'], 'bert_config': '/root/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main/configs/config_bert.json', 'image_res': 256, 'vision_width': 768, 'embed_dim': 256, 'batch_size_train': 32, 'batch_size_val': 64, 'temp': 0.07, 'queue_size': 65536, 'momentum': 0.995, 'alpha': 0.4, 'max_words': 50, 'label_smoothing': 0.0, 'loss_MAC_wgt': 0.1, 'loss_BIC_wgt': 1, 'loss_bbox_wgt': 0.1, 'loss_giou_wgt': 0.1, 'loss_TMG_wgt': 1, 'loss_MLC_wgt': 1, 'optimizer': {'opt': 'adamW', 'lr': 2e-05, 'lr_img': 0.0001, 'weight_decay': 0.02}, 'schedular': {'sched': 'cosine', 'lr': 2e-05, 'epochs': 50, 'min_lr': 1e-06, 'decay_rate': 1, 'warmup_lr': 1e-06, 'warmup_epochs': 10, 'cooldown_epochs': 0}}


Creating dataset
Traceback (most recent call last):
File "train.py", line 557, in
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(args, config))
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/root/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main/train.py", line 369, in main_worker
tokenizer = BertTokenizerFast.from_pretrained(args.text_encoder)
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1672, in from_pretrained
resolved_vocab_files[file_id] = cached_path(
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/file_utils.py", line 1329, in cached_path
output_path = get_from_cache(
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/file_utils.py", line 1552, in get_from_cache
raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

How do I solve this? I've been stuck on it for a long time.
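This particular failure is the tokenizer download rather than the distributed setup: from_pretrained cannot reach the Hugging Face Hub. A common workaround (sketch only; the local directory name is illustrative) is to download bert-base-uncased once on a machine with network access, save it, and point args.text_encoder at the local path; setting the environment variable TRANSFORMERS_OFFLINE=1 additionally stops transformers from attempting any network calls:

```python
from transformers import BertTokenizerFast

# One-time, on a machine with internet access:
tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
tok.save_pretrained("./bert-base-uncased-local")

# Afterwards, offline (e.g. set args.text_encoder to this path):
tok = BertTokenizerFast.from_pretrained("./bert-base-uncased-local")
```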


@Youly172

Youly172 commented Jul 29, 2024 via email

@Issues-translate-bot

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


The email has been received~Li Qiaoyan
