
[BUG]: PyTorch single-machine multi-GPU problem: ERROR: torch.distributed.elastic.multiprocessing.api:failed #3215

Closed
rabeisabigfool opened this issue Mar 23, 2023 · 30 comments
Labels
bug Something isn't working

Comments

@rabeisabigfool

rabeisabigfool commented Mar 23, 2023

🐛 Describe the bug

File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 186, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The client socket has failed to connect to any network address of (i-0b9e876c, 57748). The IPv6 network addresses of (i-0b9e876c, 57748) cannot be retrieved (gai error: -2 - Name or service not known). The IPv4 network addresses of (i-0b9e876c, 57748) cannot be retrieved (gai error: -2 - Name or service not known).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51863) of binary: /home/whong/anaconda3/envs/chatgpt/bin/python
Traceback (most recent call last):
File "/home/whong/anaconda3/envs/chatgpt/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/whong/anaconda3/envs/chatgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./examples/train_reward_model.py FAILED

Failures:
[1]:
time : 2023-03-23_15:36:49
host : i-0B9E876C
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 51864)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-03-23_15:36:49
host : i-0B9E876C
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 51863)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Environment

No response

@rabeisabigfool rabeisabigfool added the bug Something isn't working label Mar 23, 2023
@JThh
Contributor

JThh commented Mar 23, 2023

The port might have been occupied. Can you try running with a different port number?
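For reference, a minimal sketch of pinning the rendezvous endpoint yourself (assumptions: a single machine, the NCCL backend, and an arbitrary free port 29501; with torchrun the equivalent flags are --master_addr and --master_port). Pointing MASTER_ADDR at 127.0.0.1 also sidesteps the hostname-resolution failure for i-0b9e876c shown in the traceback above:

```python
# Sketch only: assumes RANK/WORLD_SIZE are exported by the launcher
# (torchrun sets them automatically for every worker process).
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # avoid resolving the machine's hostname
os.environ.setdefault("MASTER_PORT", "29501")      # any unused port

dist.init_process_group(backend="nccl", init_method="env://")
```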

@rabeisabigfool
Author

The port might have been occupied. Can you try running with a different port number?

OK, I'll give it a try. Thank you.

@rabeisabigfool
Author

The port might have been occupied. Can you try running with a different port number?

Sorry, what do you mean by an occupied port here?

@JThh
Contributor

JThh commented Mar 24, 2023

The port number on which you launch the processes, i.e. the master/rendezvous port.

@rabeisabigfool
Author

rabeisabigfool commented Mar 24, 2023

The port number for which you launch the processes.

I checked and found that none of the ports used by the four GPUs are occupied. Why does this happen? Even after changing the port, the same error is still reported.

@JThh
Contributor

JThh commented Mar 27, 2023

When running inside a Docker environment, can you append --network=host to your docker run command so the container shares the host network?

@scarydemon2

same problem

@cauyxy

cauyxy commented Mar 31, 2023

+1

@akk-123

akk-123 commented Apr 2, 2023

same problem @JThh


@Honee-W

Honee-W commented Apr 3, 2023

Same problem. Single-node, single-process training works fine, but when nproc_per_node > 1, I get the same error.

@Honee-W

Honee-W commented Apr 6, 2023

world_size = int(os.environ["WORLD_SIZE"])
mp.spawn(main_worker, args=(world_size, args), nprocs=world_size)

This is my main function to start distributed training. When calling spawn, it passes a process index as the first argument (before args) to the target function, here main_worker, which should therefore be defined like this:

def main_worker(i, world_size, args):

Then set the device inside main_worker and move the model to it:

torch.cuda.set_device(i)
model = model.to(i)

I solved this problem by doing so; I hope it helps.
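To put the pieces together, here is a minimal, self-contained single-node sketch of the pattern described above (assumptions: NCCL backend, an arbitrary free port 29501, and a toy nn.Linear standing in for the real model; this is an illustration, not the repository's training script):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(i, world_size, args):
    # Rendezvous over localhost so no hostname lookup is required.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("nccl", rank=i, world_size=world_size)

    torch.cuda.set_device(i)                      # pin this process to GPU i
    model = torch.nn.Linear(10, 10).to(i)         # stand-in for your real model
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[i])

    # ... training loop goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    args = None                                   # placeholder for parsed CLI args
    world_size = torch.cuda.device_count()        # one process per visible GPU
    mp.spawn(main_worker, args=(world_size, args), nprocs=world_size)
```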

@Youly172

+1

@JThh
Contributor

JThh commented Apr 14, 2023

Thanks @Honee-W for sharing. I understand the issue better now.

model = model.to(torch.cuda.current_device()) would suffice. Would this be useful for you, @Youly172?

@Youly172

Youly172 commented Apr 14, 2023 via email

@Issues-translate-bot

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


The mail has been received~ Li Qiaoyan

@ifromeast

+1

1 similar comment
@Ozawa333

Ozawa333 commented May 8, 2023

+1

@Youly172

Youly172 commented May 8, 2023 via email

@Issues-translate-bot

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


The mail has been received~ Li Qiaoyan

@Youly172

Youly172 commented Jun 20, 2023 via email

@Issues-translate-bot

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


The mail has been received~ Li Qiaoyan

@ALLISWELL8

I ran into this problem too. How was it solved?


@Youly172

Youly172 commented Oct 13, 2023 via email

@Issues-translate-bot

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


The email has been received~Li Qiaoyan

@Yizhichaoai

(DGM4) root@autodl-container-602546be92-9be6991b:~/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main# sh train.sh
tcp://127.0.0.1:10031, ws:4, rank:0
tcp://127.0.0.1:10031, ws:4, rank:1
tcp://127.0.0.1:10031, ws:4, rank:2
tcp://127.0.0.1:10031, ws:4, rank:3


Namespace(checkpoint='ALBEF_4M.pth', config='configs/train.yaml', device='cuda', dist_backend='nccl', dist_url='tcp://127.0.0.1:10031', distributed=True, gpu=0, launcher='pytorch', log=True, log_num='20240729_231251', model_save_epoch=100, ngpus_per_node=4, output_dir='results', rank=0, resume=False, seed=777, text_encoder='bert-base-uncased', token_momentum=True, world_size=4)


{'train_file': ['/root/autodl-tmp/my_projects/DGM4/datasets/DGM4/metadata/train.json'], 'val_file': ['/root/autodl-tmp/my_projects/DGM4/datasets/DGM4/metadata/val.json'], 'bert_config': '/root/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main/configs/config_bert.json', 'image_res': 256, 'vision_width': 768, 'embed_dim': 256, 'batch_size_train': 32, 'batch_size_val': 64, 'temp': 0.07, 'queue_size': 65536, 'momentum': 0.995, 'alpha': 0.4, 'max_words': 50, 'label_smoothing': 0.0, 'loss_MAC_wgt': 0.1, 'loss_BIC_wgt': 1, 'loss_bbox_wgt': 0.1, 'loss_giou_wgt': 0.1, 'loss_TMG_wgt': 1, 'loss_MLC_wgt': 1, 'optimizer': {'opt': 'adamW', 'lr': 2e-05, 'lr_img': 0.0001, 'weight_decay': 0.02}, 'schedular': {'sched': 'cosine', 'lr': 2e-05, 'epochs': 50, 'min_lr': 1e-06, 'decay_rate': 1, 'warmup_lr': 1e-06, 'warmup_epochs': 10, 'cooldown_epochs': 0}}


Creating dataset
Traceback (most recent call last):
File "train.py", line 557, in
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(args, config))
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/root/autodl-tmp/my_projects/DGM4/MultiModal-DeepFake-main/train.py", line 369, in main_worker
tokenizer = BertTokenizerFast.from_pretrained(args.text_encoder)
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1672, in from_pretrained
resolved_vocab_files[file_id] = cached_path(
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/file_utils.py", line 1329, in cached_path
output_path = get_from_cache(
File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/transformers/file_utils.py", line 1552, in get_from_cache
raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

How do I solve this? I've been stuck on it for a long time.
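This particular failure is the tokenizer download rather than the distributed setup: from_pretrained cannot reach the Hugging Face Hub. A common workaround (sketch only; the local directory name is illustrative) is to download bert-base-uncased once on a machine with network access, save it, and point args.text_encoder at the local path; setting the environment variable TRANSFORMERS_OFFLINE=1 additionally stops transformers from attempting any network calls:

```python
from transformers import BertTokenizerFast

# One-time, on a machine with internet access:
tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
tok.save_pretrained("./bert-base-uncased-local")

# Afterwards, offline (e.g. set args.text_encoder to this path):
tok = BertTokenizerFast.from_pretrained("./bert-base-uncased-local")
```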


@Youly172

Youly172 commented Jul 29, 2024 via email

@Issues-translate-bot

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


The email has been received~Li Qiaoyan
