I'm trying to train the librimix recipe and I keep getting the same error whenever I try to use my GPU for training:
RuntimeError("Distributed package doesn't have NCCL " "built in")
torch.cuda.current_device() returns device 0 in my Python session, but when I launch training like this:
./run.sh --stage 2 --id 0
to train my model with the GPU, it raises that runtime error.
Is it necessary to have NCCL on my system to train the example, or am I just making a mistake somewhere in the training process?
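For reference, this is the kind of quick check I run locally to see which distributed backends my PyTorch build actually ships (the commented values are just what I expect on my setup, given the error below):

import torch
import torch.distributed as dist

print(torch.cuda.is_available())     # True, matching the "GPU available: True (cuda)" line below
print(torch.cuda.current_device())   # 0
print(dist.is_nccl_available())      # presumably False on this Windows build, hence the error
print(dist.is_gloo_available())      # gloo is the distributed backend PyTorch ships on Windows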
This is my complete output, in case anyone can help me:
Results from the following experiment will be stored in exp/train_convtasnet_4a19572d
Stage 2: Training
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [cie-dpt-71969.dyc.a.unavarra.es]:53168 (system error: 10049 - La dirección solicitada no es válida en este contexto.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [cie-dpt-71969.dyc.a.unavarra.es]:53168 (system error: 10049 - La dirección solicitada no es válida en este contexto.).
{'data': {'n_src': 2,
'sample_rate': 8000,
'segment': 3,
'task': 'sep_clean',
'train_dir': 'data/wav8k/min/metadata/train-360',
'valid_dir': 'data/wav8k/min/metadata/dev'},
'filterbank': {'kernel_size': 16, 'n_filters': 512, 'stride': 8},
'main_args': {'exp_dir': 'exp/train_convtasnet_4a19572d', 'help': None},
'masknet': {'bn_chan': 128,
'hid_chan': 512,
'mask_act': 'relu',
'n_blocks': 8,
'n_repeats': 3,
'skip_chan': 128},
'optim': {'lr': 0.001, 'optimizer': 'adam', 'weight_decay': 0.0},
'positional arguments': {},
'training': {'batch_size': 24,
'early_stop': True,
'epochs': 200,
'half_lr': True,
'num_workers': 4}}
Drop 0 utterances from 50800 (shorter than 3 seconds)
Drop 0 utterances from 3000 (shorter than 3 seconds)
Traceback (most recent call last):
File "C:\Users\jaulab\Desktop\SourceSeparation\asteroid\egs\librimix\ConvTasNet\train.py", line 143, in
main(arg_dic)
File "C:\Users\jaulab\Desktop\SourceSeparation\asteroid\egs\librimix\ConvTasNet\train.py", line 109, in main
trainer.fit(system)
File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 571, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 938, in _run
self.strategy.setup_environment()
File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\ddp.py", line 143, in setup_environment
self.setup_distributed()
File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\pytorch_lightning\strategies\ddp.py", line 191, in setup_distributed
_init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\lightning_fabric\utilities\distributed.py", line 258, in _init_dist_connection
torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
default_pg = _new_process_group_helper(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\jaulab\SSS_Enviroment\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
Thank you in advance.
Reading the link, it seems that when training with my GPU the recipe tries to use NCCL. However, I'm training the model on Windows, where NCCL is not available. Any ideas on how I can solve this? Do I have to try another OS, or is there a way to train without NCCL?
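From what I've read so far, the fix I'm considering (just a sketch, not verified; I'm assuming the recipe's train.py builds a standard pytorch_lightning.Trainer that I can edit) would be to force the gloo backend, which PyTorch does ship on Windows, instead of NCCL:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
    accelerator="gpu",
    devices=1,
    # Use gloo instead of the default NCCL backend for DDP, since NCCL is not
    # built into PyTorch on Windows. With a single GPU, DDP could perhaps be
    # dropped entirely instead.
    strategy=DDPStrategy(process_group_backend="gloo"),
)

Would that be the right approach, or is single-GPU training without any DDP strategy the intended way to run this recipe on Windows?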