How to train using distributed mode? #19

handyzeng · 2023-03-30T04:06:04Z

When I try to train with the dist_train.sh tool, I get the following error:

with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Traceback (most recent call last):
  File "./tools/train.py", line 108, in
    main()
  File "./tools/train.py", line 104, in main
    logger=logger)
  File "/sas_data/e01163/C-HOI/workspace/C-HOI/mmdet/apis/train.py", line 62, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/sas_data/e01163/C-HOI/workspace/C-HOI/mmdet/apis/train.py", line 256, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/sas_data/e01163/C-HOI/workspace/mmcv/mmcv/runner/runner.py", line 368, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/sas_data/e01163/C-HOI/workspace/mmcv/mmcv/runner/runner.py", line 267, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/sas_data/e01163/C-HOI/workspace/C-HOI/mmdet/apis/train.py", line 41, in batch_processor
    losses = model(**data)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 888, in forward
    output = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/sas_data/e01163/C-HOI/workspace/C-HOI/mmdet/core/fp16/decorators.py", line 49, in new_func
    return old_func(*args, **kwargs)
  File "/sas_data/e01163/C-HOI/workspace/C-HOI/mmdet/models/detectors/base.py", line 95, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/sas_data/e01163/C-HOI/workspace/C-HOI/mmdet/models/detectors/cascade_rcnn_rel.py", line 414, in forward_train
    x = self.extract_feat(img)
  File "/sas_data/e01163/C-HOI/workspace/C-HOI/mmdet/models/detectors/cascade_rcnn_rel.py", line 165, in extract_feat
    x = self.backbone(img)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/sas_data/e01163/C-HOI/workspace/C-HOI/mmdet/models/backbones/resnet.py", line 506, in forward
    x = self.conv1(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    self.padding, self.dilation, self.groups)
TypeError: conv2d() received an invalid combination of arguments - got (DataContainer, Parameter, NoneType, tuple, tuple, tuple, int), but expected one of:
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (DataContainer, Parameter, NoneType, tuple, tuple, tuple, int)
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (DataContainer, Parameter, NoneType, tuple, tuple, tuple, int)
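
A note on reading the error: the forward call passes through torch/nn/parallel/distributed.py, and the first argument that reaches conv2d is still a DataContainer rather than a Tensor. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that the batch is never unwrapped on the distributed path. The sketch below is a pure-Python illustration of that failure mode only. The `DataContainer` class, `unwrap`, and `conv_like` here are hypothetical stand-ins written for this example; they are not the mmcv or PyTorch implementations.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DataContainer:
    """Hypothetical stand-in for the wrapper type named in the error (not mmcv's class)."""
    data: Any


def unwrap(obj: Any) -> Any:
    """Recursively replace DataContainer wrappers with the payload they carry."""
    if isinstance(obj, DataContainer):
        return unwrap(obj.data)
    if isinstance(obj, dict):
        return {key: unwrap(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(unwrap(value) for value in obj)
    return obj


def conv_like(x: Any) -> float:
    """Stand-in for a layer that accepts only raw data (here: a list of floats)."""
    if not isinstance(x, list):
        raise TypeError(f"expected list, got {type(x).__name__}")
    return sum(x)


# A batch as a data loader might hand it over: every field is still wrapped.
batch = {"img": DataContainer([1.0, 2.0, 3.0]), "img_meta": DataContainer({"id": 7})}

# Passing the wrapped field straight to the layer fails, analogous to the conv2d error.
try:
    conv_like(batch["img"])
except TypeError as exc:
    error_message = str(exc)

# After unwrapping, the same call succeeds.
result = conv_like(unwrap(batch)["img"])
print(error_message)
print(result)
```

If this reading is right, the thing to check is which parallel wrapper the distributed path in mmdet/apis/train.py puts around the model, and whether that wrapper unwraps the batch before calling forward.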
