MultiGPU efficient densenets are slow #36

Open
wandering007 opened this issue Apr 30, 2018 · 14 comments

wandering007 commented Apr 30, 2018

I just wanted to benchmark the new implementation of the efficient DenseNet with the code here. However, it seems that the checkpointed modules are not broadcast to multiple GPUs, as I got the following errors:

  File "/home/changmao/efficient_densenet_pytorch/models/densenet.py", line 16, in bn_function
    bottleneck_output = conv(relu(norm(concated_features)))
  File "/home/changmao/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/changmao/anaconda3/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 49, in forward
    self.training or not self.track_running_stats, self.momentum, self.eps)
  File "/home/changmao/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1194, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

I think that the checkpoint feature provides weak support for nn.DataParallel.
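For reference, here is a rough sketch of the kind of checkpointed bottleneck plus nn.DataParallel usage that produces this error. It is simplified, with illustrative module names and sizes; the actual code in densenet.py differs.

import torch
import torch.nn as nn
import torch.utils.checkpoint as cp

class CheckpointedBottleneck(nn.Module):
    # Illustrative BN -> ReLU -> 1x1 conv bottleneck, checkpointed like the
    # bn_function in the traceback above.
    def __init__(self, in_channels, out_channels):
        super(CheckpointedBottleneck, self).__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, *prev_features):
        def bn_function(*inputs):
            concated_features = torch.cat(inputs, 1)
            return self.conv(self.relu(self.norm(concated_features)))
        # checkpoint() frees the intermediates and recomputes bn_function during backward.
        return cp.checkpoint(bn_function, *prev_features)

model = nn.DataParallel(CheckpointedBottleneck(64, 32)).cuda()
x = torch.randn(8, 64, 32, 32, device='cuda', requires_grad=True)
out = model(x)  # with the repository code at the time, the replica on device 1 hit the device-mismatch error above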

gpleiss commented May 8, 2018

Oooh @wandering007 good catch. I'll take a look.

wandering007 commented May 9, 2018

@gpleiss This re-implementation (https://github.com/wandering007/efficient-densenet-pytorch) has good support for nn.DataParallel, which may be helpful.

@ZhengRui

I submitted a pull request for this:
#39

gpleiss commented May 13, 2018

Just merged in #39. @wandering007, can you confirm that this fixes the issue?

wandering007 commented May 14, 2018

@gpleiss Yes, it works fine.
However, there is one thing I noticed earlier and have to mention, though it is outside the scope of this issue. With the checkpointing feature, the autograd computation graph is broken into pieces, and the current nn.DataParallel backward pass roughly does (1) the backward on each GPU asynchronously and (2) inter-GPU communication to collect and accumulate weight gradients for each piece of the graph. That is, if a checkpointed segment contains weights to update, there is an inter-GPU synchronization step to accumulate their gradients, which is time-consuming. Since the current efficient DenseNet contains so many checkpointed nn.BatchNorm2d modules, a lot of time is spent on inter-GPU communication for gradient accumulation. In my test, the backward pass of the efficient DenseNet on multiple GPUs is at least 100x slower than the normal version...
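For anyone who wants to reproduce the comparison, a rough timing sketch follows. The DenseNet constructors in the commented usage are placeholders, not necessarily this repository's exact API.

import time
import torch
import torch.nn as nn

def avg_fwd_bwd_time(model, inp, iters=10):
    # Average time of one forward + backward pass; synchronize so that
    # asynchronous CUDA kernels are actually included in the measurement.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        out = model(inp)
        out.sum().backward()
        model.zero_grad()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

# Hypothetical usage: compare the checkpointed ("efficient") model against the
# standard one under nn.DataParallel on two or more GPUs.
# efficient = nn.DataParallel(DenseNet(efficient=True)).cuda()
# standard  = nn.DataParallel(DenseNet(efficient=False)).cuda()
# x = torch.randn(64, 3, 32, 32, device='cuda')
# print(avg_fwd_bwd_time(efficient, x), avg_fwd_bwd_time(standard, x))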

gpleiss commented May 14, 2018

@wandering007 hmmm that is problematic...

In general, I think the checkpointing-based approach is probably what we should be doing going forward. The original version relied on some low-level calls that are no longer available in PyTorch, and using them now would require some C code, which in my opinion is undesirable for this package.

However, it sounds like the checkpointing-based code is practically unusable in the multi-GPU scenario. It's probably worth opening an issue in the PyTorch repo about this. I'll see if there's a better solution in the meantime.

gpleiss changed the title from "nn.DataPallel fails for checkpoint feature" to "MultiGPU efficient densenets are slow" on May 23, 2018

wandering007 commented May 24, 2018

@gpleiss It may be tough for now... To be frank, I am still in favor of the previous implementation (v0.3.1) via the _EfficientDensenetBottleneck class and the _DummyBackwardHookFn function, which does not touch any C code. I've just made some improvements to it, and it seems very neat and workable for PyTorch v0.4. You can check https://github.com/wandering007/efficient-densenet-pytorch/tree/master/models if you are interested.

yzcjtr commented Oct 21, 2018

Maybe this issue could be made clearer in the README. I followed this implementation in my project but found it doesn't work with DataParallel...

gpleiss commented Oct 21, 2018

@yzcjtr you might be experiencing a different problem. According to my tests, this should work with DataParallel. Can you post the errors that you're seeing?

@theonegis

I just got the Segmentation fault (core dumped) error when running with multiple GPUs. Does anyone know how to solve this problem?

gpleiss commented Oct 23, 2018

@theonegis can you provide more information? What version of PyTorch, what OS, what version of CUDA, what GPUs, etc.? Also, could you open up a new issue for this?

@theonegis

@gpleiss I have opened a new issue: Segmentation fault (core dumped) error for multiple GPUs.
Thanks a lot.

yzcjtr commented Oct 23, 2018

Hi @gpleiss, really sorry for my previous misunderstanding. I'm running into a similar situation as @theonegis. I will provide more information in his new issue. Thanks.

@csrhddlam

The official PyTorch checkpointing is slow on multiple GPUs, as explained by @wandering007. https://github.com/csrhddlam/pytorch-checkpoint solves this issue.
