
Taking up memory on the primary GPU loading from checkpoint #36

Closed
pc4653 opened this issue Jan 28, 2020 · 2 comments

Comments


pc4653 commented Jan 28, 2020

Hi,
I'm wondering what is taking up memory on the main GPU when I am resuming training from a checkpoint. As a result i cannot train with a larger patch size, which led to memory limitation.
Screenshot from 2020-01-27 22-50-26

I think this is caused by the distributed training setup. Is there any way to either avoid the memory cost or distribute it evenly across the other GPUs?
Many thanks!

@rosinality
Owner

I think it is related to #31. You can apply the PR, or try `torch.load(args.ckpt, map_location=lambda storage, loc: storage)`.
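
For context, here is a minimal sketch of the suggested fix, assuming a typical PyTorch training script where `args.ckpt` holds the checkpoint path; the checkpoint dict key `"model"` below is illustrative, not necessarily what this repo uses:

```python
import torch

# By default, torch.load restores tensors to the device they were saved from
# (often cuda:0), so every process allocates memory on the primary GPU.
# Mapping storages to CPU first avoids that allocation.
ckpt = torch.load(args.ckpt, map_location=lambda storage, loc: storage)

# Equivalent shorthand:
# ckpt = torch.load(args.ckpt, map_location="cpu")

# load_state_dict then copies the weights onto whatever device the model
# already lives on, so no extra memory lands on GPU 0.
model.load_state_dict(ckpt["model"])  # "model" key is an assumption
```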


pc4653 commented Jan 28, 2020

You are correct, thanks!

pc4653 closed this as completed Jan 28, 2020