
Taking up memory on the primary GPU loading from checkpoint #36

Closed
pc4653 opened this issue Jan 28, 2020 · 2 comments

Comments


pc4653 commented Jan 28, 2020

Hi,
I'm wondering what is taking up memory on the main GPU when I am resuming training from a checkpoint. As a result i cannot train with a larger patch size, which led to memory limitation.
Screenshot from 2020-01-27 22-50-26

I think this is caused by the distributed training setup. Is there any way to either avoid the memory cost or distribute it evenly across the other GPUs?
Many thanks!

@rosinality
Owner

I think it is related to #31. You can apply the PR, or try `torch.load(args.ckpt, map_location=lambda storage, loc: storage)`.
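
For context, here is a minimal sketch of the suggested fix, assuming a typical PyTorch training script where `args.ckpt` holds the checkpoint path; the checkpoint dict key `"model"` below is illustrative, not necessarily what this repo uses:

```python
import torch

# By default, torch.load restores tensors to the device they were saved from
# (often cuda:0), so every process allocates memory on the primary GPU.
# Mapping storages to CPU first avoids that allocation.
ckpt = torch.load(args.ckpt, map_location=lambda storage, loc: storage)

# Equivalent shorthand:
# ckpt = torch.load(args.ckpt, map_location="cpu")

# load_state_dict then copies the weights onto whatever device the model
# already lives on, so no extra memory lands on GPU 0.
model.load_state_dict(ckpt["model"])  # "model" key is an assumption
```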


pc4653 commented Jan 28, 2020

You are correct, thanks!

pc4653 closed this as completed Jan 28, 2020