
GPU OOM in val stage when training without mask #101

Open
NNsauce opened this issue Jul 10, 2023 · 2 comments

Comments

@NNsauce

NNsauce commented Jul 10, 2023

Hi, when I run "python launch.py --config configs/neus-dtu-wmask.yaml --gpu 1 --train", everything is OK, but when I run "python launch.py --config configs/neus-dtu.yaml --gpu 1 --train", it runs out of CUDA memory.
I am using the latest code, in which you modified the chunk_batch function in models/utils.py to, as you put it, "move all output tensors to cpu before merging". I even tried setting dynamic_ray_sampling=false and reducing max_train_num_rays to 2048, but the CUDA out of memory still happens. Could you please give me some advice? Thanks!
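
For context, my understanding of that chunked-inference change is roughly the following (a minimal sketch of the idea only, not the actual chunk_batch in models/utils.py; the function name and signature here are illustrative):

    import torch

    def chunked_forward(func, chunk_size, rays):
        # Evaluate func on slices of `rays` so that only one chunk of
        # intermediate tensors is resident on the GPU at a time, and move
        # each chunk's output to CPU before merging so the concatenated
        # result does not pin extra GPU memory.
        outputs = []
        for i in range(0, rays.shape[0], chunk_size):
            out_chunk = func(rays[i:i + chunk_size])
            outputs.append(out_chunk.cpu())
        return torch.cat(outputs, dim=0)

Moving each chunk to CPU bounds the peak GPU usage to roughly one chunk's worth of intermediate tensors, which is why the chunk size matters so much here.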

@NNsauce
Author

NNsauce commented Jul 10, 2023

Here is the complete error:

Epoch 0: : 0it [00:00, ?it/s]Update finite_difference_eps to 0.027204705103003882
Epoch 0: : 500it [00:26, 18.89it/s, loss=0.0754, train/inv_s=42.50, train/num_rays=512.0] Traceback (most recent call last): | 0/49 [00:00<?, ?it/s]
File "launch.py", line 125, in
main()
File "launch.py", line 114, in main
trainer.fit(system, datamodule=dm)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
return function(*args, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.on_advance_end()
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 250, in on_advance_end
self._run_validation()
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 308, in _run_validation
self.val_loop.run()
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
output = self._evaluation_step(**kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 359, in validation_step
return self.model(*args, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 110, in forward
return self._forward_module.validation_step(*inputs, **kwargs)
File "/home/fzx/work/instant-nsr-pl/systems/neus.py", line 172, in validation_step
out = self(batch)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in call_impl
return forward_call(input, **kwargs)
File "/home/fzx/work/instant-nsr-pl/systems/neus.py", line 32, in forward
return self.model(batch['rays'])
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in call_impl
return forward_call(*input, **kwargs)
File "/home/fzx/work/instant-nsr-pl/models/neus.py", line 293, in forward
out = chunk_batch(self.forward_, self.config.ray_chunk, True, rays)
File "/home/fzx/work/instant-nsr-pl/models/utils.py", line 24, in chunk_batch
out_chunk = func(*[arg[i:i+chunk_size] if isinstance(arg, torch.Tensor) else arg for arg in args], **kwargs)
File "/home/fzx/work/instant-nsr-pl/models/neus.py", line 230, in forward

sdf, sdf_grad, feature, sdf_laplace = self.geometry(positions, with_grad=True, with_feature=True, with_laplace=True)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fzx/work/instant-nsr-pl/models/geometry.py", line 195, in forward
points_d_sdf = self.network(self.encoding(points_d.view(-1, 3)))[...,0].view(*points.shape[:-1], 6).float()
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
return func(*args, **kwargs)
File "/home/fzx/work/instant-nsr-pl/models/network_utils.py", line 110, in forward
x = self.layers(x.float())
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fzx/mambaforge/envs/sdf/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 838, in forward
return F.softplus(input, self.beta, self.threshold)
RuntimeError: CUDA out of memory. Tried to allocate 1.19 GiB (GPU 0; 7.80 GiB total capacity; 3.63 GiB already allocated; 1.17 GiB free; 4.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: : 500it [00:29, 16.70it/s, loss=0.0754, train/inv_s=42.50, train/num_rays=512.0]

@NNsauce
Author

NNsauce commented Jul 12, 2023

I reduced chunk_size from 2048 to 1024, and then it works. But why does it need so much more GPU memory when training without a mask?
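
For anyone else hitting this: the value I changed is the ray chunk size used for the chunked forward pass in validation (it appears as self.config.ray_chunk in the traceback above). In my copy of configs/neus-dtu.yaml the edit looks roughly like this; the exact location of the key in the file may differ:

    model:
      ray_chunk: 1024  # was 2048; halving it was enough to fit on my ~8 GiB GPU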
