CUDA out of memory in training ScanNet #27

Closed
y5wang opened this issue Mar 11, 2020 · 3 comments

Comments


y5wang commented Mar 11, 2020

I'm trying to train ScanNet scene segmentation, but I run into a CUDA out-of-memory error. My environment:

  • Ubuntu 18.04
  • Python 3.7
  • PyTorch 1.4
  • CUDA toolkit 10.1
  • Tesla K80 with 12 GB of memory per GPU (also tried on a GeForce RTX 2070 with 8 GB)

The training was started by:
export BATCH_SIZE=8; ./scripts/train_scannet.sh 2 -default "--scannet_path ./data/scannet/train"
(I also tried BATCH_SIZE=32 and BATCH_SIZE=16; both failed.)

Here is the error dump:

...

microway-gpu-ubuntu 03/11 17:50:03 ===> Start training
/home/yang/.conda/envs/st-segmentation/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:224: UserWarning: To get the last learning rate computed by the scheduler, please use 'get_last_lr()'.
  warnings.warn("To get the last learning rate computed by the scheduler, "
microway-gpu-ubuntu 03/11 17:50:32 ===> Epoch[1](1/151): Loss 3.1041    LR: 1.000e-01   Score 4.961     Data time: 5.2196, Total iter time: 29.8803
Traceback (most recent call last):
  File "/home/yang/.conda/envs/st-segmentation/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/yang/.conda/envs/st-segmentation/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/yang/projects/SpatioTemporalSegmentation/main.py", line 156, in <module>
    main()
  File "/home/yang/projects/SpatioTemporalSegmentation/main.py", line 149, in main
    train(model, train_data_loader, val_data_loader, config)
  File "/home/yang/projects/SpatioTemporalSegmentation/lib/train.py", line 91, in train
    soutput = model(*inputs)
  File "/home/yang/.conda/envs/st-segmentation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/projects/SpatioTemporalSegmentation/models/res16unet.py", line 252, in forward
    out = self.block8(out)
  File "/home/yang/.conda/envs/st-segmentation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/.conda/envs/st-segmentation/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/yang/.conda/envs/st-segmentation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/projects/SpatioTemporalSegmentation/models/modules/resnet_block.py", line 47, in forward
    out = self.norm2(out)
  File "/home/yang/.conda/envs/st-segmentation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/.conda/envs/st-segmentation/lib/python3.7/site-packages/MinkowskiEngine/MinkowskiNormalization.py", line 58, in forward
    output = self.bn(input.F)
  File "/home/yang/.conda/envs/st-segmentation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yang/.conda/envs/st-segmentation/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
    exponential_average_factor, self.eps)
  File "/home/yang/.conda/envs/st-segmentation/lib/python3.7/site-packages/torch/nn/functional.py", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 364.00 MiB (GPU 0; 11.17 GiB total capacity; 9.93 GiB already allocated; 247.81 MiB free; 10.26 GiB reserved in total by PyTorch)

Any help is appreciated.

-- Yang
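
The numbers in the final `RuntimeError` line can be read off mechanically. The sketch below parses that one message and reports how far the failed request exceeded the free memory, and how much memory PyTorch had reserved but not allocated to tensors. The function name `parse_oom_message` is hypothetical, and the regular expressions assume the message wording shown above for PyTorch 1.4; other versions may phrase it differently.

```python
import re

# Last line of the traceback above, copied verbatim.
OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 364.00 MiB "
    "(GPU 0; 11.17 GiB total capacity; 9.93 GiB already allocated; "
    "247.81 MiB free; 10.26 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"KiB": 1.0 / 1024, "MiB": 1.0, "GiB": 1024.0}
SIZE = r"([\d.]+) (KiB|MiB|GiB)"


def parse_oom_message(message):
    """Return the sizes found in the message, converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate " + SIZE,
        "total": SIZE + r" total capacity",
        "allocated": SIZE + r" already allocated",
        "free": SIZE + r" free",
        "reserved": SIZE + r" reserved",
    }
    sizes = {}
    for label, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[label] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom_message(OOM_MESSAGE)

# How much more memory the failed allocation needed than was free.
shortfall_mib = sizes["requested"] - sizes["free"]

# Memory held by PyTorch's allocator but not currently backing tensors.
reserved_unallocated_mib = sizes["reserved"] - sizes["allocated"]

print(f"shortfall: {shortfall_mib:.2f} MiB")
print(f"reserved but unallocated: {reserved_unallocated_mib:.2f} MiB")
```

On this message the request overshoots the free memory by roughly 116 MiB, so the run is close to the card's limit rather than far beyond it.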

@fengziyue
Contributor

I'm training on the S3DIS dataset, and on a 2080 Ti (11 GB) I also get the "CUDA out of memory" error. It seems that changing the batch size doesn't reduce GPU memory usage.

But it runs well on a Tesla P100 (12 GB).

@chrischoy

Author

y5wang commented Mar 11, 2020

I guess the Tesla K80 doesn't have enough memory. nvidia-smi reports 11441 MiB per GPU:

(st-segmentation) yang@microway-gpu-ubuntu:~/projects/SpatioTemporalSegmentation$ nvidia-smi
Wed Mar 11 15:08:31 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:06:00.0 Off |                    0 |
| N/A   34C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:07:00.0 Off |                    0 |
| N/A   24C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:0A:00.0 Off |                    0 |
| N/A   30C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:0B:00.0 Off |                    0 |
| N/A   24C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:10:00.0 Off |                    0 |
| N/A   32C    P8    25W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:11:00.0 Off |                    0 |
| N/A   22C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:14:00.0 Off |                    0 |
| N/A   32C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:15:00.0 Off |                    0 |
| N/A   22C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
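
For reference, the 11441 MiB shown by nvidia-smi, the 11.17 GiB total capacity in the PyTorch error, and the nominal 12 GB per GPU appear to be the same quantity in different units. A quick check, using only figures quoted in this thread:

```python
BYTES_PER_MIB = 1024 ** 2
MIB_PER_GIB = 1024

nvidia_smi_total_mib = 11441   # "0MiB / 11441MiB" in the nvidia-smi table
pytorch_total_gib = 11.17      # "11.17 GiB total capacity" in the OOM message

# Binary units: MiB -> GiB.
total_gib = nvidia_smi_total_mib / MIB_PER_GIB

# Decimal units: MiB -> bytes -> GB (10**9 bytes).
total_gb = nvidia_smi_total_mib * BYTES_PER_MIB / 10 ** 9

print(f"{nvidia_smi_total_mib} MiB = {total_gib:.2f} GiB = {total_gb:.2f} GB")
```

So the card is not reporting less memory than advertised; the figures differ only in whether a gigabyte is counted as 2^30 or 10^9 bytes.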

@chrischoy
Owner

This is not a bug. The OOM error means the job needs more memory than the GPU has.
Please lower the batch size so that training fits in your environment.
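
If a smaller batch size is a concern for optimization, one common workaround is gradient accumulation: run several small micro-batches and sum their scaled gradients before each optimizer step. The sketch below is a plain-Python illustration on a hypothetical one-parameter least-squares problem, not code from this repository; the data and the function names are made up for the example. It only shows that, for a mean loss and equally sized micro-batches, the accumulated gradient matches the full-batch gradient.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Sum per-micro-batch gradients, each scaled by 1 / number of micro-batches."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("batch must split evenly into micro-batches")
    num_micro_batches = len(xs) // micro_batch_size
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        stop = start + micro_batch_size
        grad = mse_gradient(w, xs[start:stop], ys[start:stop])
        total += grad / num_micro_batches
    return total


# Hypothetical toy data: one "full" batch of 8 samples.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9]
w = 0.3

full_batch_grad = mse_gradient(w, xs, ys)
accumulated_grad = accumulated_gradient(w, xs, ys, micro_batch_size=2)

print(f"full batch:  {full_batch_grad:.6f}")
print(f"accumulated: {accumulated_grad:.6f}")
```

Two caveats apply to the network in this issue. Batch normalization layers (where the traceback above failed) compute statistics per micro-batch, so accumulation is not exactly equivalent to a larger batch. And with sparse point-cloud inputs, memory use likely depends on the number of points in the batch as well as the batch size, which may be why lowering the batch size alone did not always help.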
