I'm trying to train a small net on my own dataset on an AWS P2 machine with ~12 GB of GPU memory.
I'm getting the error below. What can I do about it? I assume I should reduce the batch size, but how do I do that? (Full log follows; see the config sketch after it.)
[05/16 14:46:32 d2.data.build]: Using training sampler TrainingSampler
[05/16 14:46:32 fvcore.common.checkpoint]: Loading checkpoint from https://www.dropbox.com/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1
[05/16 14:46:32 fvcore.common.file_io]: URL https://www.dropbox.com/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1 cached in /home/ubuntu/.torch/fvcore_cache/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1
[05/16 14:46:33 fvcore.common.checkpoint]: Some model parameters or buffers are not in the checkpoint:
backbone.fpn_output5.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_predictor.bbox_pred.{weight, bias}
roi_heads.mask_head.mask_fcn3.{weight, bias}
roi_heads.mask_head.predictor.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output3.{weight, bias}
proposal_generator.anchor_generator.cell_anchors.{0, 2, 3, 4, 1}
proposal_generator.rpn_head.conv.{weight, bias}
roi_heads.box_predictor.cls_score.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
proposal_generator.rpn_head.anchor_deltas.{weight, bias}
roi_heads.mask_head.mask_fcn1.{weight, bias}
roi_heads.mask_head.mask_fcn2.{weight, bias}
backbone.fpn_output2.{bias, weight}
roi_heads.mask_head.mask_fcn4.{bias, weight}
backbone.fpn_lateral2.{bias, weight}
backbone.fpn_lateral4.{weight, bias}
backbone.fpn_lateral5.{weight, bias}
backbone.fpn_lateral3.{weight, bias}
[05/16 14:46:33 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
backbone.bottom_up.stem.stem_1/norm.num_batches_tracked
backbone.bottom_up.stem.stem_2/norm.num_batches_tracked
backbone.bottom_up.stem.stem_3/norm.num_batches_tracked
backbone.bottom_up.stage2.OSA2_1.layers.0.OSA2_1_0/norm.num_batches_tracked
backbone.bottom_up.stage2.OSA2_1.layers.1.OSA2_1_1/norm.num_batches_tracked
backbone.bottom_up.stage2.OSA2_1.layers.2.OSA2_1_2/norm.num_batches_tracked
backbone.bottom_up.stage2.OSA2_1.concat.OSA2_1_concat/norm.num_batches_tracked
backbone.bottom_up.stage3.OSA3_1.layers.0.OSA3_1_0/norm.num_batches_tracked
backbone.bottom_up.stage3.OSA3_1.layers.1.OSA3_1_1/norm.num_batches_tracked
backbone.bottom_up.stage3.OSA3_1.layers.2.OSA3_1_2/norm.num_batches_tracked
backbone.bottom_up.stage3.OSA3_1.concat.OSA3_1_concat/norm.num_batches_tracked
backbone.bottom_up.stage4.OSA4_1.layers.0.OSA4_1_0/norm.num_batches_tracked
backbone.bottom_up.stage4.OSA4_1.layers.1.OSA4_1_1/norm.num_batches_tracked
backbone.bottom_up.stage4.OSA4_1.layers.2.OSA4_1_2/norm.num_batches_tracked
backbone.bottom_up.stage4.OSA4_1.concat.OSA4_1_concat/norm.num_batches_tracked
backbone.bottom_up.stage5.OSA5_1.layers.0.OSA5_1_0/norm.num_batches_tracked
backbone.bottom_up.stage5.OSA5_1.layers.1.OSA5_1_1/norm.num_batches_tracked
backbone.bottom_up.stage5.OSA5_1.layers.2.OSA5_1_2/norm.num_batches_tracked
backbone.bottom_up.stage5.OSA5_1.concat.OSA5_1_concat/norm.num_batches_tracked
[05/16 14:46:33 d2.engine.train_loop]: Starting training from iteration 0
ERROR [05/16 14:46:38 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 215, in run_step
    loss_dict = self.model(data)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 121, in forward
    features = self.backbone(images.tensor)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/backbone/fpn.py", line 123, in forward
    bottom_up_features = self.bottom_up(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 367, in forward
    x = getattr(self, name)(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 234, in forward
    xt = self.concat(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/layers/batch_norm.py", line 55, in forward
    return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 1.03 GiB (GPU 0; 11.17 GiB total capacity; 8.48 GiB already allocated; 845.31 MiB free; 10.03 GiB reserved in total by PyTorch)
[05/16 14:46:38 d2.engine.hooks]: Total training time: 0:00:05 (0:00:00 on hooks)
Traceback (most recent call last):
  File "train_net_docs.py", line 115, in <module>
    dist_url=args.dist_url,
  File "/home/ubuntu/detectron2/detectron2/engine/launch.py", line 57, in launch
    main_func(*args)
  File "train_net_docs.py", line 93, in main
    trainer.resume_or_load(resume=args.resume)
  File "/home/ubuntu/detectron2/detectron2/engine/defaults.py", line 401, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 215, in run_step
    loss_dict = self.model(data)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 121, in forward
    features = self.backbone(images.tensor)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/backbone/fpn.py", line 123, in forward
    bottom_up_features = self.bottom_up(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 367, in forward
    x = getattr(self, name)(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 234, in forward
    xt = self.concat(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/layers/batch_norm.py", line 55, in forward
    return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 1.03 GiB (GPU 0; 11.17 GiB total capacity; 8.48 GiB already allocated; 845.31 MiB free; 10.03 GiB reserved in total by PyTorch)
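For reference, detectron2 reads the batch size and input resolution from the config, so that is where to cut memory. The trace runs out of memory inside the backbone forward pass (fpn.py -> vovnet.py), so the images-per-batch and training image size are the most relevant knobs. Below is a minimal sketch, assuming train_net_docs.py builds its cfg with detectron2's get_cfg(); the yaml path is a placeholder for whatever config file the script actually loads:

```python
# Minimal sketch of the usual OOM levers in a detectron2 config.
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("configs/your_vovnet_config.yaml")  # placeholder path

# Total images per batch across all GPUs; many baselines default to 16.
# On a single ~12 GiB GPU this is the first thing to cut.
cfg.SOLVER.IMS_PER_BATCH = 2

# When shrinking the batch, scale the learning rate down roughly linearly
# (e.g. 0.02 at 16 images per batch -> 0.0025 at 2).
cfg.SOLVER.BASE_LR = 0.0025

# A smaller training resolution also shrinks the backbone activations
# where this run actually failed.
cfg.INPUT.MIN_SIZE_TRAIN = (640,)

# Fewer sampled RoIs per image trims the box/mask head footprint.
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 256  # default is 512
```

If train_net_docs.py forwards trailing command-line arguments to cfg.merge_from_list, as detectron2's reference train_net.py does, the same overrides can be passed without editing the script, e.g. `python train_net_docs.py --config-file ... SOLVER.IMS_PER_BATCH 2`.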