very small miou #19

Closed
caoyifeng001 opened this issue Nov 8, 2020 · 8 comments

Comments

@caoyifeng001

I trained this network but got a very small mIoU:

Validation per class iou:
car : 4.74%
bicycle : 0.03%
motorcycle : 0.03%
truck : 0.14%
bus : 0.55%
person : 0.03%
bicyclist : 0.05%
motorcyclist : 0.00%
road : 0.44%
parking : 0.85%
sidewalk : 12.27%
other-ground : 0.08%
building : 3.78%
fence : 1.65%
vegetation : 1.42%
trunk : 1.62%
terrain : 4.21%
pole : 0.37%
traffic-sign : 0.17%
Current val miou is 1.707 while the best val miou is 1.707
Current val loss is 3.895
epoch 6 iter 2610, loss: nan
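
As a quick sanity check on the log above, the reported val mIoU is consistent with an unweighted mean of the 19 per-class IoU values. The sketch below only reproduces that arithmetic from the numbers in the log; the function name `mean_iou` is hypothetical and this is not the repository's evaluation code.

```python
# Per-class validation IoU values (percent), copied from the log above.
per_class_iou = {
    "car": 4.74, "bicycle": 0.03, "motorcycle": 0.03, "truck": 0.14,
    "bus": 0.55, "person": 0.03, "bicyclist": 0.05, "motorcyclist": 0.00,
    "road": 0.44, "parking": 0.85, "sidewalk": 12.27, "other-ground": 0.08,
    "building": 3.78, "fence": 1.65, "vegetation": 1.42, "trunk": 1.62,
    "terrain": 4.21, "pole": 0.37, "traffic-sign": 0.17,
}


def mean_iou(ious):
    """Unweighted mean of the per-class IoU values (hypothetical helper)."""
    values = list(ious.values())
    return sum(values) / len(values)


print(f"mIoU over {len(per_class_iou)} classes: {mean_iou(per_class_iou):.3f}%")
```

The result rounds to 1.707, matching the "Current val miou" line, so the low score comes from the per-class predictions themselves and not from how the average is reported.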

@edwardzhou130
Owner

My guess is that something went wrong during training, because your training loss became NaN. Did you get the "4000 exceptions encountered during the last training" message?
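
One way to locate where a run goes bad is to check each loss value for NaN or infinity before it is used. The following is a minimal, framework-free sketch: `first_non_finite` and the sample `history` list are hypothetical, and in a PyTorch loop the values would typically come from `loss.item()`.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
    return None


# Hypothetical loss history; only the trailing NaN mirrors the log in this thread.
history = [3.9, 3.7, 3.8, float("nan")]
idx = first_non_finite(history)
if idx is not None:
    print(f"loss became non-finite at step {idx}; inspect that batch")
```

Stopping at the first non-finite value matters because an optimizer step taken with NaN gradients can turn the weights into NaN, after which every later prediction is meaningless.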

@caoyifeng001
Author

Yes, I get this message.

@caoyifeng001
Author

I also get this message, which I do not understand:

CUDA out of memory. Tried to allocate 802.00 MiB (GPU 1; 10.76 GiB total capacity; 8.28 GiB already allocated; 678.12 MiB free; 8.48 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fbab6a4a536 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x1cf1e (0x7fbabbc44f1e in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1df9e (0x7fbabbc45f9e in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x135 (0x7fba57d91535 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xf7a66b (0x7fba5638966b in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xfc3f57 (0x7fba563d2f57 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x1075389 (0x7fba9290d389 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0x10756c7 (0x7fba9290d6c7 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: + 0xe2165e (0x7fba926b965e in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::native::empty_like(at::Tensor const&, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x9e0 (0x7fba926bff50 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x1134321 (0x7fba929cc321 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: + 0x1187623 (0x7fba92a1f623 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::native::contiguous(at::Tensor const&, c10::MemoryFormat) + 0x3bc (0x7fba926de44c in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x1136678 (0x7fba929ce678 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x1186f9f (0x7fba92a1ef9f in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: + 0xf22a40 (0x7fba56331a40 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #16: at::Tensor at::native::(anonymous namespace)::host_softmax_backward<at::native::(anonymous namespace)::SoftMaxBackwardEpilogue, false>(at::Tensor const&, at::Tensor const&, long, bool) + 0x16f (0x7fba57cd4b3f in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #17: at::native::softmax_backward_cuda(at::Tensor const&, at::Tensor const&, long, at::Tensor const&) + 0x19c (0x7fba57cbed3c in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #18: + 0xf8bea0 (0x7fba5639aea0 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #19: + 0x10c5ad6 (0x7fba9295dad6 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: + 0x2b4dd6c (0x7fba943e5d6c in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #21: + 0x10c5ad6 (0x7fba9295dad6 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::generated::SoftmaxBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1c9 (0x7fba9413db79 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #23: + 0x2d89c05 (0x7fba94621c05 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #24: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7fba9461ef03 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #25: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fba9461fce2 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #26: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fba94618359 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #27: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fbab71864d8 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #28: + 0xd0840 (0x7fbabc778840 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #29: + 0x76ba (0x7fbac0cb86ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #30: clone + 0x6d (0x7fbac09ee4dd in /lib/x86_64-linux-gnu/libc.so.6)
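
The first line of the trace already contains the numbers that explain the failure: the requested block is larger than the free memory on that GPU. The sketch below parses that line and computes the shortfall; the helper name `parse_oom` and the regular expression are assumptions written for this specific message format, not a PyTorch API.

```python
import re

# First line of the error above, without the trailing "(malloc at ...)" part.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 802.00 MiB (GPU 1; 10.76 GiB total "
    "capacity; 8.28 GiB already allocated; 678.12 MiB free; 8.48 GiB reserved "
    "in total by PyTorch)"
)

MIB_PER_UNIT = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Extract requested/total/allocated/free sizes in MiB (hypothetical helper)."""
    pattern = (
        r"Tried to allocate ([\d.]+) (MiB|GiB).*?"
        r"([\d.]+) (MiB|GiB) total capacity; "
        r"([\d.]+) (MiB|GiB) already allocated; "
        r"([\d.]+) (MiB|GiB) free"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("message does not look like a CUDA OOM error")
    g = match.groups()
    keys = ("requested", "total", "allocated", "free")
    # Groups come in (value, unit) pairs; convert each pair to MiB.
    return {k: float(g[2 * i]) * MIB_PER_UNIT[g[2 * i + 1]] for i, k in enumerate(keys)}


sizes = parse_oom(MESSAGE)
shortfall = sizes["requested"] - sizes["free"]
print(f"requested {sizes['requested']:.2f} MiB, free {sizes['free']:.2f} MiB, "
      f"short by {shortfall:.2f} MiB")
```

Here the allocation of 802.00 MiB fails because only 678.12 MiB is free, a gap of roughly 124 MiB. The frames below that line show the allocation was requested during the softmax backward pass.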

@edwardzhou130
Owner

edwardzhou130 commented Nov 8, 2020

It seems you don't have enough GPU memory for training. You can try training the model with a smaller feature map, for example: python train.py --grid_size 320 240 32
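
To get a rough feel for how much a smaller grid helps, one can compare the number of grid cells. In the sketch below, `suggested` comes from the command in this comment, while `assumed_default` (480 x 360 x 32) is an assumption about the default `--grid_size` and is not stated anywhere in this thread; check `train.py` for the real value.

```python
def num_cells(grid_size):
    """Total number of cells in a 3D grid given as (dim0, dim1, dim2)."""
    total = 1
    for dim in grid_size:
        total *= dim
    return total


suggested = (320, 240, 32)        # from the --grid_size values suggested above
assumed_default = (480, 360, 32)  # ASSUMPTION: default grid size, not from this thread

ratio = num_cells(suggested) / num_cells(assumed_default)
print(f"suggested grid: {num_cells(suggested):,} cells")
print(f"assumed default grid: {num_cells(assumed_default):,} cells")
print(f"ratio: {ratio:.2f}")
```

Under that assumption the smaller grid has about 44% of the cells. Memory for grid-shaped feature maps should shrink by roughly the same factor, although the actual saving depends on the network and on the other tensors held during training.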

@caoyifeng001
Author

If I set the batch size to 1, will it affect the accuracy?

@edwardzhou130
Owner

I haven't tried training with batch size 1, but it should give a similar result.

@caoyifeng001
Author

Thanks for your help and for this great work.

@edwardzhou130
Owner

Closing this issue for now.
