The error when training #3

Open · chengshuai opened this issue Nov 14, 2017 · 6 comments
@chengshuai

@unsky, nice work!

When training, the following error occurs; the details are below:
```
#########################################train#######################################
./data/VOCdevkit2007/VOC2007/JPEGImages/2009_002123.jpg
./data/VOCdevkit2007/VOC2007/JPEGImages/000783.jpg
[08:24:35] /home/chengshuai/mx-maskrcnn-master1/incubator-mxnet/dmlc-core/include/dmlc/logging.h:308: [08:24:35] /home/chengshuai/mx-maskrcnn-master1/incubator-mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:58: too large launch parameter: Softmax[89847,1], [256,1,1]

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f0b7ad7f70c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN7mshadow4cuda16CheckLaunchParamE4dim3S1_PKc+0x165) [0x7f0b7d3e83f5]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZN7mshadow4cuda7SoftmaxIfEEvRKNS_6TensorINS_3gpuELi2ET_EES7+0xfa) [0x7f0b7e3ec24a]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op19SoftmaxActivationOpIN7mshadow3gpuEE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0x20b) [0x7f0b7e4fe57b]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op13OperatorState7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS6_EERKS5_INS_9OpReqTypeESaISB_EESA+0x354) [0x7f0b7d04a524]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZZN5mxnet10imperative12PushOperatorERKNS_10OpStatePtrEPKN4nnvm2OpERKNS4_9NodeAttrsERKNS_7ContextERKSt6vectorIPNS_6engine3VarESaISH_EESL_RKSE_INS_8ResourceESaISM_EERKSE_IPNS_7NDArrayESaISS_EESW_RKSE_IjSaIjEERKSE_INS_9OpReqTypeESaIS11_EENS_12DispatchModeEENKUlNS_10RunContextENSF_18CallbackOnCompleteEE0_clES17_S18+0x2a0) [0x7f0b7cec2950]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x9d) [0x7f0b7ce3fc6d]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0xf3) [0x7f0b7ce43cb3]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5+0x56) [0x7f0b7ce43e96]
[bt] (9) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7f0b7ce410cb]

[08:24:35] /home/chengshuai/mx-maskrcnn-master1/incubator-mxnet/dmlc-core/include/dmlc/logging.h:308: [08:24:35] src/engine/./threaded_engine.h:370: [08:24:35] /home/chengshuai/mx-maskrcnn-master1/incubator-mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:58: too large launch parameter: Softmax[89847,1], [256,1,1]

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f0b7ad7f70c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN7mshadow4cuda16CheckLaunchParamE4dim3S1_PKc+0x165) [0x7f0b7d3e83f5]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZN7mshadow4cuda7SoftmaxIfEEvRKNS_6TensorINS_3gpuELi2ET_EES7+0xfa) [0x7f0b7e3ec24a]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op19SoftmaxActivationOpIN7mshadow3gpuEE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0x20b) [0x7f0b7e4fe57b]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op13OperatorState7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS6_EERKS5_INS_9OpReqTypeESaISB_EESA+0x354) [0x7f0b7d04a524]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZZN5mxnet10imperative12PushOperatorERKNS_10OpStatePtrEPKN4nnvm2OpERKNS4_9NodeAttrsERKNS_7ContextERKSt6vectorIPNS_6engine3VarESaISH_EESL_RKSE_INS_8ResourceESaISM_EERKSE_IPNS_7NDArrayESaISS_EESW_RKSE_IjSaIjEERKSE_INS_9OpReqTypeESaIS11_EENS_12DispatchModeEENKUlNS_10RunContextENSF_18CallbackOnCompleteEE0_clES17_S18+0x2a0) [0x7f0b7cec2950]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x9d) [0x7f0b7ce3fc6d]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0xf3) [0x7f0b7ce43cb3]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5+0x56) [0x7f0b7ce43e96]
[bt] (9) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7f0b7ce410cb]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f0b7ad7f70c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x3a0) [0x7f0b7ce3ff70]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0xf3) [0x7f0b7ce43cb3]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5+0x56) [0x7f0b7ce43e96]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7f0b7ce410cb]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7f0b9d5c7a60]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7f0ba193a182]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f0ba166747d]

terminate called after throwing an instance of 'dmlc::Error'
what(): [08:24:35] src/engine/./threaded_engine.h:370: [08:24:35] /home/chengshuai/mx-maskrcnn-master1/incubator-mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:58: too large launch parameter: Softmax[89847,1], [256,1,1]

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f0b7ad7f70c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN7mshadow4cuda16CheckLaunchParamE4dim3S1_PKc+0x165) [0x7f0b7d3e83f5]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZN7mshadow4cuda7SoftmaxIfEEvRKNS_6TensorINS_3gpuELi2ET_EES7+0xfa) [0x7f0b7e3ec24a]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op19SoftmaxActivationOpIN7mshadow3gpuEE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0x20b) [0x7f0b7e4fe57b]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op13OperatorState7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS6_EERKS5_INS_9OpReqTypeESaISB_EESA+0x354) [0x7f0b7d04a524]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZZN5mxnet10imperative12PushOperatorERKNS_10OpStatePtrEPKN4nnvm2OpERKNS4_9NodeAttrsERKNS_7ContextERKSt6vectorIPNS_6engine3VarESaISH_EESL_RKSE_INS_8ResourceESaISM_EERKSE_IPNS_7NDArrayESaISS_EESW_RKSE_IjSaIjEERKSE_INS_9OpReqTypeESaIS11_EENS_12DispatchModeEENKUlNS_10RunContextENSF_18CallbackOnCompleteEE0_clES17_S18+0x2a0) [0x7f0b7cec2950]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x9d) [0x7f0b7ce3fc6d]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0xf3) [0x7f0b7ce43cb3]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5+0x56) [0x7f0b7ce43e96]
[bt] (9) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7f0b7ce410cb]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f0b7ad7f70c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x3a0) [0x7f0b7ce3ff70]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine23ThreadedEnginePerDevice9GPUWorkerILN4dmlc19ConcurrentQueueTypeE0EEEvNS_7ContextEbPNS1_17ThreadWorkerBlockIXT_EEESt10shared_ptrINS0_10ThreadPool11SimpleEventEE+0xf3) [0x7f0b7ce43cb3]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataS5+0x56) [0x7f0b7ce43e96]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.12.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x3b) [0x7f0b7ce410cb]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7f0b9d5c7a60]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7f0ba193a182]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f0ba166747d]
```
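
If I read the failing check correctly, the Softmax kernel is launched with one CUDA block per row, and mshadow validates the launch configuration against a hard grid-size limit. A minimal sketch of the arithmetic (assuming mshadow's historical `kMaxGridNum = 65535` in `cuda/tensor_gpu-inl.cuh`; the other numbers come from the log above):

```python
# Hypothetical reconstruction of the CheckLaunchParam arithmetic that fails here.
MAX_GRID_NUM = 65535        # assumed value of mshadow's kMaxGridNum
rows = 89847                # grid dimension from "Softmax[89847,1]" in the log
threads_per_block = 256     # block dimension from "[256,1,1]" in the log

# One block per softmax row: 89847 blocks exceeds the 65535-block grid limit,
# which is what "too large launch parameter" reports.
print(rows > MAX_GRID_NUM)  # True
```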

What is the problem? Is it caused by the MXNet version?

Thanks!

unsky (Owner) commented Nov 14, 2017

@chengshuai Is this RetinaNet, or mx-maskrcnn-master1?

@chengshuai (Author)

@unsky

Thank you for your reply! I am training RetinaNet.
During training, the link points to mx-maskrcnn-master1/incubator-mxnet (mxnet-0.12.1).
Should I use the MXNet version from focal-loss (https://github.com/unsky/focal-loss)?
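
A quick way to confirm which MXNet build actually gets imported at training time (a minimal sketch; run it in the same environment that launches training):

```python
# Print the version and install location of the MXNet copy that Python imports,
# to confirm what mx-maskrcnn-master1/incubator-mxnet resolves to.
import mxnet
print(mxnet.__version__)  # e.g. 0.12.1, matching the egg path in the trace above
print(mxnet.__file__)     # the installation actually in use
```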

unsky (Owner) commented Nov 14, 2017

@chengshuai Yes.

@feiyilicare

@chengshuai @unsky I have the same problem, but my MXNet version is 0.9.5 and I still get the same error message. Have you solved this problem?

@chengshuai (Author)

@unsky

I used the MXNet version from focal-loss (https://github.com/unsky/focal-loss), but I get the same error. That MXNet version is 0.11.0. (Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.0-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ffbb3b722fc])

@feiyilicare I cannot find MXNet 0.9.5. Could you give me a link to download it?
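
For reference, a sketch (assuming setuptools' `pkg_resources` is available, as it usually is alongside pip) that lists every installed MXNet distribution, to rule out one copy shadowing another:

```python
# List all installed MXNet packages and their locations; useful when several
# versions (0.11.0, 0.12.1, ...) are installed side by side.
import pkg_resources
for dist in pkg_resources.working_set:
    if "mxnet" in dist.project_name.lower():
        print(dist.project_name, dist.version, dist.location)
```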


wyz2016 commented Jul 4, 2018

@chengshuai
Have you solved this problem?
