cudaMemcpyAsync throws exception in GPUDataTransfer #19076
Comments
Since the unit tests are failing, it does not seem related to your model. Maybe the compilation options you are using do not work for your machine? Can you share your machine specifications and your command line?
@xadupre Yeah, sure. I am on Linux Ubuntu 22.04. My build command is as follows:

./build.sh --config Release --use_cuda --cudnn_home /usr/local/cuda --cuda_home /usr/local/cuda --build_shared_lib --skip_tests

My first build was with CUDA 12.3, the latest NVIDIA driver for the setup (I think 545), and onnxruntime 1.17 built straight from source. The error was as shown above. Then I thought maybe the whole setup was just too recent, so I rolled back to NVIDIA driver 525, CUDA 11.8, and onnxruntime 1.16, but unfortunately the result is the same. Not really sure where to start debugging. Any hints?
cuda-memcheck comes to mind; see if any race surfaces.
I would try to compile in Debug mode to see if the crash still appears. If it does, it should be easier to find the exact line causing the crash, or to see whether an error is detected before the crash.
That... sounds completely obvious, haha. Here is a stack trace of the debug build:

__pthread_kill_implementation 0x00007ffff4e969fc
__pthread_kill_internal 0x00007ffff4e969fc
__GI___pthread_kill 0x00007ffff4e969fc
__GI_raise 0x00007ffff4e42476
__GI_abort 0x00007ffff4e287f3
__assert_fail_base 0x00007ffff4e2871b
__GI___assert_fail 0x00007ffff4e39e96
onnxruntime::PlannerImpl::DecrementUseCount allocation_planner.cc:237
onnxruntime::PlannerImpl::ComputeSingleStreamReusePlan allocation_planner.cc:1443
onnxruntime::PlannerImpl::ComputeReusePlan allocation_planner.cc:1323
onnxruntime::PlannerImpl::CreatePlan allocation_planner.cc:2141
onnxruntime::SequentialPlanner::CreatePlan allocation_planner.cc:2198
onnxruntime::SessionState::FinalizeSessionStateImpl session_state.cc:1403
onnxruntime::SessionState::FinalizeSessionState session_state.cc:1186
onnxruntime::InferenceSession::Initialize inference_session.cc:1714
InitializeSession onnxruntime_c_api.cc:764
OrtApis::CreateSession onnxruntime_c_api.cc:780
Ort::Session::Session onnxruntime_cxx_inline.h:1020
__gnu_cxx::new_allocator::construct<…> new_allocator.h:162
std::allocator_traits::construct<…> alloc_traits.h:516
std::_Sp_counted_ptr_inplace::_Sp_counted_ptr_inplace<…> shared_ptr_base.h:519
std::__shared_count::__shared_count<…> shared_ptr_base.h:650
std::__shared_ptr::__shared_ptr<…> shared_ptr_base.h:1342
std::shared_ptr::shared_ptr<…> shared_ptr.h:409
std::allocate_shared<…> shared_ptr.h:863
std::make_shared<…> shared_ptr.h:879
spear::ort::Inference::loadOnnxNetwork Inference.h:128
spear::ort::Inference::Inference Inference.h:96
main superpoint_lightglue_main.cpp:37
__libc_start_call_main 0x00007ffff4e29d90
__libc_start_main_impl 0x00007ffff4e29e40
_start 0x0000555555558d15

So apparently there is already an issue during session creation.
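Since the trace shows the assertion firing inside InferenceSession::Initialize (via SequentialPlanner::CreatePlan), session creation can be isolated from inference to confirm that. A minimal sketch using the ONNX Runtime C++ API, assuming the model file sits in the working directory; the verbose logging level is only there to surface planner output before the abort:

```cpp
#include <iostream>
#include <onnxruntime_cxx_api.h>

int main() {
    // Assumed local path to the model from the issue report.
    const char* model_path = "superpoint_lightglue_end2end_fused.onnx";
    try {
        Ort::Env env(ORT_LOGGING_LEVEL_VERBOSE, "isolate-init");
        Ort::SessionOptions opts;
        OrtCUDAProviderOptions cuda_options{};  // device 0, default settings
        opts.AppendExecutionProvider_CUDA(cuda_options);
        // According to the stack trace, the allocation-planner assertion
        // fires here, inside InferenceSession::Initialize, before any Run().
        Ort::Session session(env, model_path, opts);
        std::cout << "session initialized OK\n";
    } catch (const Ort::Exception& e) {
        std::cerr << "ORT error " << e.GetOrtErrorCode()
                  << ": " << e.what() << "\n";
        return 1;
    }
    return 0;
}
```

Note that a failed `assert` in a Debug build aborts the process directly, so the catch block only helps for errors that ONNX Runtime reports as exceptions rather than assertion failures.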
Describe the issue
Hey all,
I have an issue running the following model:
https://github.com/fabio-sim/LightGlue-ONNX
More specifically, this ONNX file:
https://github.com/fabio-sim/LightGlue-ONNX/releases/download/v1.0.0/superpoint_lightglue_end2end_fused.onnx
Verbose log:
verbose_log.txt
CUDA throws an exception while asynchronously copying data. According to the verbose log, it always seems to happen at the kernel with idx 2478.
The stack trace looks as follows:
The test report also shows some errors. Not sure if it is related:
To reproduce
Load the model into onnxruntime, set two images as input, and run the inference in C++.
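The reproduction steps above can be sketched as follows with the ONNX Runtime C++ API. Input names and shapes are queried from the model rather than hard-coded; the [1, 1, 512, 512] grayscale shape and the dummy pixel data are assumptions for illustration only:

```cpp
#include <iostream>
#include <vector>
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "repro");
    Ort::SessionOptions opts;
    OrtCUDAProviderOptions cuda_options{};  // device 0, default settings
    opts.AppendExecutionProvider_CUDA(cuda_options);
    Ort::Session session(env, "superpoint_lightglue_end2end_fused.onnx", opts);

    // Query input names from the model instead of hard-coding them.
    Ort::AllocatorWithDefaultOptions alloc;
    Ort::MemoryInfo mem =
        Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::vector<Ort::AllocatedStringPtr> in_holders, out_holders;
    std::vector<const char*> input_names, output_names;
    std::vector<Ort::Value> inputs;

    // Dummy grayscale images; [1, 1, 512, 512] is an assumed shape.
    std::vector<int64_t> shape = {1, 1, 512, 512};
    std::vector<std::vector<float>> buffers(
        session.GetInputCount(), std::vector<float>(1 * 1 * 512 * 512, 0.5f));

    for (size_t i = 0; i < session.GetInputCount(); ++i) {
        in_holders.push_back(session.GetInputNameAllocated(i, alloc));
        input_names.push_back(in_holders.back().get());
        inputs.push_back(Ort::Value::CreateTensor<float>(
            mem, buffers[i].data(), buffers[i].size(),
            shape.data(), shape.size()));
    }
    for (size_t i = 0; i < session.GetOutputCount(); ++i) {
        out_holders.push_back(session.GetOutputNameAllocated(i, alloc));
        output_names.push_back(out_holders.back().get());
    }

    auto outputs = session.Run(Ort::RunOptions{nullptr},
                               input_names.data(), inputs.data(), inputs.size(),
                               output_names.data(), output_names.size());
    std::cout << "got " << outputs.size() << " outputs\n";
    return 0;
}
```

With the debug stack trace from this thread, the crash would occur in the `Ort::Session` constructor, before `Run()` is ever reached.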
Urgency
No
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
16.3
ONNX Runtime API
C++
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.8