ORT 1.18 crashes on exit after using Cuda EP to run inference on a specific model #21207

Closed
frenetj opened this issue Jun 28, 2024 · 2 comments
Labels
ep:CUDA (issues related to the CUDA execution provider)

Comments

@frenetj

frenetj commented Jun 28, 2024

Describe the issue

We see a crash on exit (during static deinitialization) after running inference on a specific model with the CUDA EP:

#0  0x00007f7a5634537f in raise () at /lib64/libc.so.6
#1  0x00007f7a5632fdb5 in abort () at /lib64/libc.so.6
#2  0x00007f7a563884e7 in __libc_message () at /lib64/libc.so.6
#3  0x00007f7a5638f5ec in .annobin_top_check.start () at /lib64/libc.so.6
#4  0x00007f7a5638fe2c in unlink_chunk.isra () at /lib64/libc.so.6
#5  0x00007f7a5638ff97 in malloc_consolidate () at /lib64/libc.so.6
#6  0x00007f7a56392368 in _int_malloc () at /lib64/libc.so.6
#7  0x00007f7a563948d6 in calloc () at /lib64/libc.so.6
#8  0x00007f7aac588224 in calloc(size_t, size_t) (nelem=8, elsize=157) at develop/src/libheap/allocwrap.C:395
#9  0x00007f7a6e437817 in  () at /mnt/repo/bin/libcublas.so.11
#10 0x00007f7a6e43a914 in  () at /mnt/repo/bin/libcublas.so.11
#11 0x00007f7a6e42a15c in  () at /mnt/repo/bin/libcublas.so.11
#12 0x00007f7a6e42bcf8 in  () at /mnt/repo/bin/libcublas.so.11
#13 0x00007f7a56348037 in __cxa_finalize () at /lib64/libc.so.6
#14 0x00007f7a6dc3c8c3 in  () at /mnt/repo/bin/libcublas.so.11
#15 0x00007ffc4e07d630 in  ()
#16 0x00007f7aac3c5c96 in _dl_fini () at /lib64/ld-linux-x86-64.so.2

Here is a different stack I got when using a debug build of ORT:

#0  0x00007f84b391437f in raise () at /lib64/libc.so.6
#1  0x00007f84b38fedb5 in abort () at /lib64/libc.so.6
#2  0x00007f84b42ce09b in __gnu_cxx::__verbose_terminate_handler() [clone .cold.1] () at /lib64/libstdc++.so.6
#3  0x00007f84b42d453c in __cxxabiv1::__terminate(void (*)()) () at /lib64/libstdc++.so.6
#4  0x00007f84b42d4597 in  () at /lib64/libstdc++.so.6
#5  0x00007f84b42d53f5 in  () at /lib64/libstdc++.so.6
#6  0x00007f8490639c50 in onnxruntime::ProviderSharedLibrary::Unload() (this=0x7f84925db1c0 <onnxruntime::s_library_shared>)
    at /repo/onnx/onnxruntime-1.18.0/onnxruntime/core/session/provider_bridge_ort.cc:1385
#7  0x00007f849063a5ce in onnxruntime::UnloadSharedProviders() () at /repo/onnx/onnxruntime-1.18.0/onnxruntime/core/session/provider_bridge_ort.cc:1534
#8  0x00007f8490629229 in OrtEnv::~OrtEnv() (this=0x1cb10580, __in_chrg=<optimized out>) at /repo/onnx/onnxruntime-1.18.0/onnxruntime/core/session/ort_env.cc:31
#9  0x00007f849062a510 in std::default_delete<OrtEnv>::operator()(OrtEnv*) const (this=0x7f84925db118 <OrtEnv::p_instance_>, __ptr=0x1cb10580) at /usr/include/c++/8/bits/unique_ptr.h:81
#10 0x00007f8490629c83 in std::unique_ptr<OrtEnv, std::default_delete<OrtEnv> >::~unique_ptr() (this=0x7f84925db118 <OrtEnv::p_instance_>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/unique_ptr.h:277
#11 0x00007f84b3917037 in __cxa_finalize () at /lib64/libc.so.6
#12 0x00007f84905dfc17 in __do_global_dtors_aux () at /mnt/onnxruntime/lib/libonnxruntime.so.1.18.0
#13 0x00007ffd05bf6d20 in  ()
#14 0x00007f8509a2bc96 in _dl_fini () at /lib64/ld-linux-x86-64.so.2

This appears to be a regression in ORT 1.18: we could not reproduce it with ORT 1.14.1.

We are using the default OrtCUDAProviderOptions values.
We are using the default arena allocator.

To reproduce

Run inference on our model with any input using the CUDA EP. (The model will be provided to Microsoft directly, as it contains proprietary content.)
The program crashes on exit.

Urgency

No response

Platform

Linux

OS Version

Rocky Linux 8.5 (gcc 11.2.1, C++17)

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18.0

ONNX Runtime API

C

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8

@github-actions bot added the ep:CUDA label (issues related to the CUDA execution provider) on Jun 28, 2024
@snnn
Member

snnn commented Jun 28, 2024

Please do not define OrtEnv as a global variable (or a function-local static). Please explicitly delete it before the main function returns.
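
A minimal sketch of the lifetime pattern described above, not ORT code: `FakeEnv` and `FakeSession` are hypothetical stand-ins for the environment and a session, and `run_app` plays the role of `main`. The idea is that both objects are created and released inside that scope, in reverse order of creation, so nothing is left for static destructors or `_dl_fini` to tear down after the CUDA libraries have started unloading.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for OrtEnv. It only records when it is created
// and released so the ordering can be inspected.
struct FakeEnv {
    explicit FakeEnv(std::vector<std::string>& log) : log_(log) {
        log_.push_back("env created");
    }
    ~FakeEnv() { log_.push_back("env released"); }
    std::vector<std::string>& log_;
};

// Hypothetical stand-in for an inference session that depends on the env.
struct FakeSession {
    FakeSession(FakeEnv& /*env*/, std::vector<std::string>& log) : log_(log) {
        log_.push_back("session created");
    }
    ~FakeSession() { log_.push_back("session released"); }
    std::vector<std::string>& log_;
};

// Plays the role of main(): the env is a local object, not a global or a
// function-local static, and it is released before this function returns.
std::vector<std::string> run_app() {
    std::vector<std::string> log;
    {
        auto env = std::make_unique<FakeEnv>(log);
        auto session = std::make_unique<FakeSession>(*env, log);

        log.push_back("inference done");

        session.reset();  // release the session first
        env.reset();      // then the env, while still inside "main"
    }
    log.push_back("returning from main");
    return log;
}
```

In a real program the body of `run_app` would sit directly in `main`, with `ReleaseSession` and `ReleaseEnv` from the ORT C API (or the destructors of the C++ wrapper objects) taking the place of the two `reset()` calls.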

@frenetj
Author

frenetj commented Jun 28, 2024

Sure thing. Thanks a lot for the quick reply.

@frenetj frenetj closed this as completed Jun 28, 2024