ORT 1.18 crashes on exit after using Cuda EP to run inference on a specific model #21207

frenetj · 2024-06-28T15:04:32Z

Describe the issue

We see crashes on exit (during static uninitialization) after running inference on a specific model with the CUDA EP:

#0  0x00007f7a5634537f in raise () at /lib64/libc.so.6
#1  0x00007f7a5632fdb5 in abort () at /lib64/libc.so.6
#2  0x00007f7a563884e7 in __libc_message () at /lib64/libc.so.6
#3  0x00007f7a5638f5ec in .annobin_top_check.start () at /lib64/libc.so.6
#4  0x00007f7a5638fe2c in unlink_chunk.isra () at /lib64/libc.so.6
#5  0x00007f7a5638ff97 in malloc_consolidate () at /lib64/libc.so.6
#6  0x00007f7a56392368 in _int_malloc () at /lib64/libc.so.6
#7  0x00007f7a563948d6 in calloc () at /lib64/libc.so.6
#8  0x00007f7aac588224 in calloc(size_t, size_t) (nelem=8, elsize=157) at develop/src/libheap/allocwrap.C:395
#9  0x00007f7a6e437817 in  () at /mnt/repo/bin/libcublas.so.11
#10 0x00007f7a6e43a914 in  () at /mnt/repo/bin/libcublas.so.11
#11 0x00007f7a6e42a15c in  () at /mnt/repo/bin/libcublas.so.11
#12 0x00007f7a6e42bcf8 in  () at /mnt/repo/bin/libcublas.so.11
#13 0x00007f7a56348037 in __cxa_finalize () at /lib64/libc.so.6
#14 0x00007f7a6dc3c8c3 in  () at /mnt/repo/bin/libcublas.so.11
#15 0x00007ffc4e07d630 in  ()
#16 0x00007f7aac3c5c96 in _dl_fini () at /lib64/ld-linux-x86-64.so.2

Here is a different stack I got when using a debug build of ORT:

#0  0x00007f84b391437f in raise () at /lib64/libc.so.6
#1  0x00007f84b38fedb5 in abort () at /lib64/libc.so.6
#2  0x00007f84b42ce09b in __gnu_cxx::__verbose_terminate_handler() [clone .cold.1] () at /lib64/libstdc++.so.6
#3  0x00007f84b42d453c in __cxxabiv1::__terminate(void (*)()) () at /lib64/libstdc++.so.6
#4  0x00007f84b42d4597 in  () at /lib64/libstdc++.so.6
#5  0x00007f84b42d53f5 in  () at /lib64/libstdc++.so.6
#6  0x00007f8490639c50 in onnxruntime::ProviderSharedLibrary::Unload() (this=0x7f84925db1c0 <onnxruntime::s_library_shared>)
    at /repo/onnx/onnxruntime-1.18.0/onnxruntime/core/session/provider_bridge_ort.cc:1385
#7  0x00007f849063a5ce in onnxruntime::UnloadSharedProviders() () at /repo/onnx/onnxruntime-1.18.0/onnxruntime/core/session/provider_bridge_ort.cc:1534
#8  0x00007f8490629229 in OrtEnv::~OrtEnv() (this=0x1cb10580, __in_chrg=<optimized out>) at /repo/onnx/onnxruntime-1.18.0/onnxruntime/core/session/ort_env.cc:31
#9  0x00007f849062a510 in std::default_delete<OrtEnv>::operator()(OrtEnv*) const (this=0x7f84925db118 <OrtEnv::p_instance_>, __ptr=0x1cb10580) at /usr/include/c++/8/bits/unique_ptr.h:81
#10 0x00007f8490629c83 in std::unique_ptr<OrtEnv, std::default_delete<OrtEnv> >::~unique_ptr() (this=0x7f84925db118 <OrtEnv::p_instance_>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/unique_ptr.h:277
#11 0x00007f84b3917037 in __cxa_finalize () at /lib64/libc.so.6
#12 0x00007f84905dfc17 in __do_global_dtors_aux () at /mnt/onnxruntime/lib/libonnxruntime.so.1.18.0
#13 0x00007ffd05bf6d20 in  ()
#14 0x00007f8509a2bc96 in _dl_fini () at /lib64/ld-linux-x86-64.so.2

This appears to be a regression in ORT 1.18, since we could not reproduce it in ORT 1.14.1.

We are using default CUDA OrtCUDAProviderOptions values.
We are using the default Arena allocator.

To reproduce

Run inference with the CUDA EP on our model, using any input. The model will be provided to Microsoft directly because it contains proprietary content.
The program crashes on exit.

Urgency

No response

Platform

Linux

OS Version

ROCKY 8.5 (gcc-11.2.1, c++17)

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18.0

ONNX Runtime API

C

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8


snnn · 2024-06-28T15:07:54Z

Please do not define OrtEnv as a global variable (or a function-local static). Please explicitly delete it before the main function returns.
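
The ownership pattern suggested here can be sketched as follows. This is only an illustration under stated assumptions, not code from ONNX Runtime: FakeEnv and run_with_scoped_env are hypothetical stand-ins that record lifetime events, and in a real program using the C API the corresponding calls would be OrtApi::CreateEnv and OrtApi::ReleaseEnv. The point is that the environment is owned by a scope inside main (or a function called from it) and released there, rather than being left to a global or static destructor that runs during exit, when other libraries may already be tearing down.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an ORT environment; NOT the real OrtEnv.
// It only records when it is created and destroyed.
struct FakeEnv {
    explicit FakeEnv(std::vector<std::string>& log) : log_(log) {
        log_.push_back("env created");
    }
    ~FakeEnv() { log_.push_back("env released"); }
    std::vector<std::string>& log_;
};

// Models the body of main(): the environment is owned locally and released
// explicitly before control returns to the caller, so nothing is left for
// static destruction to clean up.
std::vector<std::string> run_with_scoped_env() {
    std::vector<std::string> log;
    auto env = std::make_unique<FakeEnv>(log);  // in ORT: CreateEnv(...)
    log.push_back("inference done");            // sessions created and run here
    env.reset();                                // in ORT: ReleaseEnv(env)
    assert(!env);                               // the handle is gone at this point
    log.push_back("returning to caller");
    return log;
}
```

With a global or function-local static owner, the "env released" event would instead happen after main returns, in an order relative to other libraries' exit handlers that the program does not control.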

frenetj · 2024-06-28T15:12:16Z

Sure thing. Thanks a lot for the quick reply.

github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Jun 28, 2024

frenetj closed this as completed Jun 28, 2024
