
GPU Usage 100% #17942

Closed
lfch opened this issue Oct 13, 2023 · 3 comments
Labels
ep:CUDA (issues related to the CUDA execution provider) · ep:TensorRT (issues related to the TensorRT execution provider) · stale (issues that have not been addressed in a while; categorized by a bot)

Comments

lfch commented Oct 13, 2023

Describe the issue

Issue Description

I wrote a Go (cgo) wrapper around onnxruntime-gpu for our model inference service. The onnxruntime shared library is loaded only once, when the service starts. A dedicated OrtSession object is created for each new model version and destroyed when a newer version arrives; the old OrtSession is destroyed under the guard of a read-write lock (a sketch of this locking pattern follows the list below).
After the service had run for several hours with continuous model version updates, we hit the following situation:

  1. GPU usage rises to 100% and stays there; see image 0.
  2. Two threads permanently occupy two CPU cores (200% total); see image 1. Their stacks are shown in the snippet below.
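
A minimal sketch of the read path of that locking pattern, under assumptions: ortSession, Run, and Release here are hypothetical stand-ins for our cgo bindings, not the real wrapper code.

```go
package main

import (
	"errors"
	"sync"
)

// ortSession is a hypothetical stand-in for the cgo wrapper around a
// C OrtSession*; in the real service, Run and Release call through cgo
// into libonnxruntime.so.
type ortSession struct{}

func (s *ortSession) Run(in []float32) ([]float32, error) { return in, nil } // placeholder
func (s *ortSession) Release()                            {}                // placeholder

// modelServer holds the session for the currently deployed model
// version, guarded by the read-write lock mentioned above.
type modelServer struct {
	mu      sync.RWMutex
	session *ortSession
}

// run performs one inference while holding the read lock, so the
// session cannot be released underneath an in-flight call.
func (m *modelServer) run(in []float32) ([]float32, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	if m.session == nil {
		return nil, errors.New("no model version loaded")
	}
	return m.session.Run(in)
}
```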

Expected Behavior

The service runs normally and the situation above does not occur.

Versions

onnxruntime-gpu == 1.15.1 (downloaded from the GitHub releases page)
CUDA == 11.2
GPU device: A30

Files

image 0, GPU usage

[screenshot 2023-10-13 22:02:14]

image 1, CPU usage

[screenshot 2023-10-13 22:02:28]

Thread stacks

```
Thread 15 (Thread 0x7f175d7fa700 (LWP 69607)):
#0 0x00007f15bb484bec in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007f15bb6abd62 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007f15bb6ac879 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007f15bb7e7450 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f15bb439ce3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007f15bb43a1d1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007f15bb43b138 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007f15bb60d251 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8 0x00007f17140f04e9 in ?? () from /usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudart.so.11.0
#9 0x00007f17140ca9ed in ?? () from /usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudart.so.11.0
#10 0x00007f171410ee96 in cudaMemcpyAsync () from /usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudart.so.11.0
#11 0x00007f158b6a381d in onnxruntime::GPUDataTransfer::CopyTensorAsync(onnxruntime::Tensor const&, onnxruntime::Tensor&, onnxruntime::Stream&) const () from /root/go/src/lib/libonnxruntime_providers_cuda.so
#12 0x00007f16937fa5a3 in ?? () from /root/go/src/lib//libonnxruntime.so
#13 0x00007f1693091f48 in ?? () from /root/go/src/lib//libonnxruntime.so
#14 0x00007f158b891497 in onnxruntime::IDataTransfer::CopyTensors(std::vector<onnxruntime::IDataTransfer::SrcDstPair, std::allocator<onnxruntime::IDataTransfer::SrcDstPair> > const&) const () from /root/go/src/lib/libonnxruntime_providers_cuda.so
#15 0x00007f16937fb58c in ?? () from /root/go/src/lib//libonnxruntime.so
#16 0x00007f169389ddcb in ?? () from /root/go/src/lib//libonnxruntime.so
#17 0x00007f169389fa33 in ?? () from /root/go/src/lib//libonnxruntime.so
#18 0x00007f16938a01fc in ?? () from /root/go/src/lib//libonnxruntime.so
#19 0x00007f16930dff6c in ?? () from /root/go/src/lib//libonnxruntime.so
#20 0x00007f169306ef67 in ?? () from /root/go/src/lib//libonnxruntime.so
#21 0x0000000001673d56 in _cgo_de0f6483b9ae_Cfunc_RunOrtSession (v=0xc00cceb098) at cgo-gcc-prolog:431
#22 0x000000000047c624 in runtime.asmcgocall () at /usr/local/go/src/runtime/asm_amd64.s:821
#23 0x000000c031b244e0 in ?? ()
#24 0x0000000000000004 in ?? ()
#25 0x000000c00cceaff8 in ?? ()
#26 0x000000000047eb86 in time.now () at /usr/local/go/src/runtime/time_linux_amd64.s:52
#27 0x000000000a256179 in ?? ()
#28 0x00007f177f5b2f6f in ?? ()
#29 0x0000000000800000 in ?? () at :1
#30 0x0000000000000000 in ?? ()

Thread 68 (Thread 0x7f15517fe700 (LWP 312851)):
#0 0x00007f15bb70c823 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007f15bb469206 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007f15bb7ef2bf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007f15bb7eff6f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f15bb484bf7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007f15bb518928 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007f15bb5b9063 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007f15d75c3717 in ?? () from /usr/local/cuda-11.2/targets/x86_64-linux/lib/libcublas.so.11
#8 0x00007f15d75f3f15 in ?? () from /usr/local/cuda-11.2/targets/x86_64-linux/lib/libcublas.so.11
#9 0x00007f15d6c87bfc in ?? () from /usr/local/cuda-11.2/targets/x86_64-linux/lib/libcublas.so.11
#10 0x00007f15d6c89010 in ?? () from /usr/local/cuda-11.2/targets/x86_64-linux/lib/libcublas.so.11
#11 0x00007f15d6c87539 in ?? () from /usr/local/cuda-11.2/targets/x86_64-linux/lib/libcublas.so.11
#12 0x00007f15d6d4defa in cublasDestroy_v2 () from /usr/local/cuda-11.2/targets/x86_64-linux/lib/libcublas.so.11
#13 0x00007f17381d6159 in onnxruntime::CudaStream::~CudaStream() () from /root/go/src/lib/libonnxruntime_providers_tensorrt.so
#14 0x00007f17381d624d in onnxruntime::CudaStream::~CudaStream() () from /root/go/src/lib/libonnxruntime_providers_tensorrt.so
#15 0x00007f169380d60a in ?? () from /root/go/src/lib//libonnxruntime.so
#16 0x00007f169380d811 in ?? () from /root/go/src/lib//libonnxruntime.so
#17 0x00007f16930eb959 in ?? () from /root/go/src/lib//libonnxruntime.so
#18 0x00007f16930ee804 in ?? () from /root/go/src/lib//libonnxruntime.so
#19 0x00007f16930eed7d in ?? () from /root/go/src/lib//libonnxruntime.so
#20 0x000000000047c624 in runtime.asmcgocall () at /usr/local/go/src/runtime/asm_amd64.s:821
#21 0x0000000000452aed in runtime.park_m (gp=0xc0001dc340) at /usr/local/go/src/runtime/proc.go:3336
#22 0x000000c0036aa4e0 in ?? ()
#23 0x000000c0001dc340 in ?? ()
#24 0x0000000000000000 in ?? ()
```

Questions

  1. Is onnxruntime 1.15.1 compatible with CUDA 11.2?
  2. Is releasing an OrtSession concurrency-safe? In our case, an old OrtSession may be released at the same time a newer OrtSession object is running inference (the swap pattern we use is sketched below).
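
Continuing the hypothetical sketch from the issue description above, this is the swap path question 2 refers to (again stand-ins, not our real bindings):

```go
// swap installs a new model version. Acquiring the write lock drains
// all readers first, so once it is held no run() is in flight on the
// old session. Releasing after Unlock is still safe: new readers can
// only observe `next`, and it keeps teardown (which, per the stacks
// above, tears down CUDA streams and cuBLAS handles) off the
// inference hot path.
func (m *modelServer) swap(next *ortSession) {
	m.mu.Lock()
	old := m.session
	m.session = next
	m.mu.Unlock()
	if old != nil {
		old.Release()
	}
}

func main() {
	srv := &modelServer{}
	srv.swap(&ortSession{}) // deploy version 1
	if _, err := srv.run([]float32{1, 2, 3}); err != nil {
		panic(err)
	}
	srv.swap(&ortSession{}) // deploy version 2; v1 is released only after in-flight runs drain
}
```

With this pattern a Release should never overlap a Run on the same session; question 2 is whether ONNX Runtime requires that guarantee, or also tolerates a release running concurrently with Run calls on a different session.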

To reproduce

No

Urgency

No

Platform

Linux

OS Version

Ubuntu 20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.15.1

ONNX Runtime API

C

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

TensorRT 8.6.1, CUDA 11.2

Model File

No response

Is this a quantized model?

No

github-actions bot added the ep:CUDA (issues related to the CUDA execution provider) and ep:TensorRT (issues related to the TensorRT execution provider) labels on Oct 13, 2023
tianleiwu (Contributor) commented Oct 17, 2023

@lfch, could you upgrade to CUDA 11.8 to see whether the issue is still there? We have not tested CUDA 11.2, which is quite old.

If you need help, please share some test code/scripts and a model/data that could reproduce the issue; otherwise, other people cannot troubleshoot it.

github-actions bot commented
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions bot added the stale label (issues that have not been addressed in a while; categorized by a bot) on Nov 16, 2023
github-actions bot commented
This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.

github-actions bot closed this as not planned on Jan 11, 2024