
Trouble using sherpa-onnx-offline-websocket-server with the cuda provider #1053

Closed
Vergissmeinicht opened this issue Jun 24, 2024 · 19 comments


@Vergissmeinicht commented Jun 24, 2024

I followed the instructions at https://k2-fsa.github.io/sherpa/onnx/websocket/offline-websocket.html to start a non-streaming websocket server with transducer models. It works well with the client, too. But when I run the client in multiple threads, that is, several threads each using the websocket client to recognize wav files one by one at the same time, the server raises a CUDA error:

2024-06-24 09:47:01.083093543 [E:onnxruntime:, cuda_call.cc:116 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=a2d9f82c2221 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=408 ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(stream_));
2024-06-24 09:47:01.083005575 [E:onnxruntime:, cuda_call.cc:116 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=a2d9f82c2221 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/gpu_data_transfer.cc ; line=73 ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyDeviceToHost, static_cast<cudaStream_t>(stream.GetHandle()));
terminate called after throwing an instance of 'Ort::Exception'
  what():  CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=a2d9f82c2221 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=408 ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(stream_));
Aborted
My server runs on a GeForce RTX 4090 with driver 535.104.05 and CUDA 12.2.

I would be glad to have your help.

@csukuangfj
Collaborator

Does the server work fine when you use CPU?

@Vergissmeinicht
Author

Yes, it works fine when using the CPU provider.

@csukuangfj
Collaborator

Could you tell us how you start the server?
Please post the full command.

@Vergissmeinicht
Author

CUDA_VISIBLE_DEVICES=2 ./bin/sherpa-onnx-offline-websocket-server \
  --provider=cuda \
  --port=6006 \
  --num-work-threads=10 \
  --tokens=sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt \
  --encoder=sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.onnx \
  --decoder=sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx \
  --joiner=sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.onnx \
  --log-file=./log.txt \
  --max-batch-size=5

@csukuangfj
Collaborator

Could you change

lock.unlock();
// Note: DecodeStreams is thread-safe
recognizer_.DecodeStreams(p_ss.data(), size);

to

recognizer_.DecodeStreams(p_ss.data(), size);
lock.unlock();

then recompile and retry?
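For illustration, here is a minimal, self-contained sketch of the locking pattern this change produces; mutex_, Worker, and the DecodeStreams stand-in below are hypothetical names for this example, not the actual sherpa-onnx source. The point is that each worker thread keeps the mutex held across the decode call and releases it only afterwards, so the session is never entered by two threads at once.

// Sketch only: stand-ins for the real server's recognizer and worker loop.
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mutex_;

// Stand-in for recognizer_.DecodeStreams(): assume it touches shared
// CUDA state and therefore must not run from two threads concurrently.
void DecodeStreams(int batch_id) {
  std::printf("decoding batch %d\n", batch_id);
}

void Worker(int batch_id) {
  std::unique_lock<std::mutex> lock(mutex_);
  // ... build the batch of streams while holding the lock ...
  DecodeStreams(batch_id);  // decode while still holding the lock
  lock.unlock();            // release only after decoding has finished
}

int main() {
  std::vector<std::thread> workers;
  for (int i = 0; i != 4; ++i) workers.emplace_back(Worker, i);
  for (auto &w : workers) w.join();
  return 0;
}

(Compile with, e.g., g++ -std=c++11 -pthread.)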

@Vergissmeinicht
Author

It works fine now. So is there a bug here?

@csukuangfj
Collaborator

> It works fine now. So is there a bug here?

I think it is a bug in onnxruntime.

When using the CPU, the onnxruntime session is thread-safe.
However, it is not thread-safe when using the CUDA provider.

Please see
microsoft/onnxruntime#114

@manickavela29
Contributor

Hi @csukuangfj,

Could the issue be that @Vergissmeinicht is using a local onnxruntime? With the onnxruntime shipped by sherpa-onnx (onnxruntime 1.17.1), it is stable on my machines.

@Vergissmeinicht
Author

> Hi @csukuangfj,
>
> Could the issue be that @Vergissmeinicht is using a local onnxruntime? With the onnxruntime shipped by sherpa-onnx (onnxruntime 1.17.1), it is stable on my machines.

I built sherpa-onnx with no local onnxruntime. The onnxruntime installation is the one provided by CMake.
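For reference, a GPU build that lets CMake download onnxruntime automatically looks roughly like this; the SHERPA_ONNX_ENABLE_GPU option follows the sherpa-onnx documentation, and the exact steps may differ for your setup:

git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build
cd build
cmake -DSHERPA_ONNX_ENABLE_GPU=ON ..
make -j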

@Vergissmeinicht (Author) commented Jun 26, 2024

@csukuangfj The server has been running and has recognized 200k wav files. Everything works fine, except that memory consumption seems to have increased by nearly 3 GB. I made no further modifications to the source code. Is it possible that there is a memory leak?

@csukuangfj
Collaborator

Did the CPU RAM or the GPU RAM increase by 3 GB?

Do you mean 20,000 wav files or just 200 wav files?

@csukuangfj
Collaborator

> Hi @csukuangfj,
>
> Could the issue be that @Vergissmeinicht is using a local onnxruntime? With the onnxruntime shipped by sherpa-onnx (onnxruntime 1.17.1), it is stable on my machines.

@Vergissmeinicht Could you look into this comment?

@Vergissmeinicht
Author

> Did the CPU RAM or the GPU RAM increase by 3 GB?
>
> Do you mean 20,000 wav files or just 200 wav files?

It has been serving for 2 days and the memory consumption now stays stable. No more worries about a memory leak! :)

@Vergissmeinicht
Author

> Hi @csukuangfj,
> Could the issue be that @Vergissmeinicht is using a local onnxruntime? With the onnxruntime shipped by sherpa-onnx (onnxruntime 1.17.1), it is stable on my machines.
>
> @Vergissmeinicht Could you look into this comment?

I already replied to this comment. I built the whole project inside a Docker container without any onnxruntime installed.

@csukuangfj
Collaborator

Are you also running sherpa-onnx inside the docker container?

@Vergissmeinicht
Author

> Are you also running sherpa-onnx inside the docker container?

Yes. I use nvidia/cuda:11.1.1-cudnn8-devel-ubuntu20.04 as my base image.

@csukuangfj
Collaborator

Can it be closed now?

@Vergissmeinicht
Author

> Can it be closed now?

So does it make a difference whether the recognizer decodes before or after the unlock?

@csukuangfj
Collaborator

For the CUDA provider, since the onnxruntime session is not thread-safe, we have to decode first and then unlock.

For the CPU provider, the onnxruntime session is thread-safe, so we can unlock first and then decode.
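To make the difference concrete, here is a hedged, self-contained sketch of what a provider-dependent variant could look like; FakeRecognizer, provider_, and mutex_ are illustrative stand-ins for this example, not the actual sherpa-onnx code:

// Sketch only: the real server uses sherpa_onnx::OfflineRecognizer and its
// own configuration; these types and names are stand-ins for illustration.
#include <mutex>
#include <string>

struct FakeRecognizer {     // stand-in for the real recognizer
  void DecodeStreams() {}   // stand-in for the real batched decode call
};

std::mutex mutex_;
FakeRecognizer recognizer_;
std::string provider_ = "cuda";  // assumed config field: "cpu" or "cuda"

void Decode() {
  std::unique_lock<std::mutex> lock(mutex_);
  // ... build the batch of streams while holding the lock ...
  if (provider_ == "cuda") {
    // CUDA: the session is not thread-safe, so decode under the lock.
    recognizer_.DecodeStreams();
    lock.unlock();
  } else {
    // CPU: the session is thread-safe, so the lock can be released first
    // and decodes from different threads may overlap.
    lock.unlock();
    recognizer_.DecodeStreams();
  }
}

int main() { Decode(); return 0; }

Always holding the lock across the decode (as in the fix above) is the simpler, safe choice; the conditional version only matters if you want CPU decodes to overlap.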
