Multi-threaded GPU inferencing failing with whisper-small: Non-zero status code returned while running DecoderMaskedMultiHeadAttention node #21413
Comments
It is a known issue that some operators are not thread-safe. For example, the Attention and MultiHeadAttention operators used in the Whisper encoder are not thread-safe. You may try setting an environment variable. Another workaround is to use a different session per thread, which avoids the thread-safety issue but probably won't help performance, since multiple sessions compete for the same GPU resources. What's the reason to use multiple threads on the same session? It usually has no performance benefit. Instead, you can try increasing the batch size to see whether that increases throughput.
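For illustration, here is a minimal sketch of the per-thread-session workaround mentioned above; the model path, provider list, and per-chunk `feeds` dictionaries are assumptions rather than anything from this issue:

```
# Sketch of the "different session per thread" workaround. Each thread lazily
# creates and caches its own InferenceSession via threading.local(), so no
# operator state is shared between threads. MODEL_PATH and the contents of the
# feeds dictionaries are placeholders that depend on the exported model.
import threading
from concurrent.futures import ThreadPoolExecutor

import onnxruntime as ort

MODEL_PATH = "whisper_small_gpu.onnx"  # assumed path to the exported model
_tls = threading.local()


def _get_session() -> ort.InferenceSession:
    # Create the session once per thread and cache it on thread-local storage.
    if not hasattr(_tls, "session"):
        _tls.session = ort.InferenceSession(
            MODEL_PATH, providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
        )
    return _tls.session


def transcribe(feeds: dict):
    # feeds maps the model's input names to numpy arrays for one audio chunk.
    return _get_session().run(None, feeds)


# Usage: chunk_feeds is a list of per-chunk feed dictionaries.
# with ThreadPoolExecutor(max_workers=4) as pool:
#     results = list(pool.map(transcribe, chunk_feeds))
```

As noted above, each thread loads its own copy of the model into GPU memory, so this sidesteps the thread-safety problem at the cost of memory and GPU contention.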
Sadly, setting those environment variables did not help. I still see similar errors, but also new ones that look even worse:
I definitely want to use the same session so the model is only loaded into GPU memory once. It is true that batching can be used here instead, and I'll look into that. I was initially looking at threading because the Whisper pre-processing, which reads the audio data and converts it into the appropriate format (padding, then log-mel spectrogram conversion), is CPU-only work that could be parallelised. With a batching approach this pre-processing work will be serialised. The documents I read indicated that the ORT session is thread-safe. Since this is not the case, is there a list documented somewhere of the operators that are not?
@david-sitsky, you may try using multiple threads for the CPU pre-processing to see whether that helps. The new error indicates that other places in the CUDA provider have thread-safety issues. It may take time to nail down the root cause. A related older issue: #18806
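A small sketch of that suggestion, assuming a hypothetical `preprocess` stand-in for the padding and log-mel conversion and an `input_features` input name (both assumptions, not taken from this issue): the CPU work fans out across threads while the shared session is only called from one thread.

```
# Parallelise the CPU-only pre-processing across threads, then run inference on
# the shared session from a single thread. preprocess() is a hypothetical
# stand-in for the real padding + log-mel spectrogram code, and the
# "input_features" input name is an assumption about the exported model.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import onnxruntime as ort

MODEL_PATH = "whisper_small_gpu.onnx"  # assumed model path


def preprocess(chunk: np.ndarray) -> np.ndarray:
    # Placeholder: pad the chunk and compute an 80-mel, 3000-frame spectrogram.
    return np.zeros((1, 80, 3000), dtype=np.float32)


def transcribe_chunks(session: ort.InferenceSession, chunks):
    # CPU-bound pre-processing in parallel threads.
    with ThreadPoolExecutor(max_workers=8) as pool:
        features = list(pool.map(preprocess, chunks))

    # GPU inference serialised on the calling thread, one chunk at a time
    # (or batched along the first dimension if the model supports it).
    return [session.run(None, {"input_features": mel}) for mel in features]
```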
### Description

- [x] Rewrite FusedMHARunnerFP16v2 to make it thread-safe.
- [x] Add multi-threading tests

Previously, the kernel parameter struct `params` was stored as a member of the MHA runner, which means that different threads might change `params` at the same time and impact each other. For example, if `batch_size` and `seq_len` were changed by another thread to larger values in `setup(...)`, a buffer overrun might happen in `run(...)` because a kernel could read/write memory outside the range of the allocated buffers.

In the new implementation, the API is changed and the mutable member variables are removed to make it thread-safe. Below is a summary of the change.

Before:
```
class FusedMHARunnerFP16v2::mhaImpl {
  void setup(int seq_len, int batch_size) {
    // change scalar params
  }

  void run(input, output) {
    // change params for input and output pointers
    // launch kernel using params
  }

  Fused_multihead_attention_params_v2 params;  // mutable, not thread-safe
};
```

After:
```
class FusedMHARunnerFP16v2::FmhaImpl {
  void setup(int seq_len, int batch_size, Fused_multihead_attention_params_v2& params) {
    // change params
  }

  void run(params, input, output) {
    // change params with input and output pointers
    // launch kernel using params
  }
};
```

### Motivation and Context

#18854 #21413
@tianleiwu - any ideas on next steps for the thread-safety issue? I am using djl-serving, so the server sometimes receives multiple unrelated requests from different clients against the same Whisper model, and I see the issue being hit pretty easily. I am using ONNX Runtime 1.17.3.
@david-sitsky, the BeamSearch operator is not thread-safe since it has internal state. One approach for serving is to have a queue for users' requests and batch the queued requests into a single onnxruntime inference call, avoiding multi-threaded use of the session. I am not sure whether that is supported by djl-serving. I could take a look at a design change to make it thread-safe. That might be targeted for the 1.20 release (3+ months away).
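For illustration, a rough sketch of that queue-based pattern in Python (the model path is an assumption, and `feeds` stands for whatever inputs the exported model expects); batching several queued requests into a single call would be a further refinement on top of this:

```
# Request handlers enqueue work; a single worker thread owns the session and
# drains the queue, so onnxruntime is only ever called from one thread.
import queue
import threading

import onnxruntime as ort

MODEL_PATH = "whisper_small_gpu.onnx"  # assumed model path
request_queue: queue.Queue = queue.Queue()


def inference_worker():
    session = ort.InferenceSession(MODEL_PATH, providers=["CUDAExecutionProvider"])
    while True:
        item = request_queue.get()
        if item is None:  # shutdown sentinel
            break
        feeds, done, holder = item
        holder["output"] = session.run(None, feeds)
        done.set()  # wake the waiting request handler


def submit(feeds: dict):
    # Called from any request-handling thread; blocks until the worker replies.
    done, holder = threading.Event(), {}
    request_queue.put((feeds, done, holder))
    done.wait()
    return holder["output"]


threading.Thread(target=inference_worker, daemon=True).start()
```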
Many thanks @tianleiwu - that would be great!
Describe the issue
I created a Whisper ONNX model using https://github.com/microsoft/Olive/blob/main/examples/whisper/README.md, specifically using these commands on a machine with a GPU:
In my application, I break up large audio files into chunks and then execute each chunk against the Whisper model for transcription using threads.
On a CPU machine (with a model generated on a non-GPU machine) this works fine. However, on a GPU machine it fails.
To reproduce
On a g5.4xlarge instance in AWS, running Ubuntu 22.04 with CUDA 11.8 and an Nvidia A10G GPU, the following program reproduces the error:
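A rough, hypothetical sketch of the multi-threaded pattern described above; the model path, input names, and decoding parameters are assumptions based on the Olive Whisper example rather than the original script:

```
# Run the same InferenceSession from several threads at once. The input names
# below (audio_stream, max_length, ...) are assumptions based on the Olive
# end-to-end Whisper export and may differ for other exports.
import threading

import numpy as np
import onnxruntime as ort

THREAD_NUMBER = 4
MODEL_PATH = "whisper_gpu.onnx"  # assumed path to the Olive-exported model
AUDIO_PATH = "speech.wav"

session = ort.InferenceSession(MODEL_PATH, providers=["CUDAExecutionProvider"])

with open(AUDIO_PATH, "rb") as f:
    audio_bytes = np.frombuffer(f.read(), dtype=np.uint8)


def transcribe():
    feeds = {
        "audio_stream": np.expand_dims(audio_bytes, axis=0),
        "max_length": np.array([200], dtype=np.int32),
        "min_length": np.array([0], dtype=np.int32),
        "num_beams": np.array([2], dtype=np.int32),
        "num_return_sequences": np.array([1], dtype=np.int32),
        "length_penalty": np.array([1.0], dtype=np.float32),
        "repetition_penalty": np.array([1.0], dtype=np.float32),
    }
    print(session.run(None, feeds)[0])


threads = [threading.Thread(target=transcribe) for _ in range(THREAD_NUMBER)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```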
An example of an error run:
The program works fine when `THREAD_NUMBER=1`.

The speech.wav file can be downloaded from https://resources.djl.ai/audios/speech.wav.
If it helps, I can try and put the model I generated somewhere, but it is 1.1G in size.
Urgency
This is a blocker for deploying our application, so it is urgent. We are actually using the Java bindings of ONNX Runtime with https://github.com/deepjavalibrary/djl-serving, which hits this very issue when it receives concurrent requests. I wrote the Python program so that the problem is easier to reproduce, but it produces exactly the same error message.
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu-1.18.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.8