[Web] Demucs model won't run in both WASM and WGPU #22031
Comments
For the WebGPU EP, the problem is related to the op Unsqueeze. According to the ONNX spec (https://onnx.ai/onnx/operators/onnx__Unsqueeze.html), the axes input of Unsqueeze is a list of integers, but in your model it's just a scalar "1".
So the problem is related to the dynamo export of torch?
Technically the axes should always be a 1D tensor. However, in reality, the CPU code has loosened this limit; perhaps WebGPU should have the same behavior as the CPU.
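For illustration, a minimal sketch of that lenient handling (the helper and tensor interface below are hypothetical, not the actual onnxruntime-web internals): treat a rank-0 axes tensor as if it were a length-1 vector.

```ts
// Hypothetical helper: read the Unsqueeze `axes` input, tolerating a scalar
// (rank-0) tensor in addition to the spec-mandated 1-D int64 tensor.
interface AxesTensor {
  dims: readonly number[];   // [] for a scalar, [n] for a 1-D tensor
  data: BigInt64Array;       // int64 payload
}

const readUnsqueezeAxes = (axes: AxesTensor): number[] => {
  // A rank-0 tensor has dims === [] but still carries exactly one element.
  const count = axes.dims.length === 0 ? 1 : axes.dims[0];
  return Array.from(axes.data.subarray(0, count), (v) => Number(v));
};

// readUnsqueezeAxes({ dims: [],  data: BigInt64Array.of(1n) })      -> [1]
// readUnsqueezeAxes({ dims: [2], data: BigInt64Array.of(0n, 2n) })  -> [0, 2]
```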
This is to fix issue microsoft#22031 to run the Demucs model. For ConvTranspose, outputPadding.length could be 1 while spatialRank is 2; the fix is to append enough 0s to outputPadding. For Conv the issue is similar: kernelShape.length can sometimes be 1 while inputs[1].dims.length is 4, and the fix is likewise to append enough 0s to kernelShape.
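A sketch of the zero-padding idea (names are illustrative, not the literal patch):

```ts
// Right-pad an attribute array with zeros until it reaches the expected
// length, e.g. outputPadding [1] with spatialRank 2 becomes [1, 0].
const padWithZeros = (values: readonly number[], expectedLength: number): number[] =>
  values.length >= expectedLength
    ? [...values]
    : [...values, ...new Array<number>(expectedLength - values.length).fill(0)];

// padWithZeros([1], 2) -> [1, 0]   (ConvTranspose outputPadding)
// padWithZeros([3], 2) -> [3, 0]   (Conv kernelShape)
```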
@gyagp With the latest 1.20.0-dev.20240917-afd642a194, which should include both fixes, I still cannot run the model in webgpu; the runtime just aborts after displaying the wgpu experimental warning.
I also hit an issue with the latest code, and I will take a further look.
Your model successfully runs with the latest @dev, with timings (60 s of audio with 10 s chunks): wasm:, wgpu:, onnx python cpu:, on a Ryzen 4600H.
I have also tried on a MacBook M1 Pro, with an average wgpu step of ~2.8 s.
While allowing axes in Unsqueeze to be a scalar, its shape can't always be accessed like a vector. This PR fixes issue microsoft#22031 so that the original model can run well.
@gyagp After implementing pre- and post-processing for the demixing of a whole track, I have noticed that the wgpu outputs are way different from the wasm ones. In wasm the model works as expected, while on GPU the stems are all mixed up, apart from the bass one: I suspect the lower frequencies are preserved while something strange happens with the higher ones. Maybe an error in some kernel? If you want, I can upload the stems of a 10 s chunk from wasm/wgpu inference somewhere, to show the difference. I'm certain the problem is not in the pre/post-processing, as the raw outputs of the model differ between the two backends. Also, any update on the MatMul problem?
Sorry to hear that you got different results from wasm and WebGPU. If you can upload your case somewhere, I can take a look next week.
I'll update the sample repo in this issue so that it computes on the same random arrays on both wasm and wgpu, to demonstrate that the outputs differ depending on the backend used. The MatMul problem is the one you mentioned here, i.e. the performance of the wgpu model is not that great.
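For reference, such a comparison could look roughly like this with onnxruntime-web (a WebGPU-enabled build is assumed, and the model path, input shape, and input values here are placeholders, not taken from the sample repo):

```ts
import * as ort from 'onnxruntime-web';

// Run the same random input through the wasm and webgpu backends and report
// the largest element-wise difference between the first outputs.
async function compareBackends(): Promise<void> {
  const dims = [1, 2, 441000];                      // assumed shape: 10 s of stereo audio
  const size = dims.reduce((a, b) => a * b, 1);
  const data = Float32Array.from({ length: size }, () => Math.random() * 2 - 1);

  const results: Float32Array[] = [];
  for (const ep of ['wasm', 'webgpu']) {
    const session = await ort.InferenceSession.create('demucs.onnx', {
      executionProviders: [ep],
    });
    const feeds = { [session.inputNames[0]]: new ort.Tensor('float32', data, dims) };
    const output = await session.run(feeds);
    results.push(output[session.outputNames[0]].data as Float32Array);
  }

  let maxDiff = 0;
  for (let i = 0; i < results[0].length; i++) {
    maxDiff = Math.max(maxDiff, Math.abs(results[0][i] - results[1][i]));
  }
  console.log(`max |wasm - webgpu| = ${maxDiff}`);
}
```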
Ah, sorry, it's a bit buried by other tasks. I will ask someone from my team to look into it next week.
BUG microsoft#22031. Optimize the two situations below:
1. Increase workgroupSize if only one workgroup is dispatched.
2. Avoid transpose if not necessary.
The overall time of the demucs model becomes 106.36 ms from 154.60 ms on my dGPUs with this PR and PR microsoft#22577.
Thank you very much @qjia7! On my MacBook Pro M1 the step is now 1.9 s, down from 2.8 s. I'm still seeing wrong outputs from the model in wgpu, while on wasm it works fine. If you want, I can upload some stems to the sample repo so you can see the difference.
@gianlourbano Will look at the wrong-outputs issue. And the optimization isn't over yet; there are still several places that need to be optimized.
@gianlourbano I did some debugging on this model. The incorrect result is because the
BUG #22031. In the demucs model there are lots of MatMul ops with shapes like: `input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32, output[0]: [3448,1,1536] | float32`. For this kind of shape the batch size is large but M = 1. Our current algorithm partitions tiles based on [M, N], which is not efficient for such shapes. This PR reshapes the inputs to improve the MatMul performance. Before: [3448,1,512] x [512,1536] = [3448,1,1536]. After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536]; the output can then be reshaped to [3448, 1, 1536]. The overall MatMul time in the demucs model becomes 1778.45 ms from 4418.17 ms on my iGPUs. Co-authored-by: Yulong Wang <[email protected]>
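To illustrate the reshaping trick (a conceptual sketch, not the actual kernel code):

```ts
// When M === 1 but the batch dimension is large, fold the batch into M so the
// [M, N] tile partitioning has enough rows to distribute across the GPU.
function foldBatchIntoM(aDims: number[]): { computeDims: number[]; needsReshape: boolean } {
  const [batch, m, k] = aDims;                                   // e.g. [3448, 1, 512]
  if (m === 1 && batch > 1) {
    return { computeDims: [1, batch, k], needsReshape: true };   // compute as [1, 3448, 512]
  }
  return { computeDims: aDims, needsReshape: false };
}

// foldBatchIntoM([3448, 1, 512]) -> { computeDims: [1, 3448, 512], needsReshape: true }
// The [1, 3448, 1536] result is then viewed back as [3448, 1, 1536].
```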
BUG microsoft#22031. The total Gemm time in the demucs model becomes 181.14 ms from over 1000 ms on my iGPUs.
…microsoft#22709) microsoft#22031 For reduce-related ops, we should increase workgroupSize to improve parallelism if only one workgroup is dispatched. The total ReduceMean time becomes 8.98 ms from 77.79 ms on my iGPUs.
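A rough sketch of that heuristic (the constants and selection logic below are illustrative assumptions, not the merged code):

```ts
// If the reduction would dispatch only a single workgroup, grow the workgroup
// size so more invocations cooperate on that one reduction.
function chooseWorkgroupSize(dispatchedWorkgroups: number, defaultSize = 64, maxSize = 256): number {
  return dispatchedWorkgroups === 1 ? maxSize : defaultSize;
}

// chooseWorkgroupSize(1)   -> 256  (one workgroup: use the larger size)
// chooseWorkgroupSize(128) -> 64   (plenty of workgroups: keep the default)
```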
BUG microsoft#22031. The overall time of ConvTranspose in the Demucs model becomes 517.41 ms from 1415.65 ms on my iGPUs.
@gianlourbano, Google improved the perf via https://issues.chromium.org/issues/379009123. Could you please try the latest Chrome Canary to see the performance? @qjia7 and I may not have access to a MacBook for now.
Closing the issue, as the corresponding fixes and features are merged and verified.
Describe the issue
I converted the model from PyTorch to ONNX as described here, with some issues. The model works in onnx python, but in wasm/webgpu the runtime dies without error. The optimized version of the model runs in wasm, but not in webgpu. I don't know whether this problem is related to the model conversion or to the runtime. I have tested with both @latest and @dev.
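For context, a minimal session setup along these lines could look like the sketch below (the model filename is a placeholder and a WebGPU-enabled onnxruntime-web build is assumed; this is not taken verbatim from the sample repo):

```ts
import * as ort from 'onnxruntime-web';

// Try the WebGPU EP first and fall back to the wasm backend if it is unavailable.
const session = await ort.InferenceSession.create('demucs.onnx', {
  executionProviders: ['webgpu', 'wasm'],
});
console.log('inputs:', session.inputNames, 'outputs:', session.outputNames);
```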
To reproduce
Here's a link to a sample repo, instructions in README.
Urgency
Urgent, as this project is related to my thesis
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.19.2, 1.20.0-dev.20240907-ad9afbb042
Execution Provider
'wasm'/'cpu' (WebAssembly CPU), 'webgpu' (WebGPU)