[Web] Demucs model won't run in both WASM and WGPU #22031

Closed
gianlourbano opened this issue Sep 9, 2024 · 41 comments
Labels
ep:WebGPU (ort-web webgpu provider), platform:web (issues related to ONNX Runtime web; typically submitted using template)

Comments

@gianlourbano

Describe the issue

I converted the model from PyTorch to ONNX as described here, with some issues. The model works in ONNX Runtime's Python API, but in wasm/webgpu the runtime dies without an error. The optimized version of the model runs in wasm, but not in webgpu. I don't know whether the problem is related to the model conversion or to the runtime. I have tested with both @latest and @dev.
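
For reference, a minimal sketch of the kind of Python-side check that confirms the exported model runs on CPU before moving to the browser. The file name, input name, and input shape below are assumptions, not taken from the repro repo.

```python
# Rough CPU-side validation sketch (not the author's actual script).
import numpy as np
import onnx
import onnxruntime as ort

model_path = "demucs.onnx"            # hypothetical path to the exported model
onnx.checker.check_model(model_path)  # basic structural validation of the graph

sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
print([(i.name, i.shape) for i in sess.get_inputs()])

# Hypothetical input: one 10 s stereo chunk at 44.1 kHz.
mix = np.random.randn(1, 2, 441000).astype(np.float32)
outputs = sess.run(None, {sess.get_inputs()[0].name: mix})
for info, arr in zip(sess.get_outputs(), outputs):
    print(info.name, arr.shape, arr.dtype)
```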

To reproduce

Here's a link to a sample repo; instructions are in the README.

Urgency

Urgent, as this project is related to my thesis

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.2, 1.20.0-dev.20240907-ad9afbb042

Execution Provider

'wasm'/'cpu' (WebAssembly CPU), 'webgpu' (WebGPU)

gianlourbano added the platform:web label Sep 9, 2024
github-actions bot added the ep:WebGPU (ort-web webgpu provider) label Sep 9, 2024
@gyagp

gyagp commented Sep 10, 2024

For the WebGPU EP, the problem is related to the Unsqueeze op. According to the ONNX spec (https://onnx.ai/onnx/operators/onnx__Unsqueeze.html), the axes input of Unsqueeze is a list of integers, but in your model it's just a scalar "1".
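
For illustration, a hedged Python sketch of how such a model could be patched so that every Unsqueeze `axes` input stored as an initializer becomes a 1-D tensor instead of a rank-0 scalar. This is not the exact workaround applied later in the thread, and it ignores `axes` produced by Constant nodes; the file paths are hypothetical.

```python
# Sketch: rewrite rank-0 scalar 'axes' initializers of Unsqueeze nodes as 1-D tensors.
import onnx
from onnx import numpy_helper

model = onnx.load("demucs.onnx")                          # hypothetical path
initializers = {init.name: init for init in model.graph.initializer}

for node in model.graph.node:
    if node.op_type == "Unsqueeze" and len(node.input) > 1:
        axes = initializers.get(node.input[1])
        if axes is not None and len(axes.dims) == 0:      # scalar axes, e.g. "1"
            fixed = numpy_helper.from_array(
                numpy_helper.to_array(axes).reshape(1),   # now a 1-D tensor of shape [1]
                name=axes.name,
            )
            axes.CopyFrom(fixed)

onnx.save(model, "demucs_unsqueeze_fixed.onnx")
```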

@gianlourbano
Author

So the problem is related to the dynamo export of torch?

@fs-eire
Contributor

fs-eire commented Sep 11, 2024

Technically, axes should always be a 1-D tensor. In practice, however, the CPU implementation has loosened that restriction:

https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/tensor/unsqueeze.cc#L60-L62

Perhaps WebGPU should have the same behavior as CPU.

#22054

fs-eire pushed a commit that referenced this issue Sep 17, 2024
This fixes issue #22031 so that the Demucs model can run.
For ConvTranspose, outputPadding.length could be 1 while spatialRank is 2; the fix is to append enough 0s to outputPadding. For Conv the issue is similar: kernelShape.length can sometimes be 1 while inputs[1].dims.length is 4; the fix is likewise to append enough 0s to kernelShape.
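
A tiny illustration of the normalization the commit describes, written here in Python; the actual change lives in onnxruntime-web's TypeScript Conv/ConvTranspose kernels, so treat this as a sketch of the idea only.

```python
def pad_to_rank(values, spatial_rank, fill=0):
    """Append `fill` until `values` has one entry per spatial dimension."""
    return list(values) + [fill] * (spatial_rank - len(values))

# outputPadding of length 1 with spatial rank 2 becomes a per-dimension list.
print(pad_to_rank([1], spatial_rank=2))     # [1, 0]
print(pad_to_rank([3, 3], spatial_rank=2))  # unchanged: [3, 3]
```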
@gianlourbano
Author

@gyagp With the latest 1.20.0-dev.20240917-afd642a194, which should include both fixes, I still cannot run the model in webgpu; the runtime just aborts after displaying the WebGPU experimental warning.

@gyagp

gyagp commented Sep 19, 2024

I also hit some issues with the latest code, and I will take a further look.
BTW, I previously modified the model manually to work around the Unsqueeze issue, and that model seems to run. I uploaded it to https://huggingface.co/webai-community/models/tree/main (click "download file" after demucs.onnx).
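
If it helps, a small sketch for fetching that patched model with the `huggingface_hub` package, assuming the repository and file name stay exactly as quoted above.

```python
# Download the manually patched demucs.onnx mentioned above.
# Requires `pip install huggingface_hub`; repo/file names are as quoted in this thread.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(repo_id="webai-community/models", filename="demucs.onnx")
print("model saved to", local_path)
```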

@gianlourbano
Author

gianlourbano commented Sep 19, 2024

Your model successfully runs with the latest @dev. Timings (60 s of audio processed in 10 s chunks):

wasm:
step 0: 12656 ms
step 1: 12864 ms
step 2: 13211 ms
step 3: 13164 ms
step 4: 13643 ms
step 5: 13687 ms

wgpu:
step 0: 10226 ms
step 1: 9612 ms
step 2: 9628 ms
step 3: 9647 ms
step 4: 9600 ms
step 5: 9562 ms

onnx python cpu:
step 0: 4.9 s
step 1: 4.9 s
step 2: 4.6 s
step 3: 4.9 s
step 4: 4.8 s
step 5: 4.6 s

All measured on a Ryzen 4600H.

@gianlourbano
Author

I have also tried on a MacBook Pro M1, with an average WebGPU step of ~2.8 s.

fs-eire pushed a commit that referenced this issue Sep 30, 2024
While axes in Unsqueeze is now allowed to be a scalar, its shape can't always be accessed like a vector. This PR fixes issue #22031 so that the original model runs correctly.
@gianlourbano
Author

@gyagp After implementing pre- and post-processing for demixing a whole track, I have noticed that the WebGPU outputs differ significantly from the wasm ones. In wasm the model works as expected, while on the GPU the stems are all mixed up, apart from the bass one: I suspect the lower frequencies are preserved while something strange happens with the higher ones. Maybe an error in some kernel?

If you want, I can upload the stems of a 10 s chunk from wasm/webgpu inference somewhere so you can see the difference. I'm certain the problem is not in the pre/post-processing, as the raw model outputs differ between the two backends.
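
For what it's worth, a sketch of how that difference could be quantified once the stems from each backend are dumped to disk; the .npy file names and array layout are hypothetical.

```python
# Compare stems produced by the wasm and webgpu backends on the same input chunk.
import numpy as np

wasm = np.load("stems_wasm.npy")        # e.g. shape (4 stems, 2 channels, samples)
wgpu = np.load("stems_webgpu.npy")

diff = np.abs(wasm - wgpu)
print("max abs diff:", diff.max())
print("relative L2 error:",
      np.linalg.norm(diff) / (np.linalg.norm(wasm) + 1e-12))  # near 0 if the backends agree
```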

Also, any update on the MatMul problem?

@gyagp

gyagp commented Oct 18, 2024

Sorry to hear that you got different results from wasm and WebGPU. If you can upload your case somewhere, I can take a look next week.
What's the MatMul problem?

@gianlourbano
Author

I'll update the sample repo in this issue so that it computes on the same random arrays on both wasm and webgpu, to demonstrate that the outputs differ depending on the backend used.

The MatMul problem is the one you mentioned here, i.e. the performance of the WebGPU model is not great.

@gyagp

gyagp commented Oct 18, 2024

Ah, sorry, it got a bit buried under other tasks. I will ask someone from my team to look into it next week.

guschmue pushed a commit that referenced this issue Oct 30, 2024
BUG #22031

Optimize the following two situations:
1. Increase workgroupSize when only one workgroup is dispatched.
2. Avoid transpose when it is not necessary.

With this PR and PR #22577, the overall time of the Demucs model drops from 154.60 ms to 106.36 ms on my dGPUs.
@gianlourbano
Author

gianlourbano commented Oct 31, 2024

Thank you very much @qjia7! On my MacBook Pro M1 a step now takes 1.9 s, down from 2.8 s. I'm still seeing wrong outputs from the model in webgpu, while on wasm it works fine. If you want, I can upload some stems to the sample repo so you can see the difference.

@qjia7
Contributor

qjia7 commented Nov 1, 2024

@gianlourbano I will look at the wrong-outputs issue. The optimization work isn't finished yet either; there are still several places that need to be optimized.

@qjia7
Contributor

qjia7 commented Nov 1, 2024

@gianlourbano I debugged this model. The incorrect result is because the MatMul shader key is not unique, which results in the wrong compute pipeline being loaded. PR #22536 may fix the issue. You can give it a try once this PR is merged.
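
To illustrate the bug class (this is not ORT's actual caching code): if the key used to cache compiled pipelines omits something that changes the generated shader, two different MatMul configurations collide and a stale pipeline gets reused.

```python
# Toy pipeline cache. With the buggy key, distinct MatMul shapes share one entry,
# so the second shape silently reuses a pipeline compiled for the first.
pipeline_cache = {}

def get_pipeline(op, a_shape, b_shape, buggy=False):
    key = op if buggy else (op, tuple(a_shape), tuple(b_shape))
    if key not in pipeline_cache:
        pipeline_cache[key] = f"pipeline compiled for {op} {a_shape} x {b_shape}"
    return pipeline_cache[key]

print(get_pipeline("MatMul", [3448, 1, 512], [512, 1536], buggy=True))
print(get_pipeline("MatMul", [1, 3448, 512], [512, 1536], buggy=True))  # wrong reuse
```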

guschmue pushed a commit that referenced this issue Nov 1, 2024
BUG #22031

In the Demucs model there are many MatMul ops with shapes like:
`input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32, output[0]: [3448,1,1536] | float32`

For this kind of shape the batch size is large but M = 1. Our current algorithm partitions tiles based on [M, N], which is inefficient for such shapes. This PR reshapes the inputs to improve MatMul performance.
Before: [3448,1,512] x [512,1536] = [3448,1,1536]
After: [1,3448,512] x [512,1536] = [1,3448,1536], then the output is reshaped back to [3448,1,1536].

The overall MatMul time in the Demucs model drops from 4418.17 ms to 1778.45 ms on my iGPUs.

---------

Co-authored-by: Yulong Wang <[email protected]>
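
A quick NumPy check (illustrative only, with sizes taken from the shapes quoted above) of the equivalence that the reshape in this PR relies on:

```python
# [3448, 1, 512] x [512, 1536] as a batched matmul equals the reshaped
# [1, 3448, 512] x [512, 1536] product with the output reshaped back.
import numpy as np

a = np.random.randn(3448, 1, 512).astype(np.float32)
b = np.random.randn(512, 1536).astype(np.float32)

batched = a @ b                                           # shape (3448, 1, 1536)
reshaped = (a.reshape(1, 3448, 512) @ b).reshape(3448, 1, 1536)

print(np.allclose(batched, reshaped, atol=1e-4))          # True
```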
qjia7 added a commit to qjia7/onnxruntime that referenced this issue Nov 4, 2024
BUG microsoft#22031

The total Gemm time in the Demucs model drops from over 1000 ms to 181.14 ms on my iGPUs.
@gyagp

gyagp commented Nov 29, 2024

@gianlourbano, Google improved the performance via https://issues.chromium.org/issues/379009123. Could you please try the latest Chrome Canary and check the performance? @qjia7 and I may not have access to a MacBook at the moment.

@gianlourbano
Author

@gyagp @qjia7 I can confirm that on Canary the performance has returned to normal, even faster I would say (average step of 1.2 seconds). Thank you very much for your help!

guschmue pushed a commit that referenced this issue Dec 2, 2024
…#22709)

#22031

For reduce-related ops, we should increase workgroupSize to improve parallelism when only one workgroup is dispatched.

The total ReduceMean time drops from 77.79 ms to 8.98 ms on my iGPUs.
guschmue pushed a commit that referenced this issue Dec 2, 2024
BUG #22031 

The overall ConvTranspose time in the Demucs model drops from 1415.65 ms to 517.41 ms on my iGPUs.
@fs-eire
Contributor

fs-eire commented Dec 18, 2024

Closing the issue as the corresponding fixes and features have been merged and verified.

@fs-eire fs-eire closed this as completed Dec 18, 2024