
[js/web] BiasSplitGelu and BiasAdd kernels #17161

Merged: 10 commits merged into microsoft:main on Oct 3, 2023

Conversation

@dakenf (Contributor) commented Aug 15, 2023

Description

Two contrib kernels that are supposed to speed up StableDiffusion according to this doc: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md

However, there is no noticeable effect on speed or memory consumption, so I guess the only way to make it faster is to implement MultiHeadAttention, but I'm not capable of doing that right now. So I'll focus on existing PRs and on finding the JSEP kernel that produces incorrect results. It should be one of the old ones (I suspect Conv or ConvTranspose), as SD has not been generating images correctly on WebGPU since I started working on it. I hoped someone else would fix that by the time I finished with kernels/optimizations 😅
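For context, here is a rough reference of what BiasSplitGelu computes, based on the contrib-op description (add bias, split the last dimension in half, multiply the first half by GELU of the second half). The function shape and the tanh GELU approximation below are my assumptions for this sketch, not the actual JSEP kernel code:

```ts
// Reference sketch of BiasSplitGelu semantics (not the actual kernel).
// x has shape [rows, dLast]; bias has shape [dLast]; output is [rows, dLast/2].
function biasSplitGeluRef(x: Float32Array, bias: Float32Array, dLast: number): Float32Array {
  const half = dLast / 2;
  const rows = x.length / dLast;
  const out = new Float32Array(rows * half);
  // tanh approximation of GELU (an assumption for this sketch)
  const gelu = (v: number) =>
    0.5 * v * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (v + 0.044715 * v ** 3)));
  for (let r = 0; r < rows; r++) {
    for (let i = 0; i < half; i++) {
      const a = x[r * dLast + i] + bias[i];
      const b = x[r * dLast + half + i] + bias[half + i];
      out[r * half + i] = a * gelu(b); // first half gates GELU of the second half
    }
  }
  return out;
}
```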

@guschmue added the ep:WebGPU (ort-web webgpu provider) label on Aug 16, 2023
@guschmue (Contributor) commented:

/azp run ONNX Runtime Web CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@satyajandhyala (Contributor) commented:

If applicable, please add JSONC test cases to validate. For example, see onnxruntime/js/web/test/data/ops/gelu.jsonc.
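For reference, a rough sketch of the shape such a test file could take, modeled on the layout of the existing op tests. The exact field names, the com.microsoft domain entry, and the BiasAdd semantics (output = input + bias + skip) are my reading of the docs, so verify against gelu.jsonc before copying:

```jsonc
// bias-add.jsonc — hypothetical test file following the existing op-test layout
[
  {
    "name": "BiasAdd with input, bias and skip",
    "operator": "BiasAdd",
    "opset": { "domain": "com.microsoft", "version": 1 },
    "cases": [
      {
        "name": "T[1,2,4]",
        "inputs": [
          { "data": [1, 2, 3, 4, 5, 6, 7, 8], "dims": [1, 2, 4], "type": "float32" },
          { "data": [0.5, 0.5, 0.5, 0.5], "dims": [4], "type": "float32" },
          { "data": [1, 1, 1, 1, 1, 1, 1, 1], "dims": [1, 2, 4], "type": "float32" }
        ],
        "outputs": [
          { "data": [2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5], "dims": [1, 2, 4], "type": "float32" }
        ]
      }
    ]
  }
]
```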

@satyajandhyala (Contributor) commented:

> Two contrib kernels that are supposed to speed up StableDiffusion according to this doc […]

Can you share a (small) test case that demonstrates the problem?

@dakenf (Contributor, author) commented Aug 16, 2023

> Can you share a (small) test case that demonstrates the problem?

Unfortunately, no. I will make different builds with some JSEP kernels disabled to see which one causes the problem. Right now, if I run it on WebGPU, I get an image like this for any prompt; if I run it on CPU, everything works fine.

[attached screenshot: corrupted output image]
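One practical way to bisect this kind of kernel bug is to run identical feeds through the wasm and webgpu EPs and diff the outputs. A minimal sketch using onnxruntime-web's public InferenceSession API (the helper name is made up, and depending on the build at the time the webgpu EP may need the onnxruntime-web/webgpu entry point instead):

```ts
import * as ort from 'onnxruntime-web';

// Sketch: run the same feeds on the wasm and webgpu EPs and report the
// largest elementwise difference per output, to narrow down a bad kernel.
async function diffEps(model: Uint8Array, feeds: Record<string, ort.Tensor>) {
  const cpu = await ort.InferenceSession.create(model, { executionProviders: ['wasm'] });
  const gpu = await ort.InferenceSession.create(model, { executionProviders: ['webgpu'] });
  const [ref, out] = await Promise.all([cpu.run(feeds), gpu.run(feeds)]);
  for (const name of Object.keys(ref)) {
    const a = ref[name].data as Float32Array;
    const b = out[name].data as Float32Array;
    let maxDiff = 0;
    for (let i = 0; i < a.length; i++) {
      maxDiff = Math.max(maxDiff, Math.abs(a[i] - b[i]));
    }
    console.log(`${name}: max abs diff = ${maxDiff}`);
  }
}
```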

@satyajandhyala (Contributor) commented:

> Unfortunately, no. I will make different builds with some JSEP kernels disabled to see which one causes the problem. […]

@dakenf It would help if you could add steps to reproduce this problem.

@dakenf (Contributor, author) commented Aug 17, 2023

> It would help if you could add steps to reproduce this problem.

To reproduce the error you'll need a 64-bit build of the runtime, so it's quite complicated.

I've found two issues that gave incorrect results:

  1. MatMul did not support broadcasting; fixed in [js/web] MatMul broadcasting #17191 (see the shape sketch below).
  2. The Conv kernel gives incorrect results for the StableDiffusion unet. If I comment it out, everything works fine but slow. Not sure where to start, as it passes all tests.
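For context on item 1, broadcast MatMul follows NumPy semantics: every dimension except the last two broadcasts, and the last two multiply as [M,K] x [K,N]. A minimal shape-inference sketch of that rule (mine, not ORT's code; 1-D operands omitted for brevity):

```ts
// Compute the output shape of a broadcast MatMul for rank >= 2 inputs.
function matMulOutputShape(a: number[], b: number[]): number[] {
  const [m, k1] = a.slice(-2);
  const [k2, n] = b.slice(-2);
  if (k1 !== k2) throw new Error(`inner dims mismatch: ${k1} vs ${k2}`);
  const batchA = a.slice(0, -2);
  const batchB = b.slice(0, -2);
  const rank = Math.max(batchA.length, batchB.length);
  const batch: number[] = [];
  for (let i = 0; i < rank; i++) {
    // Right-align the batch dims; missing dims act as 1.
    const da = batchA[batchA.length - rank + i] ?? 1;
    const db = batchB[batchB.length - rank + i] ?? 1;
    if (da !== db && da !== 1 && db !== 1) throw new Error('not broadcastable');
    batch.push(Math.max(da, db));
  }
  return [...batch, m, n];
}

// e.g. a stacked input against a 2-D weight:
// matMulOutputShape([2, 8, 4, 16], [16, 32]) -> [2, 8, 4, 32]
```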

@fs-eire (Contributor) commented Aug 17, 2023

> To reproduce the error you'll need a 64-bit build of the runtime, so it's quite complicated. […]

Could you help to dump a set of input/output samples (also the attributes) for the incorrect Conv op so that we can take a look at it?

@dakenf (Contributor, author) commented Aug 17, 2023

> Could you help to dump a set of input/output samples (also the attributes) for the incorrect Conv op so that we can take a look at it?

Yeah. There are 65 Conv nodes, so I feel it is going to be fun.

@dakenf (Contributor, author) commented Aug 18, 2023

I could not find an easy way to dump the inputs/outputs, as they are very big, but I found a solution that fixes my issues: #17219

@guschmue (Contributor) commented:

CI is nagging: run 'npm run format'.

@dakenf (Contributor, author) commented Aug 18, 2023

> CI is nagging: run 'npm run format'.

Yup. I will also add some tests and update the PR.

@guschmue (Contributor) commented:

/azp run ONNX Runtime Web CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@satyajandhyala (Contributor) commented:

@dakenf Can you try running with the latest code? I merged fixes to ConvTranspose.

@dakenf (Contributor, author) commented Aug 25, 2023

> Can you try running with the latest code? I merged fixes to ConvTranspose.

Yeah, will do on the weekend. Also, this week Emscripten got a release with all the required changes (except 64-bit threads), so I'll clean up the 64-bit PR.

There's still a big issue with the VRAM limit in Chrome: it's ~16 GB on Windows, and the StableDiffusion unet eats 10 GB+, so it's not possible to fit the unet and the VAE into VRAM. I'm trying to solve it with the Attention + MultiHeadAttention ops, and usage went down to ~5 GB, but it does not give correct results. Most likely I've messed up the indices for packed weights or batched gemm/matmul. If I can't fix it in a reasonable time, I'll make a draft PR and ask for your help.

@dakenf (Contributor, author) commented Aug 30, 2023

> Can you try running with the latest code? I merged fixes to ConvTranspose.

It seems fine now. However, I'm experiencing some other issues, like getting weird images completely unrelated to the prompt. It happens for both the wasm and webgpu EPs, though, so I need to check whether the problem is in the model optimizer, the new Emscripten release, or my DNA.

[attached image]

@fs-eire (Contributor) commented Aug 30, 2023

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline

@fs-eire (Contributor) commented Aug 30, 2023

/azp run Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 7 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 10 pipeline(s).

@dakenf (Contributor, author) commented Sep 1, 2023

> However, I'm experiencing some other issues, like getting weird images completely unrelated to the prompt. It happens for both the wasm and webgpu EPs.

This is a bug in the Python fusion optimizer. It breaks the text-encoder model, so replacing it with the original one fixes the issue. However, it is quite weird, because when I use Python code to generate images with that broken model, it works fine.

Maybe I'll investigate more and file a bug later. I want to focus on optimizing the InstanceNorm/LayerNorm kernels, as InstanceNorm has 3 for-loops from 0 to 40k in each invocation, and then finally revisit the Attention kernel with a fresh look.

@dakenf (Contributor, author) commented Sep 1, 2023

BTW, if you are struggling with the choice of the next op to implement, can it be NhwcConv?

@dakenf (Contributor, author) commented Sep 2, 2023

With MultiHeadAttention (without packed weights), Attention with some vec2/vec4 optimizations and in-place SoftMax, and vec2/vec4 InstanceNorm, I've gotten the unet down to 5.6 GB of VRAM and ~2.5 s per step. A rough sketch of the vec4 idea is shown below.

I will apply the same optimizations to LayerNorm/SkipLayerNorm and implement GroupNorm to see if that speeds things up. Maybe with GroupNorm the VAE will use less than 7.5 GB of VRAM.

So you can expect a few more PRs next week with all this stuff.
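For readers following along, here is a hypothetical illustration of the vec4 trick, written the way JSEP kernels assemble WGSL inside TypeScript template strings: viewing the input as array<vec4<f32>> lets each loop iteration reduce four elements at once. The variable names (uniforms.sizeDiv4, channelOffset) are made up for this sketch; it is not the actual kernel code:

```ts
// Sketch of a vec4-vectorized mean reduction, in the JSEP style of
// building WGSL as a TypeScript template string (names are hypothetical).
const reduceSnippet = /* wgsl */ `
  var sum = vec4<f32>(0.0);
  for (var i = 0u; i < uniforms.sizeDiv4; i++) {
    sum += input[channelOffset + i]; // input is array<vec4<f32>>: 4 floats per load
  }
  let mean = (sum.x + sum.y + sum.z + sum.w) / f32(uniforms.size);
`;
```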

[attached video: 2023-09-03.02-01-07.mp4]

@dakenf (Contributor, author) commented Sep 4, 2023

@fs-eire I finally got the shader-f16 extension working with the latest Chrome and --enable-dawn-features=allow_unsafe_apis.

Since almost all ops use the indices helper, it will be an easy change (however, I've seen some hardcoded `var x = fp32(0)`). If that gives a 2x boost, and if I manage to implement flash attention with packed weights, it all might go down from 2.5 s to less than 1 s. Feels like Christmas; will run tests tomorrow.

[attached screenshot]
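For anyone trying to reproduce the f16 setup, a minimal sketch of detecting and requesting the feature with the standard WebGPU API (at the time of this thread, Chrome also needed the --enable-dawn-features=allow_unsafe_apis flag for the feature to be exposed):

```ts
// Request the shader-f16 feature only when the adapter reports it.
const adapter = await navigator.gpu.requestAdapter();
if (adapter?.features.has('shader-f16')) {
  const device = await adapter.requestDevice({ requiredFeatures: ['shader-f16'] });
  // WGSL compiled on this device may then start with `enable f16;`
  // and use f16 / vec4<f16> types.
}
```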

@fs-eire (Contributor) commented Sep 8, 2023

The part with the two operators is good. Could you please revert the changes that add test support for several other EPs? That change is helpful but should be a separate PR.

Please mention me after this change; I will try to kick off the CI ASAP.

@dakenf (Contributor, author) commented Sep 12, 2023

@fs-eire I've reverted the test-runner changes and added fp16 support.

@fs-eire (Contributor) commented Sep 12, 2023

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline

@fs-eire (Contributor) commented Sep 12, 2023

/azp run Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 7 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 10 pipeline(s).

@fs-eire (Contributor) commented Oct 2, 2023

/azp run Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@fs-eire (Contributor) commented Oct 2, 2023

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@fs-eire (Contributor) commented Oct 2, 2023

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 10 pipeline(s).

@guschmue merged commit d0519a7 into microsoft:main on Oct 3, 2023
63 of 65 checks passed
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request on Mar 22, 2024 (the commit message repeats the PR description above).
siweic0 pushed a commit to siweic0/onnxruntime-web that referenced this pull request on May 9, 2024 (the commit message likewise repeats the PR description).