[js/web] BiasSplitGelu and BiasAdd kernels #17161
Conversation
/azp run ONNX Runtime Web CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
If applicable, please add JSONC test cases to validate. For example, see onnxruntime/js/web/test/data/ops/gelu.jsonc
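For reference, a hypothetical sketch of what a JSONC test entry for the new BiasAdd kernel could look like, loosely modeled on the layout of the gelu.jsonc file referenced above. The field names, the opset entry, and the data values here are illustrative and should be checked against the existing test files; the expected output is simply the elementwise X + bias + skip.

```jsonc
// Hypothetical sketch, not copied from the repo; check gelu.jsonc for the exact schema.
[
  {
    "name": "BiasAdd",
    "operator": "BiasAdd",
    "opset": { "domain": "com.microsoft", "version": 1 },
    "attributes": [],
    "cases": [
      {
        "name": "X[1,2,4] + bias[4] + skip[1,2,4]",
        "inputs": [
          { "data": [1, 2, 3, 4, 5, 6, 7, 8], "dims": [1, 2, 4], "type": "float32" },
          { "data": [1, 1, 1, 1], "dims": [4], "type": "float32" },
          { "data": [1, 2, 3, 4, 5, 6, 7, 8], "dims": [1, 2, 4], "type": "float32" }
        ],
        "outputs": [
          // elementwise X + bias + skip
          { "data": [3, 5, 7, 9, 11, 13, 15, 17], "dims": [1, 2, 4], "type": "float32" }
        ]
      }
    ]
  }
]
```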
Can you share a (small) test case that demonstrates the problem?
@dakenf It will help if you can add steps to reproduce this problem.
To reproduce the error you'll need to make a 64-bit build of the runtime, so it's quite complicated. I've found two issues that gave incorrect results.
Could you help to dump a set of input/output samples (also the attributes) for the incorrect conv op so that we can take a look at it?
Yeah, there are 65 conv nodes so I feel it is going to be fun.
I could not find an easy way to dump inputs/outputs as they are very big, but found a solution to fix my issues: #17219
CI is nagging, run 'npm run format'
Yup, will also add some tests and update the PR.
/azp run ONNX Runtime Web CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
@dakenf Can you try running with the latest code? I merged fixes to ConvTranspose.
Build fails - some imports have changed underneath:
Yeah, will do on the weekend. And this week emscripten got a release with all required changes (except 64-bit threads), so I'll clean up the 64-bit PR. There's still a big issue with the VRAM limit in Chrome; it's ~16 GB on Windows. And the StableDiffusion unet eats 10 GB+, so it's not possible to fit the unet and VAE into VRAM. I'm trying to solve it with the Attention+MultiHeadAttention ops and it went down to ~5 GB, but it does not give correct results. Most likely I've messed up the indices for packed weights or batched gemm/matmul. If I can't fix it in a reasonable time, I will make a draft PR and ask for your help.
…with DML, CUDA and TensorRT
It seems fine now. However, I'm experiencing some other issues, like getting weird images completely unrelated to the prompt. But it happens for both the wasm and webgpu EPs. Need to check whether the problem is in the model optimizer, the new emscripten release, or my DNA.
/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline
/azp run Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline
Azure Pipelines successfully started running 7 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
This is a bug in the python fusion optimizer. It breaks the text encoder model, so replacing it with the original one fixes the issue. However, it is quite weird, because when I use python code to generate images with that broken model it works fine. Maybe I will investigate more and file a bug later. Want to focus on optimizing the InstanceNorm/LayerNorm kernels, as InstanceNorm has 3
BTW, if you are struggling with the choice of the next OP to implement, can it be NhwcConv?
With MultiHeadAttention (without packed weights), Attention with some vec2/vec4 optimizations and in-place SoftMax, and InstanceNorm vec2/vec4, I've narrowed the unet down to 5.6 GB VRAM and ~2.5 sec for one step. Will apply the same optimizations to LayerNorm/SkipLayerNorm and implement GroupNorm to see if it speeds things up. Maybe with GroupNorm the VAE will use less than 7.5 GB of VRAM. So you can expect a few more PRs next week with all of this. 2023-09-03.02-01-07.mp4
@fs-eire finally got the shader-f16 extension working with the latest Chrome. Since almost all OPs use the indices helper, it will be an easy change (however I've seen some hardcoded
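For context on what enabling this looks like from the application side, here is a minimal sketch of standard WebGPU feature detection for shader-f16. This is plain WebGPU API usage for illustration, not the JSEP backend code; WGSL shaders additionally need `enable f16;` at the top, and the types assume @webgpu/types is available.

```typescript
// Minimal sketch: request a WebGPU device with shader-f16 when the adapter supports it.
async function requestDeviceWithF16(): Promise<GPUDevice> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    throw new Error('WebGPU is not available');
  }
  const requiredFeatures: GPUFeatureName[] = [];
  if (adapter.features.has('shader-f16')) {
    // Only request the feature when supported; shaders can then declare `enable f16;`.
    requiredFeatures.push('shader-f16');
  }
  return adapter.requestDevice({ requiredFeatures });
}
```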
The part with the 2 operators is good. Could you please revert the changes adding test support for several other EPs? That change is helpful but should be separated. Please mention me after this change and I will try to kick off the CI ASAP.
@fs-eire I've reverted the test runner changes and added fp16 support.
/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline
/azp run Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline
Azure Pipelines successfully started running 7 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
/azp run Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline
/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline
Azure Pipelines successfully started running 8 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
Description
Two contrib kernels that are supposed to speed up StableDiffusion according to this doc: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md
However, there is no noticeable effect on speed or memory consumption. So I guess the only way to make it faster is to implement MultiHeadAttention, but I'm not capable of doing that right now. So I'll focus on existing PRs and finding the JSEP kernel that produces incorrect results. It should be one of the old ones (I suspect Conv or ConvTranspose), as SD has not been generating images correctly on webgpu since I started working on it. I hoped someone else would fix that by the time I finish with kernels/optimizations 😅
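For intuition about what the two ops compute (per the ONNX Runtime contrib-op descriptions, not the actual WebGPU shader code in this PR): BiasAdd is out = X + bias + skip with X/skip of shape [N, S, C] and bias of shape [C]; BiasSplitGelu adds a bias, splits the last dimension into two halves a and b, and returns a * Gelu(b). A naive CPU reference sketch in TypeScript, ignoring shape validation:

```typescript
// Naive reference of the two contrib ops, for intuition only.
// The kernels added in this PR run as WebGPU shaders; this is plain CPU code.

// Abramowitz & Stegun 7.1.26 approximation of erf, max error ~1.5e-7.
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const t = 1 / (1 + 0.3275911 * Math.abs(x));
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  return sign * (1 - poly * Math.exp(-x * x));
}

const gelu = (v: number): number => 0.5 * v * (1 + erf(v / Math.SQRT2));

// BiasAdd: out = x + bias + skip; x and skip are [N, S, C] flattened row-major, bias is [C].
function biasAdd(x: Float32Array, bias: Float32Array, skip: Float32Array): Float32Array {
  const c = bias.length;
  const out = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) {
    out[i] = x[i] + bias[i % c] + skip[i];
  }
  return out;
}

// BiasSplitGelu: add bias, split the last dimension into halves a|b, return a * gelu(b).
// x is [rows, c] flattened row-major, bias is [c]; the output last dimension is c / 2.
function biasSplitGelu(x: Float32Array, bias: Float32Array): Float32Array {
  const c = bias.length;
  const half = c / 2;
  const rows = x.length / c;
  const out = new Float32Array(rows * half);
  for (let r = 0; r < rows; r++) {
    for (let j = 0; j < half; j++) {
      const a = x[r * c + j] + bias[j];
      const b = x[r * c + half + j] + bias[half + j];
      out[r * half + j] = a * gelu(b);
    }
  }
  return out;
}
```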