
How to profile my super slow webNN implementation (sd-turbo img2img) #40

Closed
eyaler opened this issue Sep 23, 2024 · 4 comments

@eyaler

eyaler commented Sep 23, 2024

I am trying to make a webNN example for sd-turbo image-to-image: https://github.com/eyaler/webnn-developer-preview/blob/main/demos/sd-turbo/index.js
I used the vae encoder from: https://huggingface.co/schmuell/sd-turbo-ort-web/ without any changes to the model. I also tried other ones, but this worked for me where others did not.
I probably didn't do the latents sampling as intended, but it is working. You can try it here: https://eyaler.github.io/webnn-developer-preview/demos/sd-turbo/ (you need to tick the image-to-image checkbox; it uses a default input image I provided, sorry).

My main issue is that the encoder takes 100x-300x as long as the original flow without the encoder (40-120 sec compared to 400 ms). You can compare by unticking the image-to-image checkbox. This suggests to me that something is very wrong with the way I hooked up the model, or that I missed some basic adjustment step. I would be grateful for any insights into potentially obvious reasons for the large discrepancy, and how to approach debugging this.

@ibelem
Contributor

ibelem commented Sep 29, 2024

@eyaler Great work and idea for image-to-image usage!

The inputs of the vae encoder you are using:

name: sample
tensor: float16[batch_size, num_channels, height, width]

But the override in your code is freeDimensionOverrides: { batch: 1, channels: 3, height: 512, width: 512 },
in https://github.com/eyaler/webnn-developer-preview/blob/main/demos/sd-turbo/index.js#L316

Could you first try freeDimensionOverrides: { batch_size: 1, num_channels: 3, height: 512, width: 512 }, instead?

https://huggingface.co/schmuell/sd-turbo-ort-web/blob/main/vae_encoder/config.json
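A minimal sketch of the corrected session options, assuming the WebNN execution provider configuration used by the demo. The override keys must match the model's declared symbolic dimension names exactly; ONNX Runtime silently ignores keys it does not recognize:

```javascript
// Corrected session options: the override keys match the model's declared
// input, float16[batch_size, num_channels, height, width].
// (executionProviders shown here is an assumption modeled on the demo.)
const sessionOptions = {
  executionProviders: [{ name: 'webnn', deviceType: 'gpu' }],
  freeDimensionOverrides: { batch_size: 1, num_channels: 3, height: 512, width: 512 },
};

// Sanity check: every symbolic dimension is pinned to an integer.
const symbolicDims = ['batch_size', 'num_channels', 'height', 'width'];
const allFixed = symbolicDims.every(
  d => Number.isInteger(sessionOptions.freeDimensionOverrides[d])
);
console.log(allFixed); // true
```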

CC @Honry

@eyaler
Author

eyaler commented Sep 29, 2024

@ibelem Oh wow, fixing the argument names makes it 100x faster! Thanks!!

However, now that the optimization is kicking in I have saturation issues that I guess are related to float16 casting. My next steps:

  1. While the large performance hit makes it clear that optimizations were not working, I am not sure why the free dimensions would be connected with casting issues. Maybe everything stays on the CPU? I played with graphOptimizationLevel, and it seems that not completely fixing the free dimensions is equivalent to disabling all graph optimizations, which surprised me. There are legitimate use cases where dimensions are not fixed. Are these cases not ready for WebNN on GPU?
  2. I will try the instance normalization cast fix mentioned in another issue, and if it works I will try to make a script, as it seems to be a rather verbose and non-trivial process.
  3. I plan to open an issue on onnxruntime to suggest that wrong argument names in the free dimension overrides raise an error, or at least a warning. I never want my models to run 100x slower due to a typo or wrong name, and I could not find any warning.
  4. Also, initial tests show that the original demo pinned to onnxruntime 1.19.0-dev.20240804-ee2fe87e2d is significantly faster than both 1.19.2 stable and 1.20.0-dev.20240928-1bda91fc57. Is this expected? If not, I will follow up with an onnxruntime issue.
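Until such a warning exists in ONNX Runtime, a hypothetical pre-flight guard (sketched below, not part of any library) can catch the typo before session creation by comparing the override keys against the model's symbolic dimension names:

```javascript
// Hypothetical guard: fail loudly if the freeDimensionOverrides keys do not
// match the model's symbolic dimension names. ONNX Runtime itself silently
// ignores unknown keys, which is what made the batch/batch_size typo cost a
// silent 100x slowdown.
function checkDimensionOverrides(overrides, symbolicDims) {
  const unknown = Object.keys(overrides).filter(k => !symbolicDims.includes(k));
  const missing = symbolicDims.filter(d => !(d in overrides));
  if (unknown.length || missing.length) {
    throw new Error(
      `freeDimensionOverrides mismatch: unknown keys [${unknown}], missing dims [${missing}]`
    );
  }
}

// The typo from the demo now throws instead of silently degrading:
let caught = false;
try {
  checkDimensionOverrides(
    { batch: 1, channels: 3, height: 512, width: 512 },
    ['batch_size', 'num_channels', 'height', 'width']
  );
} catch (e) {
  caught = true;
}
console.log(caught); // true
```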

@ibelem
Contributor

ibelem commented Sep 30, 2024

@eyaler Great to know the perf improved!

The WebNN EP needs fixed integer values specified via freeDimensionOverrides for all symbolic dimensions (https://onnxruntime.ai/docs/tutorials/web/env-flags-and-session-options.html#freedimensionoverrides); otherwise the optimizations will not be applied.

> I plan to open an issue on onnx-runtime to suggest that wrong arguments names in free dimensions override will raise an error or at least a warning.

Please file a bug to microsoft/onnxruntime :)

> Initial tests show that the fixed 1.19.0-dev.20240804-ee2fe87e2d onnx-runtime version of the original demo is significantly faster than both 1.19.2stable and 1.20.0-dev.20240928-1bda91fc57.

Please use a single model to run the tests and provide detailed performance data across these ORT dists, so we can check what happened in the newer ORT versions. Thanks a lot!

@eyaler
Author

eyaler commented Oct 3, 2024

@ibelem Thanks!

For throwing an error on bad names and emphasizing in the docs the need to fix all dimensions, I opened microsoft/onnxruntime#22300.

My saturation issue has been solved by fixing the VAE encoder instance normalization as discussed, and I put the fixed model here: vae encoder. I made a helper script based on onnx2text here: https://github.com/eyaler/webnn-developer-preview/blob/main/demos/sd-turbo/fix_instance_norm.py. It is perhaps not generic enough, but it may help point people in the right direction.

I am actually seeing ~10x run-time inconsistencies even with repeated inference on the same library version. Specifically, I see the VAE encoder or UNET getting slower after hitting the generate button a few times in my web demo. I will investigate further.
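To quantify that variance, a minimal timing wrapper (a sketch with hypothetical names; in the demo, runFn would wrap something like session.run(feeds)) can log min/max across repeated runs, which helps distinguish a monotonic slowdown (e.g. memory pressure) from plain noise:

```javascript
// Hypothetical per-run timing helper. performance.now() is global in browsers
// and in Node >= 16; runFn is any async function wrapping the inference call.
async function timeRuns(label, runFn, repeats = 5) {
  const times = [];
  for (let i = 0; i < repeats; i++) {
    const t0 = performance.now();
    await runFn();
    times.push(performance.now() - t0);
  }
  const min = Math.min(...times).toFixed(1);
  const max = Math.max(...times).toFixed(1);
  console.log(`${label}: min ${min} ms, max ${max} ms over ${repeats} runs`);
  return times;
}
```

A large and growing max relative to min across generate clicks would point at state accumulating between runs rather than at the ORT version itself.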

eyaler closed this as completed Oct 3, 2024