[Web] Stable Diffusion Inpainting FP16 UNET outputs NANs #22983

jdp8 · 2024-12-02T19:13:56Z

Describe the issue

I converted stable-diffusion-inpainting and stable-diffusion-2-inpainting to FP16 ONNX format using both the optimum-cli export command and this script. The models work fine in Python ONNX Runtime, but in ONNX Runtime Web the UNET outputs NaNs for a reason I have not been able to identify.

The code running the models in the browser was translated to JavaScript from the pipeline script and the ONNX pipeline script. I'm fairly sure that my code is correct, but I could be wrong. The input shapes are as expected, since ORT Web does not complain about them.

Does anybody have any idea what could be causing these NaNs in the UNET? Could this be an issue with the model conversion or with my code? Any assistance would be greatly appreciated, as I have tried pretty much everything I can think of.

Additional Context

There is a Nearest Neighbor Interpolation resize done in this pipeline which I achieved using OpenCV.js like so:

// cv.matFromArray takes (rows, cols, type, data), so height comes first.
const maskCV = cv.matFromArray(height, width, cv.CV_32FC1, maskCondition);
const interpolatedMask = new cv.Mat();
// cv.Size takes (width, height).
const dSize = new cv.Size(width / this.vaeScaleFactor, height / this.vaeScaleFactor);
cv.resize(maskCV, interpolatedMask, dSize, 0, 0, cv.INTER_NEAREST);
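For comparison, the same nearest-neighbor downscale can be sketched in plain JavaScript without OpenCV.js. This is a minimal sketch, not the code from the thread: `resizeNearest` is a hypothetical helper, the mask is assumed to be a single-channel, row-major `Float32Array`, and the `floor(dst index * scale)` mapping is an assumption that may differ from a given library's rounding at the edges.

```javascript
// Nearest-neighbor resize of a single-channel, row-major mask.
// `resizeNearest` is a hypothetical helper, not an OpenCV.js or ORT Web API.
function resizeNearest(src, srcWidth, srcHeight, dstWidth, dstHeight) {
  if (src.length !== srcWidth * srcHeight) {
    throw new Error("src length does not match srcWidth * srcHeight");
  }
  const dst = new Float32Array(dstWidth * dstHeight);
  const scaleX = srcWidth / dstWidth;
  const scaleY = srcHeight / dstHeight;
  for (let y = 0; y < dstHeight; y++) {
    // Map the destination row back to a source row, clamped to the last row.
    const srcY = Math.min(Math.floor(y * scaleY), srcHeight - 1);
    for (let x = 0; x < dstWidth; x++) {
      const srcX = Math.min(Math.floor(x * scaleX), srcWidth - 1);
      dst[y * dstWidth + x] = src[srcY * srcWidth + srcX];
    }
  }
  return dst;
}

// Example: a 4x4 mask reduced to 2x2 keeps the top-left pixel of each 2x2 cell.
const mask4x4 = Float32Array.from([
  1, 1, 0, 0,
  1, 1, 0, 0,
  0, 0, 1, 1,
  0, 0, 1, 1,
]);
const mask2x2 = resizeNearest(mask4x4, 4, 4, 2, 2); // [1, 0, 0, 1]
```

Because width and height are passed explicitly, a swapped pair fails the length check only when the totals differ, so non-square inputs are still worth testing by eye.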

To reproduce

To reproduce the issue quickly, it should be enough to load the UNET of any of my converted models in an Inference Session and pass an object of 3 random inputs to the Session. The object consists of the following entries:

{
sample: shape [2, 9, 64, 64] | type float32,
timestep: shape [1] | type float32,
encoder_hidden_states: shape [2, 77, 1024] | type float32
}
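
A minimal sketch of building those three random inputs and counting NaNs in an output buffer is shown here. The helper names (`randomData`, `countNaNs`) and the timestep value are assumptions for illustration. The records are plain objects; in ORT Web each one would be wrapped as `new ort.Tensor(type, data, dims)` before calling `session.run(feeds)`, and the check assumes the UNET output comes back as float32 like the inputs.

```javascript
// Hypothetical helper: fill a Float32Array for a tensor of the given shape
// with uniform values in [-1, 1).
function randomData(dims) {
  const size = dims.reduce((a, b) => a * b, 1);
  const data = new Float32Array(size);
  for (let i = 0; i < size; i++) {
    data[i] = Math.random() * 2 - 1;
  }
  return data;
}

// Shapes and types copied from the input list in the issue.
const inputSpecs = {
  sample: { type: "float32", dims: [2, 9, 64, 64] },
  timestep: { type: "float32", dims: [1] },
  encoder_hidden_states: { type: "float32", dims: [2, 77, 1024] },
};

// Plain { type, dims, data } records, one per model input.
const feeds = {};
for (const [name, spec] of Object.entries(inputSpecs)) {
  feeds[name] = { ...spec, data: randomData(spec.dims) };
}
// Assumption: use an arbitrary positive timestep instead of a random value.
feeds.timestep.data[0] = 999;

// Count NaN entries in a numeric buffer such as the UNET output data.
function countNaNs(data) {
  let count = 0;
  for (const v of data) {
    if (Number.isNaN(v)) count++;
  }
  return count;
}
```

A non-zero `countNaNs` result on the output of a single run with finite inputs would point at the model or the runtime rather than at the surrounding pipeline code.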

Some of the models I have converted are:

ONNX Stable Diffusion Inpainting 2 FP16.
ONNX Stable Diffusion Inpainting 2 FP16 with opset 20.
ONNX Stable Diffusion Inpainting 2 FP16 with upcasted Attention.
Many more models which can be found in my HuggingFace Repo.

Urgency

Somewhat urgent.

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.20.0

Execution Provider

'webgpu' (WebGPU)

fs-eire · 2024-12-02T23:51:45Z

@jdp8 Thank you for reporting the issue. Could you please share the repro steps (including the JavaScript code)? A jsfiddle link would also be good.

jdp8 · 2024-12-06T15:09:42Z

@fs-eire Sorry for the delay. I made a simple jsfiddle that runs the Stable Diffusion Inpainting model for 1 step and prints the UNET output, which is a Tensor filled with NaNs. I stopped at that point so as not to complicate the code further. The code was heavily inspired by the SD Turbo ORT Web example code.

Repro Steps

Convert a Stable Diffusion Inpainting model using either of the conversion scripts that I mentioned in the initial post. I have many converted inpainting models which can be found in my HuggingFace Repo in case you want to use them.
Translate the Stable Diffusion Inpainting code from Python to JavaScript (this is in the jsfiddle). The code can be taken either from here or here but I took more inspiration from the ONNX pipeline.

Other Info

There were challenges due to JavaScript not having certain features such as array broadcasting and a Nearest Neighbor Interpolation. The array broadcasting was implemented in the getMaskedImage() function and OpenCV.js was used for the Nearest Neighbor Interpolation.
The base image and mask image are already included in the code as base64 strings. The images and the example that I'm trying to run in the browser were taken from here (the first one).
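
The broadcasting step mentioned above can be sketched as follows. This is not the `getMaskedImage()` from the jsfiddle, which is not reproduced in this thread; `applyMaskBroadcast` is a hypothetical stand-in. It assumes a CHW image buffer, a single-channel mask with values in [0, 1], and the convention that mask values of 0.5 or more mark the region to repaint.

```javascript
// Hypothetical stand-in for the broadcasting step; illustration only.
// image: Float32Array in CHW layout, length channels * height * width.
// mask:  Float32Array with one channel, length height * width, values in [0, 1].
function applyMaskBroadcast(image, mask, channels, height, width) {
  const plane = height * width;
  if (image.length !== channels * plane || mask.length !== plane) {
    throw new Error("image or mask length does not match the given shape");
  }
  const out = new Float32Array(image.length);
  for (let c = 0; c < channels; c++) {
    for (let i = 0; i < plane; i++) {
      // Assumed convention: mask >= 0.5 marks the region to repaint, so
      // those pixels are zeroed and the rest are kept.
      const keep = mask[i] < 0.5 ? 1 : 0;
      // The same mask plane is reused for every channel (the "broadcast").
      out[c * plane + i] = image[c * plane + i] * keep;
    }
  }
  return out;
}

// Example: 2 channels, 1x2 pixels; the second pixel is masked out.
const maskedExample = applyMaskBroadcast(
  Float32Array.from([0.2, 0.4, 0.6, 0.8]),
  Float32Array.from([0, 1]),
  2, 1, 2
);
```

The explicit length check is there because a shape mismatch in a hand-written broadcast silently reads `undefined`, which turns into NaN once it is multiplied.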

Let me know if you have any questions or if I left something out. Thank you!

fs-eire · 2024-12-10T00:26:59Z

I am investigating this issue.

fs-eire · 2024-12-12T19:35:40Z

#23085 should have fixed the NaN issue, but I'm not sure whether there are other issues that block SD from running.

@jdp8 Please allow one or two days for the pipeline to publish a new dev package and try it again.

jdp8 · 2024-12-13T19:04:03Z

@fs-eire Thank you! I'll try it tomorrow and let you know.

jdp8 · 2024-12-17T20:51:36Z

@fs-eire Sorry for the late response. Just tried it and the UNET is no longer outputting NaNs. Thank you so much!
I still have some NaNs appearing after the UNET is called (this happens after roughly 8 steps), but I'll look into that myself; it's probably an error in my code.

Thank you once again!
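
One way to narrow down NaNs that only appear after several denoising steps is to summarize each buffer per step and look for the first step where values stop being finite or grow very large. This is a sketch under assumptions: `summarize` and `checkStep` are hypothetical helpers, the buffers are assumed to be float32, and the "half of the FP16 maximum" threshold is an arbitrary warning level, not a rule from ONNX Runtime.

```javascript
// Hypothetical debugging helper: summarize a latent or noise-prediction buffer.
function summarize(data) {
  let min = Infinity;
  let max = -Infinity;
  let firstNonFinite = -1;
  let finiteCount = 0;
  for (let i = 0; i < data.length; i++) {
    const v = data[i];
    if (!Number.isFinite(v)) {
      if (firstNonFinite === -1) firstNonFinite = i;
      continue; // keep NaN and Infinity out of the min/max
    }
    finiteCount++;
    if (v < min) min = v;
    if (v > max) max = v;
  }
  return { min, max, firstNonFinite, finiteCount };
}

// 65504 is the largest finite FP16 value; magnitudes approaching it suggest
// an FP16 model may be close to overflowing.
const FP16_MAX = 65504;

function checkStep(step, data) {
  const s = summarize(data);
  const peak = s.finiteCount === 0 ? 0 : Math.max(Math.abs(s.min), Math.abs(s.max));
  // Arbitrary warning threshold (assumption): half of the FP16 maximum.
  return { step, ...s, nearOverflow: peak > FP16_MAX / 2 };
}
```

Calling `checkStep(i, output.data)` after each UNET run, and again on the latents after the scheduler update, would show whether the first non-finite value comes from the model or from the JavaScript arithmetic around it.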

### Description

Fix a bug caused by potential out-of-bound reads of `W` in the Conv2DMatMul shader.

### Motivation and Context

Fixes #22983
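
As a rough illustration of this class of bug, the sketch below shows a tiled matrix multiply in plain JavaScript where the last tile runs past the end of the data unless the reads are guarded. It is not the Conv2DMatMul shader and not the change made in #23085; `tiledMatMul`, `TILE`, and the `guardReads` flag are hypothetical. In JavaScript an out-of-range typed-array read yields `undefined`, which becomes NaN in arithmetic; on a GPU the out-of-range value is implementation-dependent and is not guaranteed to be a harmless zero.

```javascript
// Illustration only: not the Conv2DMatMul shader or the change in #23085.
// Multiply A (m x k) by W (k x n), walking k in fixed-size tiles.
const TILE = 4;

function tiledMatMul(a, w, m, k, n, guardReads) {
  const out = new Float32Array(m * n);
  for (let row = 0; row < m; row++) {
    for (let col = 0; col < n; col++) {
      let acc = 0;
      for (let t = 0; t < k; t += TILE) {
        for (let i = 0; i < TILE; i++) {
          const kk = t + i;
          const inBounds = kk < k;
          // Past the end of the k dimension: a guarded read contributes 0,
          // an unguarded one uses whatever the index happens to return.
          const aVal = guardReads && !inBounds ? 0 : a[row * k + kk];
          const wVal = guardReads && !inBounds ? 0 : w[kk * n + col];
          acc += aVal * wVal;
        }
      }
      out[row * n + col] = acc;
    }
  }
  return out;
}

// k = 3 is not a multiple of TILE, so the last tile reads one element too far.
const aData = Float32Array.from([1, 2, 3]); // 1 x 3
const wData = Float32Array.from([4, 5, 6]); // 3 x 1
const guarded = tiledMatMul(aData, wData, 1, 3, 1, true);
const unguarded = tiledMatMul(aData, wData, 1, 3, 1, false);
```

Here `guarded[0]` is 32 while `unguarded[0]` is NaN, which mirrors how a single stray read can poison an entire output element.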

jdp8 added the platform:web label Dec 2, 2024

github-actions bot added the api:Javascript and ep:WebGPU labels Dec 2, 2024

jdp8 changed the title ~~Stable Diffusion Inpainting FP16 UNET outputs NANs [Web]~~ [Web] Stable Diffusion Inpainting FP16 UNET outputs NANs Dec 2, 2024

fs-eire mentioned this issue Dec 11, 2024

[js/webgpu] fix Conv2DMatMul shader's out-of-bound read #23085

Merged

fs-eire closed this as completed in #23085 Dec 12, 2024

fs-eire closed this as completed in 01539ee Dec 12, 2024

fs-eire reopened this Dec 12, 2024
