[Performance] Remove some transpose ops in layout conversion. #18128
Comments
@skottmckay Do you want to look at this issue? Or could you guide me on where to make this change, since I am not familiar with ORT's layout transformation?
Can you set the optimized model path in the session options and see what that model looks like? It is saved after the optimizations are done, so it should show where the Transpose nodes got stuck. https://onnxruntime.ai/docs/api/c/struct_ort_api.html#ad238e424200c0f1682947a1f342c39ca
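For reference, a minimal Python sketch of dumping the post-optimization graph (file names and the provider choice are placeholders):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Save the graph as it looks after all optimizations (including layout
# transformation) so the remaining Transpose nodes can be inspected.
so.optimized_model_filepath = "sd-vae-decoder.optimized.onnx"  # placeholder output path

sess = ort.InferenceSession("sd-vae-decoder.onnx", so, providers=["CPUExecutionProvider"])
```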
@skottmckay, I tried it with the CPU and CUDA EPs (so the result might differ from WebGPU). Optimized graph for the CPU EP: see onnxruntime/onnxruntime/core/graph/contrib_ops/internal_nhwc_onnx_schemas.cc, lines 107 to 109 at 2eeafc3.
I think the Transpose nodes in "Transpose --> QuickGelu --> Transpose" should be removed, since QuickGelu is an element-wise operator. Also, the other Transpose node can be constant-folded in the above graph.
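To back up the element-wise argument, here is a small numpy check; the x * sigmoid(alpha * x) form with alpha = 1.702 is an assumption about the com.microsoft.QuickGelu contrib op, and the shapes are placeholders:

```python
import numpy as np

def quick_gelu(x, alpha=1.702):
    # Assumed QuickGelu formulation: x * sigmoid(alpha * x).
    return x / (1.0 + np.exp(-alpha * x))

x = np.random.randn(1, 64, 64, 320).astype(np.float32)  # NHWC activation
to_nchw, to_nhwc = (0, 3, 1, 2), (0, 2, 3, 1)

# An element-wise op commutes with Transpose, so the pattern
# Transpose -> QuickGelu -> Transpose(inverse) equals QuickGelu alone,
# and both Transpose nodes can be dropped.
assert np.allclose(
    np.transpose(quick_gelu(np.transpose(x, to_nchw)), to_nhwc),
    quick_gelu(x),
)
```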
@tianleiwu The CPU EP prefers NCHW. The OP is using JSEP, which prefers NHWC (bringing the layout transformer into play). On CPU, InstanceNormalization uses NCHWc if certain AVX instructions are available. The NCHWc transformer was added a long time ago, when we had no NHWC support, and I don't know whether NHWC should be preferred over NCHWc, either in general or when the AVX instructions are available.

The transpose optimizer can't push Transpose nodes through operators it knows nothing about. We can add a handler for QuickGelu fairly easily, though, using the other element-wise ops as an example. It would be added here, probably using HandleSimpleNode or HandleSimpleNodeBroadcast: onnxruntime/onnxruntime/core/optimizer/transpose_optimization/ort_transpose_optimization.cc, lines 136 to 151 at d9695de.
The issue with this model is the Reshape nodes converting between 4-D and 3-D values. For example, after the two initial Conv nodes there is a Reshape from 4-D to 3-D prior to the InstanceNormalization. That prevents the Transpose between NCHW and NHWC from being pushed through, because the rank changes, which invalidates the transpose.

Is there a reason for the Reshape around the InstanceNormalization nodes? According to the ONNX spec the input can be 4-D and normalization is per-channel: https://github.com/onnx/onnx/blob/main/docs/Operators.md#InstanceNormalization. The ORT implementation would flatten the inner dimensions from the look of it, so potentially the Reshape nodes aren't changing the results but are blocking transpose optimization.
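To illustrate the rank-change point with plain numpy (the shapes and the reshape target below are placeholders, not read from the model):

```python
import numpy as np

x = np.random.randn(1, 512, 8, 8).astype(np.float32)  # NCHW activation (placeholder shape)
perm = (0, 2, 3, 1)                                    # NCHW -> NHWC

# A 4-D Transpose can be pushed through a rank-preserving op by rewriting that op,
# but once a Reshape drops the tensor to 3-D, the 4-D perm no longer applies; the
# optimizer would have to rewrite both the perm and the Reshape target shape.
reshaped = x.reshape(1, 512, -1)
try:
    np.transpose(reshaped, perm)
except ValueError as err:
    print("4-D perm does not apply to the 3-D Reshape output:", err)
```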
I got this model from @dakenf and the fp16 one from @guschmue. Both have the same pattern of a Reshape around the InstanceNormalization node; I don't know why it was designed like this. But since it's valid usage, should we optimize for it even when Reshape is used like this?
The Reshape is actually changing the size of the second dim from the 512 in the Conv output to 32, so it's not just flattening the inner dimensions. That makes it impossible to make the Reshape compatible with the perms in the Transpose, and I'm not sure there's a good way to optimize this in a generic manner.

The GetCapability implementation of the EP could possibly 'fuse' the Transpose -> InstanceNormalization (NHWC) -> Transpose back to the original NCHW InstanceNormalization (there is a second call to GetCapability after layout transformation to ensure the EP takes all the new nodes), but that seems quite model specific: it requires the Transpose from NHWC back to NCHW to be blocked on the next node, and it requires the EP to have both NCHW and NHWC implementations of the operator. If there were something else between the Transpose and the blocking Reshape, the pattern would change and it would no longer be a 'Transpose -> NHWC op -> Transpose' sequence that can simply be fused back to the original node.

I don't think you can do much in the layout transformation itself when it converts the node to the NHWC domain and wraps it in Transpose nodes, because you need transpose optimization to run to know where those Transpose nodes end up and whether they can be removed. The transpose optimizer runs after all the layout transformation changes complete.
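A numpy sketch of the difference (eps, scale/bias, and the spatial sizes are placeholders): a Reshape that only flattened the spatial dims would leave the per-channel statistics, and hence the InstanceNormalization output, unchanged, but the 512 -> 32 regrouping pools statistics over blocks of 16 channels, so the Reshape changes the result and cannot simply be dropped or commuted with the NCHW<->NHWC Transpose.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # Per-sample, per-channel normalization over the remaining dims
    # (ONNX InstanceNormalization with scale = 1 and bias = 0).
    axes = tuple(range(2, x.ndim))
    return (x - x.mean(axis=axes, keepdims=True)) / np.sqrt(x.var(axis=axes, keepdims=True) + eps)

x = np.random.randn(1, 512, 8, 8).astype(np.float32)

# Flattening only the spatial dims keeps the 512 channels intact: same result.
flat = instance_norm(x.reshape(1, 512, -1)).reshape(x.shape)
assert np.allclose(instance_norm(x), flat, atol=1e-5)

# The model's Reshape instead regroups to 32 "channels" (16 real channels each),
# which changes the normalization statistics, so the Reshape is load-bearing.
grouped = instance_norm(x.reshape(1, 32, -1)).reshape(x.shape)
assert not np.allclose(instance_norm(x), grouped, atol=1e-5)
```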
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the issue
I see many subgraphs like the one below in sd-vae-decoder.onnx. The sequence is:
NCHW data -> Conv -> Reshape -> InstanceNormalization -> Reshape -> ...
When converted to the NHWC layout, Transpose pairs are inserted around the layout-sensitive nodes, giving roughly:
NCHW data -> Transpose (to NHWC) -> Conv (NHWC) -> Transpose (to NCHW) -> Reshape -> Transpose (to NHWC) -> InstanceNormalization (NHWC) -> Transpose (to NCHW) -> Reshape -> ...
I hope the Transpose ops before the Reshape ops can be optimized out, so that after the optimization the converted sequence keeps the layout-sensitive ops in NHWC without the back-and-forth Transpose pairs around the Reshape nodes.
With this optimization, the number of inserted Transpose ops would be greatly reduced. I think it is doable for the Reshape, since the Reshape's `shape` input is flexible and can be adjusted accordingly.
To reproduce
Urgency
No response
Platform
Windows
OS Version
Win11
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
538e97c
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Other / Unknown
Execution Provider Library Version
JSEP
Model File
No response
Is this a quantized model?
Yes