
[Performance] Remove some transpose ops in layout conversion. #18128

Open
qjia7 opened this issue Oct 27, 2023 · 8 comments
Labels
model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. platform:windows issues related to the Windows platform quantization issues related to quantization stale issues that have not been addressed in a while; categorized by a bot

Comments

@qjia7
Contributor

qjia7 commented Oct 27, 2023

Describe the issue

I see many subgraphs like the one below in sd-vae-decoder.onnx.

[screenshot: sd-vae-decoder subgraph]

The sequence is: NCHW data -> Conv -> Reshape -> InstanceNormalization -> Reshape -> ...

When converted to NHWC layout, it becomes:

NCHW data -> Transpose(toNHWC) -> Conv(NHWC) -> Transpose(toNCHW) -> Reshape(NCHW) -> Transpose(toNHWC) -> InstanceNormalization(NHWC) -> Transpose(toNCHW) -> Reshape(NCHW) -> ...

I hope the Transpose ops around the Reshape ops can be optimized out. After the optimization, the converted sequence would be:

NCHW data -> Transpose(toNHWC) -> Conv(NHWC) -> Reshape(NHWC) -> InstanceNormalization(NHWC) -> Reshape(NHWC) -> ...

This would greatly reduce the number of inserted Transpose ops. I think it's doable for Reshape, since the Reshape's target shape is flexible and can be adjusted to match the new layout.
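
A toy check of that last claim (my own sketch, not ORT code; small dims stand in for the real ones, and it assumes the Reshape keeps the channel dim intact and only merges the spatial dims): pushing the Transpose through such a Reshape just means permuting the Reshape's target shape.

import numpy as np

x = np.random.rand(1, 512, 8, 8)                         # NCHW

# NCHW path: Reshape merges H*W, then convert to channels-last for comparison.
nchw_path = x.reshape(1, 512, 64).transpose(0, 2, 1)

# NHWC path: convert to NHWC first, then apply the Reshape with a permuted
# target shape (channels stay last).
nhwc_path = x.transpose(0, 2, 3, 1).reshape(1, 64, 512)

print(np.allclose(nchw_path, nhwc_path))                 # True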

To reproduce

  1. Download the model from https://huggingface.co/aislamov/stable-diffusion-2-1-base-onnx/tree/9f697c96d42e5c09437ff14b0a2b287366ce488d/vae_decoder
  2. Open the model in Netron.
  3. Inspect the graph structure.
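
As a quick alternative to eyeballing the graph in Netron, a short script can count the Conv -> Reshape -> InstanceNormalization chains described above (my own sketch; the model path is wherever step 1 placed the file):

import onnx

model = onnx.load("vae_decoder/model.onnx")
nodes = model.graph.node
by_output = {out: n for n in nodes for out in n.output}  # producer of each tensor

count = 0
for n in nodes:
    if n.op_type == "InstanceNormalization":
        reshape = by_output.get(n.input[0])
        if reshape is not None and reshape.op_type == "Reshape":
            conv = by_output.get(reshape.input[0])
            if conv is not None and conv.op_type == "Conv":
                count += 1

print(f"Conv -> Reshape -> InstanceNormalization chains: {count}")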

Urgency

No response

Platform

Windows

OS Version

Win11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

538e97c

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Other / Unknown

Execution Provider Library Version

JSEP

Model File

No response

Is this a quantized model?

Yes

@github-actions github-actions bot added model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. platform:windows issues related to the Windows platform quantization issues related to quantization labels Oct 27, 2023
@qjia7
Contributor Author

qjia7 commented Oct 27, 2023

@skottmckay Would you like to look at this issue? Or could you point me to where this change should be made? I am not familiar with ORT's layout transformation.

@skottmckay
Contributor

Can you set the optimized model path in the session options and see what that model looks like? That model is saved after the optimizations are done, so it should show where the Transpose nodes got stuck.

https://onnxruntime.ai/docs/api/c/struct_ort_api.html#ad238e424200c0f1682947a1f342c39ca
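
In the Python API that's just (sketch; file paths are placeholders, and since JSEP itself runs in the browser, the CPU EP is used here as a stand-in, so the surviving Transpose nodes may differ):

import onnxruntime as ort

so = ort.SessionOptions()
so.optimized_model_filepath = "vae_decoder_optimized.onnx"  # dumped after optimization

# Creating the session runs the optimizers and writes the optimized model,
# which can then be opened in Netron.
ort.InferenceSession("vae_decoder/model.onnx", so, providers=["CPUExecutionProvider"])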

@tianleiwu
Contributor

tianleiwu commented Oct 27, 2023

@skottmckay, I tried it with the CPU and CUDA EPs (so it might be different from WebGPU).

Optimized graph for CPU EP:
[screenshot: optimized graph for CPU EP]
It seems that on CPU, the Conv is in the com.microsoft.nchwc domain (I am curious why it does not use the NHWC domain), and the InstanceNormalization is still in the onnx domain (NCHW). I think we have both Conv and InstanceNormalization in the com.microsoft.nhwc domain:

REGISTER_NHWC_SCHEMA_WITH_ACTIVATION(fn, InstanceNormalization, 6);
REGISTER_NHWC_SCHEMA_WITH_ACTIVATION(fn, Conv, 11);

Optimized graph for CUDA EP:
[screenshot: optimized graph for CUDA EP]

I think the Transpose nodes in the "Transpose -> QuickGelu -> Transpose" pattern should be removed, since QuickGelu is an element-wise operator. Also, the other Transpose node in the above graph can be constant-folded.
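
For reference, the reason the pair around an element-wise op is removable: the op commutes with any axis permutation. A toy check (my own sketch, assuming QuickGelu(x) = x * sigmoid(alpha * x) with the usual alpha = 1.702):

import numpy as np

def quick_gelu(x, alpha=1.702):
    return x * (1.0 / (1.0 + np.exp(-alpha * x)))    # x * sigmoid(alpha * x)

x = np.random.rand(1, 4, 8, 8)
perm = (0, 2, 3, 1)                                  # NCHW -> NHWC

# Transpose -> QuickGelu == QuickGelu -> Transpose, so the wrapping
# Transpose pair can be pushed through and cancelled.
print(np.allclose(quick_gelu(x.transpose(perm)),
                  quick_gelu(x).transpose(perm)))    # True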

@skottmckay
Contributor

@tianleiwu The CPU EP prefers NCHW. The OP is using JSEP, which prefers NHWC (bringing the layout transformer into play).

On CPU, InstanceNormalization is in NCHWc if certain AVX instructions are available. The NCHWc transformer was added a long time ago, when we had no NHWC support. I don't know whether NHWC should be preferred over NCHWc, either in general or when the AVX instructions are available.

The transpose optimizer can't push Transpose nodes through operators it knows nothing about. We can fairly easily add a handler for QuickGelu, though, using the other element-wise ops as an example.

Add it to this map, probably using HandleSimpleNode or HandleSimpleNodeBroadcast:

const HandlerMap& OrtExtendedHandlers() {
  static const HandlerMap extended_handler_map = []() {
    HandlerMap map = {
        {"MaxPool", max_pool_op_handler},
        {"Resize", ep_aware_resize_handler},
        {"com.microsoft.QuantizeLinear", contrib_quantize_dequantize_linear_handler},
        {"com.microsoft.DequantizeLinear", contrib_quantize_dequantize_linear_handler},
        {"com.microsoft.QLinearAdd", q_linear_binary_op_handler},
        {"com.microsoft.QLinearAveragePool", q_linear_pool_op_handler},
        {"com.microsoft.QLinearConcat", q_linear_concat_handler},
        {"com.microsoft.QLinearGlobalAveragePool", q_linear_pool_op_handler},
        {"com.microsoft.QLinearLeakyRelu", node_1_inp_handler},
        {"com.microsoft.QLinearMul", q_linear_binary_op_handler},
        {"com.microsoft.QLinearReduceMean", reduce_op_handler},
        {"com.microsoft.QLinearSigmoid", node_1_inp_handler},
    };

    return map;
  }();

  return extended_handler_map;
}

@skottmckay
Contributor

The issue with this model is the Reshape nodes converting between 4D and 3D values.

e.g. after the two initial Conv nodes there's a Reshape from 4D to 3D prior to the InstanceNormalization. That prevents the Transpose between NCHW and NHWC from being pushed through, as the rank change invalidates the transpose perms.

[screenshot: Reshape between the Conv and InstanceNormalization nodes]

Is there a reason for the Reshape around the InstanceNormalization nodes? According to the ONNX spec the input can be 4D and normalization is per-channel.

https://github.com/onnx/onnx/blob/main/docs/Operators.md#InstanceNormalization

From the look of it, the ORT implementation flattens the inner dimensions anyway, so the Reshape nodes potentially aren't changing the results but are blocking transpose optimization:

const int64_t W = input->Shape().SizeFromDimension(2);
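
A toy check of that reading (my own sketch; it ignores the scale/bias inputs and uses ONNX's default epsilon of 1e-5): InstanceNormalization normalizes per (N, C) over all remaining dims, so a Reshape that only flattens the inner dims doesn't change the result.

import numpy as np

def instance_norm(x, eps=1e-5):
    axes = tuple(range(2, x.ndim))                   # every dim after N and C
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.rand(1, 4, 8, 8).astype(np.float32)    # NCHW
direct = instance_norm(x)                            # normalize the 4D input
via_3d = instance_norm(x.reshape(1, 4, 64)).reshape(1, 4, 8, 8)

print(np.allclose(direct, via_3d))                   # True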

@qjia7
Contributor Author

qjia7 commented Oct 31, 2023

cc @dakenf @guschmue

Is there a reason for the Reshape around the InstanceNormalization nodes?

I got this model from @dakenf and the f16 one from @guschmue. Both of them have the same pattern of Reshape nodes around the InstanceNormalization node. I don't know why it was designed like this, but since it's valid usage, should we optimize this case even when Reshape is used this way?

@skottmckay
Contributor

The Reshape is actually changing the size of the second dim from the 512 in the Conv output to 32, so it's not just flattening the inner dimensions. That makes it impossible to make the Reshape compatible with the perms in the Transpose.

I'm not sure there's a good way to optimize this in a generic manner. The EP's GetCapability implementation could possibly 'fuse' the Transpose -> InstanceNormalization(NHWC) -> Transpose sequence back to the original NCHW InstanceNormalization (there's a second call to GetCapability after layout transformation to ensure the EP takes all the new nodes). But that seems quite model-specific: it requires the Transpose from NHWC back to NCHW to be blocked on the next node, and it requires the EP to have both NCHW and NHWC implementations of the operator. If there were something else between the Transpose and the blocking Reshape, the pattern would change and it would no longer be a 'Transpose -> NHWC op -> Transpose' sequence that can simply be fused back to the original node.
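
A toy illustration of why the channel regrouping blocks the Transpose (my own sketch; C=8 split into 2 groups stands in for 512 split into 32):

import numpy as np

x = np.arange(1 * 8 * 2 * 2).reshape(1, 8, 2, 2)     # NCHW, C=8

# NCHW path: split the channels into 2 groups of 4 and flatten the rest,
# then convert to channels-last for comparison.
want = x.reshape(1, 2, 16).transpose(0, 2, 1)        # (1, 16, 2)

# NHWC path: channels are innermost, so no bare Reshape reproduces the
# grouping; this natural candidate interleaves spatial positions instead.
got = x.transpose(0, 2, 3, 1).reshape(1, 16, 2)

print(np.array_equal(want, got))                     # False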

I don't think you can do much in the layout transformation itself when it converts the node to the NHWC domain and wraps it in Transpose nodes, because you need transpose optimization to run to know where those Transpose nodes end up and whether they can be removed. The transpose optimizer runs after all the layout transformation changes are complete.

Contributor

github-actions bot commented Jan 3, 2024

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Jan 3, 2024