
[Performance] Remove some transpose ops in layout conversion. #18128

Open
qjia7 opened this issue Oct 27, 2023 · 8 comments
Labels
model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. platform:windows issues related to the Windows platform quantization issues related to quantization stale issues that have not been addressed in a while; categorized by a bot

Comments

@qjia7
Contributor

qjia7 commented Oct 27, 2023

Describe the issue

I see many subgraphs like the one below in sd-vae-decoder.onnx.

[screenshot: sd-vae-decoder subgraph]

The sequence is: NCHW data -> Conv -> Reshape -> InstanceNormalization -> Reshape -> ...

When converted to NHWC layout, it becomes:

NCHW data -> Transpose(toNHWC) -> Conv(NHWC) -> Transpose(toNCHW) -> Reshape(NCHW) -> Transpose(toNHWC) -> InstanceNormalization(NHWC) -> Transpose(toNCHW) -> Reshape(NCHW) -> ...

I hope the Transpose ops around the Reshape ops can be optimized out. After the optimization, the converted sequence would be:

NCHW data -> Transpose(toNHWC) -> Conv(NHWC) -> Reshape(NHWC) -> InstanceNormalization(NHWC) -> Reshape(NHWC) -> ...

This would greatly reduce the number of inserted Transpose ops. I think it's doable for Reshape, since the Reshape's target shape is flexible and can be adjusted to match the new layout.
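
A toy check of that last claim (my own sketch, not ORT code; small dims stand in for the real ones, and it assumes the Reshape keeps the channel dim intact and only merges the spatial dims): pushing the Transpose through such a Reshape just means permuting the Reshape's target shape.

import numpy as np

x = np.random.rand(1, 512, 8, 8)                         # NCHW

# NCHW path: Reshape merges H*W, then convert to channels-last for comparison.
nchw_path = x.reshape(1, 512, 64).transpose(0, 2, 1)

# NHWC path: convert to NHWC first, then apply the Reshape with a permuted
# target shape (channels stay last).
nhwc_path = x.transpose(0, 2, 3, 1).reshape(1, 64, 512)

print(np.allclose(nchw_path, nhwc_path))                 # True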

To reproduce

  1. Download the model from https://huggingface.co/aislamov/stable-diffusion-2-1-base-onnx/tree/9f697c96d42e5c09437ff14b0a2b287366ce488d/vae_decoder
  2. Open the model in Netron.
  3. Inspect the graph structure.
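
As a quick alternative to eyeballing the graph in Netron, a short script can count the Conv -> Reshape -> InstanceNormalization chains described above (my own sketch; the model path is wherever step 1 placed the file):

import onnx

model = onnx.load("vae_decoder/model.onnx")
nodes = model.graph.node
by_output = {out: n for n in nodes for out in n.output}  # producer of each tensor

count = 0
for n in nodes:
    if n.op_type == "InstanceNormalization":
        reshape = by_output.get(n.input[0])
        if reshape is not None and reshape.op_type == "Reshape":
            conv = by_output.get(reshape.input[0])
            if conv is not None and conv.op_type == "Conv":
                count += 1

print(f"Conv -> Reshape -> InstanceNormalization chains: {count}")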

Urgency

No response

Platform

Windows

OS Version

Win11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

538e97c

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Other / Unknown

Execution Provider Library Version

JSEP

Model File

No response

Is this a quantized model?

Yes

@github-actions github-actions bot added model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. platform:windows issues related to the Windows platform quantization issues related to quantization labels Oct 27, 2023
@qjia7
Contributor Author

qjia7 commented Oct 27, 2023

@skottmckay Would you like to look at this issue? Or could you point me to where this change should be made? I am not familiar with ORT's layout transformation.

@skottmckay
Contributor

Can you set the optimized model path in the session options and see what that model looks like? That model is saved after the optimizations are done, so it should show where the Transpose nodes got stuck.

https://onnxruntime.ai/docs/api/c/struct_ort_api.html#ad238e424200c0f1682947a1f342c39ca
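
In the Python API that's just (sketch; file paths are placeholders, and since JSEP itself runs in the browser, the CPU EP is used here as a stand-in, so the surviving Transpose nodes may differ):

import onnxruntime as ort

so = ort.SessionOptions()
so.optimized_model_filepath = "vae_decoder_optimized.onnx"  # dumped after optimization

# Creating the session runs the optimizers and writes the optimized model,
# which can then be opened in Netron.
ort.InferenceSession("vae_decoder/model.onnx", so, providers=["CPUExecutionProvider"])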

@tianleiwu
Contributor

tianleiwu commented Oct 27, 2023

@skottmckay, I tried it with the CPU and CUDA EPs (so it might be different from WebGPU).

Optimized graph for CPU EP:
[screenshot: optimized graph for CPU EP]
It seems that on CPU, the Conv is in the com.microsoft.nchwc domain (I am curious why it does not use the NHWC domain), and the InstanceNormalization is still in the onnx domain (NCHW). I think we have both Conv and InstanceNormalization in the com.microsoft.nhwc domain:

REGISTER_NHWC_SCHEMA_WITH_ACTIVATION(fn, InstanceNormalization, 6);
REGISTER_NHWC_SCHEMA_WITH_ACTIVATION(fn, Conv, 11);

Optimized graph for CUDA EP:
[screenshot: optimized graph for CUDA EP]

I think the Transpose nodes in the "Transpose -> QuickGelu -> Transpose" pattern should be removed, since QuickGelu is an element-wise operator. Also, the other Transpose node in the above graph can be constant-folded.
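
For reference, the reason the pair around an element-wise op is removable: the op commutes with any axis permutation. A toy check (my own sketch, assuming QuickGelu(x) = x * sigmoid(alpha * x) with the usual alpha = 1.702):

import numpy as np

def quick_gelu(x, alpha=1.702):
    return x * (1.0 / (1.0 + np.exp(-alpha * x)))    # x * sigmoid(alpha * x)

x = np.random.rand(1, 4, 8, 8)
perm = (0, 2, 3, 1)                                  # NCHW -> NHWC

# Transpose -> QuickGelu == QuickGelu -> Transpose, so the wrapping
# Transpose pair can be pushed through and cancelled.
print(np.allclose(quick_gelu(x.transpose(perm)),
                  quick_gelu(x).transpose(perm)))    # True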

@skottmckay
Contributor

@tianleiwu The CPU EP prefers NCHW. The OP is using JSEP, which prefers NHWC (bringing the layout transformer into play).

On CPU, InstanceNormalization is in NCHWc if certain AVX instructions are available. The NCHWc transformer was added a long time ago, when we had no NHWC support. I don't know whether NHWC should be preferred over NCHWc, either in general or when the AVX instructions are available.

The transpose optimizer can't push Transpose nodes through operators it knows nothing about. We can fairly easily add a handler for QuickGelu, though, using the other element-wise ops as an example.

Add it to this map, probably using HandleSimpleNode or HandleSimpleNodeBroadcast:

const HandlerMap& OrtExtendedHandlers() {
  static const HandlerMap extended_handler_map = []() {
    HandlerMap map = {
        {"MaxPool", max_pool_op_handler},
        {"Resize", ep_aware_resize_handler},
        {"com.microsoft.QuantizeLinear", contrib_quantize_dequantize_linear_handler},
        {"com.microsoft.DequantizeLinear", contrib_quantize_dequantize_linear_handler},
        {"com.microsoft.QLinearAdd", q_linear_binary_op_handler},
        {"com.microsoft.QLinearAveragePool", q_linear_pool_op_handler},
        {"com.microsoft.QLinearConcat", q_linear_concat_handler},
        {"com.microsoft.QLinearGlobalAveragePool", q_linear_pool_op_handler},
        {"com.microsoft.QLinearLeakyRelu", node_1_inp_handler},
        {"com.microsoft.QLinearMul", q_linear_binary_op_handler},
        {"com.microsoft.QLinearReduceMean", reduce_op_handler},
        {"com.microsoft.QLinearSigmoid", node_1_inp_handler},
    };

    return map;
  }();

  return extended_handler_map;
}

@skottmckay
Contributor

The issue with this model is the Reshape nodes converting between 4D and 3D values.

e.g. after the two initial Conv nodes there's a Reshape from 4D to 3D prior to the InstanceNormalization. That prevents the Transpose between NCHW and NHWC from being pushed through, as the rank change invalidates the transpose perms.

[screenshot: Reshape between the Conv and InstanceNormalization nodes]

Is there a reason for the Reshape around the InstanceNormalization nodes? According to the ONNX spec the input can be 4D and normalization is per-channel.

https://github.com/onnx/onnx/blob/main/docs/Operators.md#InstanceNormalization

From the look of it, the ORT implementation flattens the inner dimensions anyway, so the Reshape nodes potentially aren't changing the results but are blocking transpose optimization:

const int64_t W = input->Shape().SizeFromDimension(2);
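
A toy check of that reading (my own sketch; it ignores the scale/bias inputs and uses ONNX's default epsilon of 1e-5): InstanceNormalization normalizes per (N, C) over all remaining dims, so a Reshape that only flattens the inner dims doesn't change the result.

import numpy as np

def instance_norm(x, eps=1e-5):
    axes = tuple(range(2, x.ndim))                   # every dim after N and C
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.rand(1, 4, 8, 8).astype(np.float32)    # NCHW
direct = instance_norm(x)                            # normalize the 4D input
via_3d = instance_norm(x.reshape(1, 4, 64)).reshape(1, 4, 8, 8)

print(np.allclose(direct, via_3d))                   # True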

@qjia7
Contributor Author

qjia7 commented Oct 31, 2023

cc @dakenf @guschmue

Is there a reason for the Reshape around the InstanceNormalization nodes?

I got this model from @dakenf and the f16 one from @guschmue. Both of them have the same pattern of Reshape nodes around the InstanceNormalization node. I don't know why it was designed like this, but since it's valid usage, should we optimize this case even when Reshape is used this way?

@skottmckay
Contributor

The Reshape is actually changing the size of the second dim from the 512 in the Conv output to 32, so it's not just flattening the inner dimensions. That makes it impossible to make the Reshape compatible with the perms in the Transpose.

I'm not sure there's a good way to optimize this in a generic manner. The EP's GetCapability implementation could possibly 'fuse' the Transpose -> InstanceNormalization(NHWC) -> Transpose sequence back to the original NCHW InstanceNormalization (there's a second call to GetCapability after layout transformation to ensure the EP takes all the new nodes). But that seems quite model-specific: it requires the Transpose from NHWC back to NCHW to be blocked on the next node, and it requires the EP to have both NCHW and NHWC implementations of the operator. If there were something else between the Transpose and the blocking Reshape, the pattern would change and it would no longer be a 'Transpose -> NHWC op -> Transpose' sequence that can simply be fused back to the original node.
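
A toy illustration of why the channel regrouping blocks the Transpose (my own sketch; C=8 split into 2 groups stands in for 512 split into 32):

import numpy as np

x = np.arange(1 * 8 * 2 * 2).reshape(1, 8, 2, 2)     # NCHW, C=8

# NCHW path: split the channels into 2 groups of 4 and flatten the rest,
# then convert to channels-last for comparison.
want = x.reshape(1, 2, 16).transpose(0, 2, 1)        # (1, 16, 2)

# NHWC path: channels are innermost, so no bare Reshape reproduces the
# grouping; this natural candidate interleaves spatial positions instead.
got = x.transpose(0, 2, 3, 1).reshape(1, 16, 2)

print(np.array_equal(want, got))                     # False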

I don't think you can do much in the layout transformation itself when it converts the node to the NHWC domain and wraps it in Transpose nodes, because you need transpose optimization to run to know where those Transpose nodes end up and whether they can be removed. The transpose optimizer runs after all the layout transformation changes are complete.

Contributor

github-actions bot commented Jan 3, 2024

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Jan 3, 2024