
ConvTranspose using CUDNN Frontend with NHWC support #21752

Merged: 5 commits into microsoft:main, Sep 10, 2024

Conversation

JTischbein
Contributor

Description

Added the cuDNN Frontend and used it for the NHWC ConvTranspose op, including an option for bias fusion. Similar to this Conv PR.

Backward compatible

If ORT is built with cuDNN 8, the cuDNN frontend will not be built into the binary; the old kernels (using the cuDNN backend APIs) are used instead.

Major Changes

For cuDNN 9, the cuDNN frontend is enabled to fuse the data-gradient convolution with the bias when the provider option fuse_conv_bias=1 is set.

Potential Issues

The cuDNN frontend uses TF32 by default. TF32 can be disabled via the use_tf32 CUDA provider option, but if the cuDNN frontend encounters issues building an operation graph, it will fall back to using TF32.
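
For illustration, a minimal sketch of configuring these options through the ORT C++ API. The option keys fuse_conv_bias, use_tf32, and prefer_nhwc are the ones named in this PR; the model path is a placeholder:

#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env;
  Ort::SessionOptions session_options;

  // Create CUDA provider options and set string key/value pairs.
  const OrtApi& api = Ort::GetApi();
  OrtCUDAProviderOptionsV2* cuda_options = nullptr;
  Ort::ThrowOnError(api.CreateCUDAProviderOptions(&cuda_options));

  const char* keys[] = {"fuse_conv_bias", "use_tf32", "prefer_nhwc"};
  const char* values[] = {"1", "0", "1"};  // fuse bias; disable TF32; prefer NHWC
  Ort::ThrowOnError(api.UpdateCUDAProviderOptions(cuda_options, keys, values, 3));

  session_options.AppendExecutionProvider_CUDA_V2(*cuda_options);
  api.ReleaseCUDAProviderOptions(cuda_options);

  Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);
  return 0;
}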

Follow ups

This is one of the PRs aiming to enable NHWC (here, for the ConvTranspose operation) by default in the CUDA EP when the device supports it. Other changes will follow to make that possible:
(1) Enable prefer_nhwc by default for devices with sm >= 70.
(2) Make fuse_conv_bias=1 the default after more testing.
(3) Add other NHWC operators (such as Resize or UpSample).

Motivation and Context

The new cuDNN Frontend library provides the ability to fuse operations, as well as new heuristics for kernel selection. Here it fuses the convolution data-gradient operation (ConvTranspose) with the pointwise bias operation.
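
A rough sketch of what such a fusion graph looks like, assuming the cudnn-frontend v1.x graph API. This is illustrative only, not the exact code in this PR, and the tensor shapes are made up:

#include <cudnn_frontend.h>

// Express ConvTranspose as a fused data-gradient convolution + bias add.
cudnn_frontend::graph::Graph BuildDgradBiasGraph() {
  namespace fe = cudnn_frontend;
  fe::graph::Graph graph;
  graph.set_io_data_type(fe::DataType_t::HALF)
      .set_compute_data_type(fe::DataType_t::FLOAT);

  // Dims are [N, C, H, W]; the NHWC layout is expressed via the strides.
  auto dy = graph.tensor(fe::graph::Tensor_attributes()
                             .set_name("dy")
                             .set_dim({8, 64, 56, 56})
                             .set_stride({64 * 56 * 56, 1, 64 * 56, 64}));
  auto w = graph.tensor(fe::graph::Tensor_attributes()
                            .set_name("w")
                            .set_dim({64, 32, 3, 3})
                            .set_stride({32 * 3 * 3, 1, 32 * 3, 32}));

  // Data-gradient convolution: the backward-data pass that implements ConvTranspose.
  auto dx = graph.conv_dgrad(dy, w,
                             fe::graph::Conv_dgrad_attributes()
                                 .set_padding({1, 1})
                                 .set_stride({1, 1})
                                 .set_dilation({1, 1}));
  dx->set_dim({8, 32, 56, 56});  // dgrad output dims cannot be inferred

  // Fused pointwise bias add on the dgrad output.
  auto b = graph.tensor(fe::graph::Tensor_attributes()
                            .set_name("b")
                            .set_dim({1, 32, 1, 1})
                            .set_stride({32, 1, 1, 1}));
  auto y = graph.pointwise(dx, b,
                           fe::graph::Pointwise_attributes()
                               .set_mode(fe::PointwiseMode_t::ADD));
  y->set_output(true);
  return graph;
}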

Minor Change

There was a small bug in the CUDA convolution operation when GetCudnnConv1dPadToNc1d was enabled.
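
For context, a small self-contained sketch (the helper name is hypothetical) of the 1-D-to-2-D shape padding that the review snippets below perform:

#include <cstdint>
#include <vector>

// A 3-D (1-D conv) shape is padded to 4-D by inserting a dummy dimension of 1.
// With pad_to_nc1d the dummy acts as H ([N, C, 1, L] / [N, 1, L, C]);
// otherwise it acts as W ([N, C, L, 1] / [N, L, 1, C]).
std::vector<int64_t> PadConv1dShape(std::vector<int64_t> dims, bool nhwc, bool pad_to_nc1d) {
  const auto insert_at = pad_to_nc1d ? (nhwc ? 1 : 2)   // fake H dimension
                                     : (nhwc ? 2 : 3);  // fake W dimension
  dims.insert(dims.begin() + insert_at, 1);
  return dims;
}

For example, PadConv1dShape({8, 64, 128}, /*nhwc=*/false, /*pad_to_nc1d=*/true) yields {8, 64, 1, 128}.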

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 9 pipeline(s).

// see Conv<T, NHWC>::UpdateState in /onnxruntime/core/providers/cuda/nn/conv.cc for more details.
if (cuda_ep->GetCudnnConv1dPadToNc1d()) {
  // add fake H dimension
  const auto insert_at = NHWC ? 1 : 2;

Check warning — Code scanning / PREfast: The const variable 'insert_at' can be computed at compile-time. Consider using constexpr (con.5).
  w_dims.insert(w_dims.begin() + insert_at, 1);
} else {
  // add fake W dimension
  const auto insert_at = NHWC ? 2 : 3;
  w_dims.insert(w_dims.begin() + insert_at, 1);
}

Check warning — Code scanning / PREfast: The const variable 'insert_at' can be computed at compile-time. Consider using constexpr (con.5).

ConvTransposeAttributes::Prepare p;
// PrePack moves the M/group dimension of W to the end, with 'M' being interpreted as 'output channels'
const bool transposed_input_channels = false;

Check warning — Code scanning / PREfast: The const variable 'transposed_input_channels' can be computed at compile-time. Consider using constexpr (con.5).

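A sketch of the change the con.5 warnings suggest: since NHWC is a compile-time template parameter and the flag is a literal, both variables can be declared constexpr instead of const.

// Suggested by PREfast (con.5): values known at compile time can be constexpr.
constexpr auto insert_at = NHWC ? 1 : 2;  // likewise 2 : 3 in the else branch
constexpr bool transposed_input_channels = false;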
@tianleiwu
Contributor

/azp run Linux GPU CI Pipeline, Windows GPU TensorRT CI Pipeline


Azure Pipelines successfully started running 2 pipeline(s).

@tianleiwu
Contributor

@JTischbein, there were some test errors in pipelines. Did you have a chance to take a look?

@JTischbein
Contributor Author

> @JTischbein, there were some test errors in pipelines. Did you have a chance to take a look?

The test errors were not reproducible for me. We are currently testing other GPUs; I will keep you updated. For the build error, I will add another [[maybe_unused]] and push it now.
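
(For context, a one-line hypothetical sketch of that kind of fix, assuming the build error is an unused-variable warning treated as an error in some configurations:)

// Hypothetical: marks a variable that is unused in some build configurations.
[[maybe_unused]] const bool transposed_input_channels = false;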

@tianleiwu
Contributor

/azp run Linux GPU CI Pipeline, Windows GPU TensorRT CI Pipeline


Azure Pipelines successfully started running 2 pipeline(s).

@tianleiwu
Contributor

/azp run Linux GPU CI Pipeline, Windows GPU TensorRT CI Pipeline


Azure Pipelines successfully started running 2 pipeline(s).

@tianleiwu
Contributor

/azp run Linux GPU CI Pipeline, Windows GPU TensorRT CI Pipeline


Azure Pipelines successfully started running 2 pipeline(s).

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@tianleiwu
Contributor

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@tianleiwu tianleiwu merged commit 20d9464 into microsoft:main Sep 10, 2024
79 of 81 checks passed