Make CUDA a NHWC EP #17200

Merged
hariharans29 merged 25 commits into microsoft:main on Oct 16, 2023
Conversation

gedoensmax
Contributor

gedoensmax commented Aug 17, 2023

Description

CUDA inference speed relies heavily on Tensor Cores. For Tensor Cores to achieve optimal throughput, the data layout must be NHWC rather than NCHW.

Motivation and Context

This is especially important for convolutional networks. I will illustrate it with a very simple network:

import torch
import torch.nn as nn

class Net1(nn.Module):

    def __init__(self):
        super(Net1, self).__init__()
        # stack of Conv2d layers: 8 -> 32 -> 64 -> 128 -> 128 -> 128 channels
        self.m = nn.ModuleList([
            nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
        ])
    def forward(self, x):
        for module in self.m:
            x = module(x)
        return x


if __name__ == "__main__":
    dtype = torch.half
    device = "cuda"

    dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device)
    model = Net1().to(dtype=dtype, device=device)
    input_names = ["input1"]
    output_names = ["output1"]
    torch.onnx.export(model, dummy_input, "test.onnx",
                      input_names=input_names, output_names=output_names)

I profiled the launch of ./build/RelWithDebInfo/onnxruntime_perf_test -e cuda -I -q -t 5 test.onnx using nsys and NVTX ranges.
Current master launches the kernels below:

![image](https://github.com/microsoft/onnxruntime/assets/44298237/81655fce-0f8e-4f78-9335-b858a8c8977b)

If I add the newly introduced -l flag, we see the kernels below:

![image](https://github.com/microsoft/onnxruntime/assets/44298237/fceb5d6f-c12d-442b-b15a-948797630008)

Notice that the per-operation NCHW<->NHWC conversion kernels are gone. The layout optimizer instead inserts a transpose op only as the first and last op of the whole network. The op_generic_tensor_kernel corresponds to the bias addition, which should also be optimized out next.

Measured across some very basic models:

| CUDA EP | NCHW [ms] (-e cuda -t 5 -q) | NHWC [ms] (-e cuda -t 5 -q -l) | Speedup |
|:------------------------|------:|------:|-----:|
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.40 |
| resnet101-v2-7_bs8 | 21.8 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |

Average speedup: 1.53

Outlook

Next, the plan is to first write a templated unit test that checks NHWC ops for correctness against their NCHW counterparts. After that we have to transition more ops so that performance improvements can be measured on a broader range of models. Currently this is not easily possible, as we do not yet support all ops in the NHWC domain.

@gedoensmax
Contributor Author

@skottmckay in case you have some feedback or ideas on unit testing this transition, or even on allowing a "partial NHWC" provider. I think the latter is not possible, though, as the layout transformer asks for the preferred layout of the whole provider rather than of each op, right?

@jywu-msft
Member

@hariharans29
Member

@skottmckay in case you have some feedback or ideas on unit testing this transition, or even on allowing a "partial NHWC" provider. I think the latter is not possible, though, as the layout transformer asks for the preferred layout of the whole provider rather than of each op, right?

Are you wondering how to unit test an NHWC CUDA op?
Just thinking out loud: let us say that, for testing purposes, you create a single-node model (e.g. Conv). We should be able to instantiate a session for the same model in both NCHW (the default for the CUDA EP) and NHWC (with your new provider option set). I am guessing that for NHWC we should see a Transpose inserted before and after the node by the layout transformer. If we provide the same input to the NCHW session and the NHWC session, we should get back the same results in both cases. Would this work to unit test a single op?
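
For illustration, a minimal Python sketch of such a dual-session comparison (assuming the prefer_nhwc CUDA EP option that this PR ultimately introduced, and the test.onnx model from the description):

```
import numpy as np
import onnxruntime as ort

# Run the same model with the default NCHW CUDA EP and with the NHWC-preferring
# CUDA EP, then check that both produce the same outputs.
x = np.random.randn(8, 8, 512, 512).astype(np.float16)

sess_nchw = ort.InferenceSession("test.onnx", providers=["CUDAExecutionProvider"])
sess_nhwc = ort.InferenceSession(
    "test.onnx",
    providers=[("CUDAExecutionProvider", {"prefer_nhwc": "1"})],
)

y_nchw = sess_nchw.run(None, {"input1": x})[0]
y_nhwc = sess_nhwc.run(None, {"input1": x})[0]

# fp16 convolutions may differ slightly between layouts, so compare with a tolerance
np.testing.assert_allclose(y_nchw, y_nhwc, rtol=1e-2, atol=1e-2)
```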

@hariharans29
Member

hariharans29 commented Aug 18, 2023

Adding a comment to track documenting this feature here once this lands. Sample PR: #10859

@hariharans29
Member

hariharans29 commented Aug 18, 2023

Measured across some very basic models:

| CUDA EP | NCHW [ms] (-e cuda -t 5 -q) | NHWC [ms] (-e cuda -t 5 -q -l) | Speedup |
|:------------------------|------:|------:|-----:|
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.40 |
| resnet101-v2-7_bs8 | 21.8 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |

Average speedup: 1.53

Just curious - if NHWC is needed to leverage Tensor Cores better, how are we seeing a speedup for resnet101-v2-7_bs8 (which I assume is fp32) when using NHWC compared to NCHW? Tensor Cores are primarily used for fp16 MMA, right?

@twoapples1

@skottmckay in case you have some feedback or ideas on unit testing this transition, or even on allowing a "partial NHWC" provider. I think the latter is not possible, though, as the layout transformer asks for the preferred layout of the whole provider rather than of each op, right?

Hello, I have some questions about this test. It is undeniable that convolution is faster with the NHWC data format than with NCHW. However, in deep learning models convolution is not the only operator; there are others such as upsample and maxpool. Considering that ONNX currently only supports the NCHW data format, I think there are two possible methods:
The first method is to convert the data between NCHW and NHWC immediately before and after each convolution, so that only the convolution itself is computed in NHWC, as in the first test. However, such frequent data conversion inevitably incurs time overhead.
The second method is to convert the computation of the entire graph to the NHWC data format. In this case only the first and last operators would require a data format conversion. However, we would need to add an NHWC implementation for each operator, and when parsing the ONNX model we would need to pay attention to the layout of the weight parameters. That is a lot of work. Are both of these methods correct? Based on the test, do you intend to choose the second method?

@gedoensmax
Contributor Author

gedoensmax commented Aug 21, 2023

session for the same model in both NCHW (default for CUDA EP) and NHWC (with your new provider option set)

@hariharans29 exactly, that is my plan, and I think with the example you sent me on Teams:

TEST(AttentionTest, AttentionPastState_dynamic) {
  // ORT enables TF32 in GEMM for A100. TF32 will cause precision loss and fail this test.
  // Do not run this test unless TF32 is disabled explicitly.
  if (HasCudaEnvironment(800) && ParseEnvironmentVariableWithDefault<int>("NVIDIA_TF32_OVERRIDE", 1) != 0) {
    GTEST_SKIP() << "Skipping AttentionPastState_dynamic in A100 since TF32 is enabled";
    return;
  }
  // create rand inputs
  RandomValueGenerator random{};
  std::vector<int64_t> input_dims{2, 5, 768};
  std::vector<float> input_data = random.Gaussian<float>(input_dims, 0.0f, 0.3f);
  std::vector<int64_t> weight_dims{768, 2304};
  std::vector<float> weight_data = random.Gaussian<float>(weight_dims, 0.0f, 0.3f);
  std::vector<int64_t> bias_dims{2304};
  std::vector<float> bias_data = random.Gaussian<float>(bias_dims, 0.0f, 0.3f);
  std::vector<int64_t> past_dims{2, 2, 12, 15, 64};
  std::vector<float> past_data = random.Gaussian<float>(past_dims, 0.0f, 0.3f);
  OpTester test("Attention", 1, onnxruntime::kMSDomain);
  test.AddAttribute<int64_t>("num_heads", 12);
  test.AddAttribute<int64_t>("unidirectional", 1);
  test.AddInput<float>("input", input_dims, input_data);
  test.AddInput<float>("weight", weight_dims, weight_data);
  test.AddInput<float>("bias", bias_dims, bias_data);
  test.AddOptionalInputEdge<int32_t>();
  test.AddInput<float>("past", past_dims, past_data);
  test.AddReferenceOutputs("testdata/attention_past_state.onnx", 0.005f);
  test.Run();
}
that should easily be possible.

Also, thanks for the other comments, e.g. on the workspace and the conv algo - I think I have to make passing this via -I more general anyway.

@gedoensmax
Contributor Author

@twoapples1 Ideally we want to transition all ops to NHWC so that there are as few conversions as possible inside the model. As you can see from the nsys traces I shared as a screenshot, the NCHW-to-NHWC conversions and back can currently happen at every single node!
So yes, my intention is to convert all ops to NHWC and to transpose the weights once, when the network is loaded.
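
As a purely illustrative sketch of that one-time weight transpose (the shapes below are hypothetical, matching the first conv of the example network, and this is not ORT's actual implementation):

```
import numpy as np

# A conv weight stored for NCHW is laid out as (out_channels, in_channels, kH, kW);
# NHWC kernels instead expect (out_channels, kH, kW, in_channels).
w_nchw = np.random.randn(32, 8, 5, 5).astype(np.float16)
w_nhwc = np.ascontiguousarray(np.transpose(w_nchw, (0, 2, 3, 1)))
print(w_nchw.shape, "->", w_nhwc.shape)  # (32, 8, 5, 5) -> (32, 5, 5, 8)
```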

@gedoensmax
Contributor Author

Just curious - if NHWC is needed to leverage Tensor Cores better, how are we seeing a speedup for resnet101-v2-7_bs8 (which I assume is fp32) when using NHWC compared to NCHW? Tensor Cores are primarily used for fp16 MMA, right?

You are absolutely right, but since I tested on an Ada-series GPU I also benefit from TF32 acceleration. On a Turing-series GPU, NHWC is actually a little slower than NCHW - not because of the kernels, but because FusedConv is used for NCHW and is not selected for NHWC.

@hariharans29
Member

Just curious - if NHWC is needed to leverage Tensor Cores better, how are we seeing a speedup for resnet101-v2-7_bs8 (which I assume is fp32) when using NHWC compared to NCHW? Tensor Cores are primarily used for fp16 MMA, right?

You are absolutely right, but since I tested on an Ada-series GPU I also benefit from TF32 acceleration. On a Turing-series GPU, NHWC is actually a little slower than NCHW - not because of the kernels, but because FusedConv is used for NCHW and is not selected for NHWC.

It seems like cuDNN generally has better Conv implementations for NHWC (irrespective of the data type)?

@skottmckay
Contributor

@skottmckay in case you have some feedback or ideas on unit testing this transition, or even on allowing a "partial NHWC" provider. I think the latter is not possible, though, as the layout transformer asks for the preferred layout of the whole provider rather than of each op, right?

One option would be to register another EP that shares a lot of the implementation with the existing EP but asks for the NHWC layout. If it has higher priority than the existing EP, it would get the first chance to request nodes and have them converted to NHWC. Remaining nodes could be taken by the existing EP. Having 2 CUDA EPs would probably be slightly confusing though, so it depends on whether that would be a short-term or long-term situation. It may also cause other complexities (e.g. would it try to synchronize between nodes assigned to different EPs) that would take time to work through.

Alternatively, the EP interface could be expanded to try and support this. It is not clear how that could/should look though. Short term, it may be better to manually add the handling in layout_transformation.cc until we see whether any other EPs would ever need this. We could maybe expand GetEPLayoutSensitiveOps in this draft PR; I may have time to get back to that PR in the next couple of weeks. If you removed operators from the layout-sensitive set for the EP, the layout transformation would not convert them to NHWC. So the EP would say NHWC is its preferred layout, but only a subset of layout-sensitive nodes would be converted.

@skottmckay
Contributor

Alternatively, the EP interface could be expanded to try and support this. It is not clear how that could/should look though. Short term, it may be better to manually add the handling in layout_transformation.cc until we see whether any other EPs would ever need this. We could maybe expand GetEPLayoutSensitiveOps in this draft PR.

It might be better to abstract this out a little in TransformLayoutForEP and have a function like bool ConvertLayoutForNode(const Node& node) instead of directly looking up OpType() in the set of layout-sensitive ops as we do currently:

const auto& layout_sensitive_ops = GetORTLayoutSensitiveOps();
// to convert to NHWC we need to wrap layout sensitive nodes to Transpose from NCHW to NHWC and back.
for (auto& node : api_graph->Nodes()) {
  if (layout_sensitive_ops.count(node->OpType())) {
    if (node->GetExecutionProviderType() != execution_provider.Type()) {
      continue;
    }
    // ...
The default implementation of ConvertLayoutForNode can do the existing lookup in the set of layout-sensitive ops, but that also gives a place where EP-specific behavior can be plugged in. In the case of the CUDA EP it can return true for the subset of layout-sensitive nodes that it wants converted. Passing in the Node allows you to check the op type, the domain, and the EP it is assigned to.

I'd start with this EP-specific logic living in layout_transformation.cc, but if necessary (i.e. more EPs need to control this behavior) we could update the EP API to allow the EP to optionally provide a ConvertLayoutForNode delegate.

@gedoensmax
Contributor Author

It might be better to abstract this out a little in TransformLayoutForEP and have a function like bool ConvertLayoutForNode(const Node& node) instead of directly looking up OpType() in the set of layout-sensitive ops as we do currently:

A big +1 from my side, as it would also allow an easier transition from one layout to the other. I am open to discussing and helping out with that, but for now I'll probably leave the design of such an API to the core CUDA EP devs here.

gedoensmax changed the title from "Draft: Make CUDA a NHWC EP" to "Make CUDA a NHWC EP" on Aug 30, 2023
@hariharans29
Member

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, Linux QNN CI Pipeline

@hariharans29
Member

/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows ARM64 QNN CI Pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, ONNX Runtime React Native CI Pipeline, Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

1 similar comment
@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@hariharans29
Member

@gedoensmax - Can you please resolve the conflict ?

@gedoensmax
Contributor Author

Done.

@hariharans29
Member

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, Linux QNN CI Pipeline

@hariharans29
Member

/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows ARM64 QNN CI Pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, ONNX Runtime React Native CI Pipeline, Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

hariharans29 merged commit 7c17e33 into microsoft:main Oct 16, 2023
65 checks passed
hariharans29 added a commit that referenced this pull request Oct 17, 2023
### Description
This PR:

(1) Fixes the AMD builds after #17200 broke them (we need to remember to run AMD builds when merging external CUDA PRs next time).

(2) Turns on the NHWC CUDA feature in the Linux GPU CI. The extra time spent building a few more files and running a few more tests will not be much.

Test Linux GPU CI run :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770

### Motivation and Context
Keep the NHWC CUDA ops tested (#17200) and guard against regressions.
jchen351 pushed a commit that referenced this pull request Oct 18, 2023
jchen351 pushed a commit that referenced this pull request Oct 18, 2023
@YangQiangli

Hello, I've noticed that NHWC has been merged into the main branch. I tried to enable it by adding "--cmake_extra_defines onnxruntime_USE_CUDA_NHWC_OPS=ON" and compiling. However, when I attempt to test the performance using "onnxruntime_perf_test" with "-e cuda -t 5 -q -l", it gives an error saying there is no "-l" configuration. How can I use onnxruntime_perf_test to test the performance of NHWC?

@gedoensmax
Contributor Author

Oh sorry, I did not update the description. The option for that is now -i "prefer_nhwc|1". @YangQiangli
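
Putting the pieces from this thread together, a minimal sketch of enabling the NHWC path after this change (assuming a build with onnxruntime_USE_CUDA_NHWC_OPS=ON) could look like:

```
# onnxruntime_perf_test equivalent: -e cuda -t 5 -q -i "prefer_nhwc|1" test.onnx
import onnxruntime as ort

# Pass the CUDA EP's prefer_nhwc option through the providers argument.
sess = ort.InferenceSession(
    "test.onnx",
    providers=[("CUDAExecutionProvider", {"prefer_nhwc": "1"})],
)
```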

kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024
### Description
- Treat Resize as layout sensitive by default
  - whilst the ONNX spec does not specify a layout, EPs tend to implement only one
- Add a second usage in L2 of the TransposeOptimizer to plug in the ability to push a Transpose through a Resize assigned to the CPU EP
- Allow EP-specific logic that changes the ops considered layout sensitive to be plugged in
  - expected usage is for microsoft#17200

### Motivation and Context
Finish simplifying/clarifying the transpose optimization and layout transformation proposed in microsoft#15552. This PR along with microsoft#17618 should complete the changes.

---------

Co-authored-by: Edward Chen <[email protected]>
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024