Merge WindowsAI to main #18983

Merged
merged 25 commits into from
Jan 5, 2024


@jeffbloo jeffbloo commented Jan 3, 2024

Merge WindowsAI to main

@jeffbloo jeffbloo requested a review from adtsai January 3, 2024 02:16
@jeffbloo jeffbloo requested a review from a team as a code owner January 3, 2024 02:16
jeffbloo and others added 23 commits January 3, 2024 16:09
Update DML nuget version to 1.13.0
### Description
[Cherry Pick Reviewed]
```
[ OK ] QLinearConcatS8.ExpectFail_WrongZeroPointType_1 (372 ms)
[ RUN ] QLinearConcatS8.InputOne_Dynamic
[ OK ] QLinearConcatS8.InputOne_Dynamic (255 ms)
[ RUN ] QLinearConcatS8.InputOne_Const
[ OK ] QLinearConcatS8.InputOne_Const (255 ms)
[----------] 11 tests from QLinearConcatS8 (3385 ms total)

[----------] Global test environment tear-down
[==========] 21 tests from 3 test suites ran. (9355 ms total)
[ PASSED ] 21 tests.
```
[#16971](#16971)


Co-authored-by: Xiang Zhang <[email protected]>

Co-authored-by: Adrian Tsai <[email protected]>
[Cherry Pick Reviewed]
DML EP Implementation for

[QLinearAveragePool](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QLinearAveragePool)
```
Note: Google Test filter = *QLinear*Pool*
[==========] Running 72 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 36 tests from QLinearGlobalAveragePool
[ RUN      ] QLinearGlobalAveragePool.Nhwc_1x1x32x32
[       OK ] QLinearGlobalAveragePool.Nhwc_1x1x32x32 (410 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_1x32x32x1
[       OK ] QLinearGlobalAveragePool.Nchw_1x32x32x1 (641 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_1x256x8x8
[       OK ] QLinearGlobalAveragePool.Nhwc_1x256x8x8 (156 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_1x8x8x256
[       OK ] QLinearGlobalAveragePool.Nchw_1x8x8x256 (134 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_1x255x7x7
[       OK ] QLinearGlobalAveragePool.Nhwc_1x255x7x7 (160 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_1x7x7x255
[       OK ] QLinearGlobalAveragePool.Nchw_1x7x7x255 (145 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_1x255x8x8
[       OK ] QLinearGlobalAveragePool.Nhwc_1x255x8x8 (148 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_1x8x8x255
[       OK ] QLinearGlobalAveragePool.Nchw_1x8x8x255 (129 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_1x256x7x7
[       OK ] QLinearGlobalAveragePool.Nhwc_1x256x7x7 (134 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_1x7x7x256
[       OK ] QLinearGlobalAveragePool.Nchw_1x7x7x256 (131 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_3x256x8x8
[       OK ] QLinearGlobalAveragePool.Nhwc_3x256x8x8 (159 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_3x8x8x256
[       OK ] QLinearGlobalAveragePool.Nchw_3x8x8x256 (168 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_3x255x7x7
[       OK ] QLinearGlobalAveragePool.Nhwc_3x255x7x7 (139 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_3x7x7x255
[       OK ] QLinearGlobalAveragePool.Nchw_3x7x7x255 (170 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_3x255x8x8
[       OK ] QLinearGlobalAveragePool.Nhwc_3x255x8x8 (155 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_3x8x8x255
[       OK ] QLinearGlobalAveragePool.Nchw_3x8x8x255 (156 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_3x256x7x7
[       OK ] QLinearGlobalAveragePool.Nhwc_3x256x7x7 (133 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_3x7x7x256
[       OK ] QLinearGlobalAveragePool.Nchw_3x7x7x256 (149 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_1x1x32x32_S8
[       OK ] QLinearGlobalAveragePool.Nhwc_1x1x32x32_S8 (131 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_1x32x32x1_S8
[       OK ] QLinearGlobalAveragePool.Nchw_1x32x32x1_S8 (127 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_1x256x8x8_S8
[       OK ] QLinearGlobalAveragePool.Nhwc_1x256x8x8_S8 (153 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_1x8x8x256_S8
[       OK ] QLinearGlobalAveragePool.Nchw_1x8x8x256_S8 (129 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_1x255x7x7_S8
[       OK ] QLinearGlobalAveragePool.Nhwc_1x255x7x7_S8 (133 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_1x7x7x255_S8
[       OK ] QLinearGlobalAveragePool.Nchw_1x7x7x255_S8 (135 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_1x255x8x8_S8
[       OK ] QLinearGlobalAveragePool.Nhwc_1x255x8x8_S8 (129 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_1x8x8x255_S8
[       OK ] QLinearGlobalAveragePool.Nchw_1x8x8x255_S8 (152 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_1x256x7x7_S8
[       OK ] QLinearGlobalAveragePool.Nhwc_1x256x7x7_S8 (140 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_1x7x7x256_S8
[       OK ] QLinearGlobalAveragePool.Nchw_1x7x7x256_S8 (133 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_3x256x8x8_S8
[       OK ] QLinearGlobalAveragePool.Nhwc_3x256x8x8_S8 (135 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_3x8x8x256_S8
[       OK ] QLinearGlobalAveragePool.Nchw_3x8x8x256_S8 (147 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_3x255x7x7_S8
[       OK ] QLinearGlobalAveragePool.Nhwc_3x255x7x7_S8 (156 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_3x7x7x255_S8
[       OK ] QLinearGlobalAveragePool.Nchw_3x7x7x255_S8 (155 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_3x255x8x8_S8
[       OK ] QLinearGlobalAveragePool.Nhwc_3x255x8x8_S8 (138 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_3x8x8x255_S8
[       OK ] QLinearGlobalAveragePool.Nchw_3x8x8x255_S8 (155 ms)
[ RUN      ] QLinearGlobalAveragePool.Nhwc_3x256x7x7_S8
[       OK ] QLinearGlobalAveragePool.Nhwc_3x256x7x7_S8 (144 ms)
[ RUN      ] QLinearGlobalAveragePool.Nchw_3x7x7x256_S8
[       OK ] QLinearGlobalAveragePool.Nchw_3x7x7x256_S8 (139 ms)
[----------] 36 tests from QLinearGlobalAveragePool (5968 ms total)

[----------] 36 tests from QLinearPoolTest
[ RUN      ] QLinearPoolTest.AveragePool1D_ExcludePadPixel
[       OK ] QLinearPoolTest.AveragePool1D_ExcludePadPixel (480 ms)
[ RUN      ] QLinearPoolTest.AveragePool1D_IncludePadPixel
[       OK ] QLinearPoolTest.AveragePool1D_IncludePadPixel (481 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_ExcludePadPixel
[       OK ] QLinearPoolTest.AveragePool2D_ExcludePadPixel (512 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_IncludePadPixel
[       OK ] QLinearPoolTest.AveragePool2D_IncludePadPixel (455 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_MultiChannel
[       OK ] QLinearPoolTest.AveragePool2D_MultiChannel (463 ms)
[ RUN      ] QLinearPoolTest.AveragePool3D_ExcludePadPixel
[       OK ] QLinearPoolTest.AveragePool3D_ExcludePadPixel (448 ms)
[ RUN      ] QLinearPoolTest.AveragePool3D_IncludePadPixel
[       OK ] QLinearPoolTest.AveragePool3D_IncludePadPixel (458 ms)
[ RUN      ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_nhwc
[       OK ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_nhwc (171 ms)
[ RUN      ] QLinearPoolTest.AveragePool1D_IncludePadPixel_nhwc
[       OK ] QLinearPoolTest.AveragePool1D_IncludePadPixel_nhwc (169 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_nhwc
[       OK ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_nhwc (152 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_IncludePadPixel_nhwc
[       OK ] QLinearPoolTest.AveragePool2D_IncludePadPixel_nhwc (660 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_MultiChannel_nhwc
[       OK ] QLinearPoolTest.AveragePool2D_MultiChannel_nhwc (150 ms)
[ RUN      ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_nhwc
[       OK ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_nhwc (145 ms)
[ RUN      ] QLinearPoolTest.AveragePool3D_IncludePadPixel_nhwc
[       OK ] QLinearPoolTest.AveragePool3D_IncludePadPixel_nhwc (146 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_BigImage
[       OK ] QLinearPoolTest.AveragePool2D_BigImage (505 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_BigImage_nhwc
[       OK ] QLinearPoolTest.AveragePool2D_BigImage_nhwc (161 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_Global
[       OK ] QLinearPoolTest.AveragePool2D_Global (481 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_Global_nhwc
[       OK ] QLinearPoolTest.AveragePool2D_Global_nhwc (152 ms)
[ RUN      ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_S8
[       OK ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_S8 (461 ms)
[ RUN      ] QLinearPoolTest.AveragePool1D_IncludePadPixel_S8
[       OK ] QLinearPoolTest.AveragePool1D_IncludePadPixel_S8 (448 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_S8
[       OK ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_S8 (471 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_IncludePadPixel_S8
[       OK ] QLinearPoolTest.AveragePool2D_IncludePadPixel_S8 (473 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_MultiChannel_S8
[       OK ] QLinearPoolTest.AveragePool2D_MultiChannel_S8 (1507 ms)
[ RUN      ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_S8
[       OK ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_S8 (477 ms)
[ RUN      ] QLinearPoolTest.AveragePool3D_IncludePadPixel_S8
[       OK ] QLinearPoolTest.AveragePool3D_IncludePadPixel_S8 (493 ms)
[ RUN      ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_nhwc_S8
[       OK ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_nhwc_S8 (158 ms)
[ RUN      ] QLinearPoolTest.AveragePool1D_IncludePadPixel_nhwc_S8
[       OK ] QLinearPoolTest.AveragePool1D_IncludePadPixel_nhwc_S8 (146 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_nhwc_S8
[       OK ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_nhwc_S8 (146 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_IncludePadPixel_nhwc_S8
[       OK ] QLinearPoolTest.AveragePool2D_IncludePadPixel_nhwc_S8 (158 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_MultiChannel_nhwc_S8
[       OK ] QLinearPoolTest.AveragePool2D_MultiChannel_nhwc_S8 (157 ms)
[ RUN      ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_nhwc_S8
[       OK ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_nhwc_S8 (145 ms)
[ RUN      ] QLinearPoolTest.AveragePool3D_IncludePadPixel_nhwc_S8
[       OK ] QLinearPoolTest.AveragePool3D_IncludePadPixel_nhwc_S8 (147 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_BigImage_S8
[       OK ] QLinearPoolTest.AveragePool2D_BigImage_S8 (537 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_BigImage_nhwc_S8
[       OK ] QLinearPoolTest.AveragePool2D_BigImage_nhwc_S8 (173 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_Global_S8
[       OK ] QLinearPoolTest.AveragePool2D_Global_S8 (457 ms)
[ RUN      ] QLinearPoolTest.AveragePool2D_Global_nhwc_S8
[       OK ] QLinearPoolTest.AveragePool2D_Global_nhwc_S8 (150 ms)
[----------] 36 tests from QLinearPoolTest (12914 ms total)

[----------] Global test environment tear-down
[==========] 72 tests from 2 test suites ran. (18885 ms total)
[  PASSED  ] 72 tests.
memleakdbg:
----- No memory leaks detected -----
```
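For readers unfamiliar with the operator, the following is a minimal reference sketch of what a quantized global average pool computes over a flat list of values: dequantize, average, requantize, saturate. The function and parameter names are hypothetical, and the rounding mode (Python's built-in `round`, which rounds half to even) is an assumption; this is not the DML EP code.

```python
def qlinear_global_average_pool_sketch(x_q, x_scale, x_zp, y_scale, y_zp,
                                       qmin=0, qmax=255):
    """Hypothetical reference: average quantized values and requantize.

    x_q is a flat list of quantized integers. qmin/qmax default to the
    uint8 range; pass -128/127 for the int8 (S8) variants.
    """
    # Dequantize each element to a real value.
    real = [(v - x_zp) * x_scale for v in x_q]
    # Average in the real domain.
    mean = sum(real) / len(real)
    # Requantize to the output scale and zero point (rounding mode assumed).
    q = round(mean / y_scale) + y_zp
    # Saturate to the representable range of the output type.
    return max(qmin, min(qmax, q))
```

For example, `qlinear_global_average_pool_sketch([10, 20, 30, 40], 0.5, 0, 0.25, 0)` averages the real values 5, 10, 15, 20 to 12.5 and requantizes that to 50.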

### Description
This PR also includes,
8b0a55e DML constant pow operator
7520974 Enable custom heaps based on query-

---------

Co-authored-by: Jeff Bloomfield <[email protected]>
…tion (#18370)


Co-authored-by: Jeff Bloomfield <[email protected]>

Co-authored-by: Jeff Bloomfield <[email protected]>

---------

Co-authored-by: Jeff Bloomfield <[email protected]>

Co-authored-by: Jeff Bloomfield <[email protected]>
[Cherry pick Reviewed]

Re-add changes which were merged out...

---------


Co-authored-by: Sheil Kumar <[email protected]>
Co-authored-by: Sheil Kumar <[email protected]>
CP 7fd1ce9 for onnxruntime_perf_test
changes.

Co-authored-by: Sheil Kumar <[email protected]>
### Description
1. Expand input datatype support for Resize to include uint8/int8.
2. Update the logic that computes the output shape of the Resize op. `roiRange` is
removed so that the computation matches how the tests compute the output shape,
which works around the size assertion in MLOperatorAuthorImpl.cpp:
`m_inputDimensions[i] * roiRange * scale` -> `m_inputDimensions[i] *
scale`
3. Disable 4 tests because of a result mismatch. The DML results with
float32 and with uint8/int8 match each other, so the problem is most likely in the
Resize implementation itself, which is out of scope for this PR:

- `ResizeOpTest.NhwcResizeOpLinearDownSampleTest_tf_crop_and_resize_without_extrapolation_uint8`
- `ResizeOpTest.NhwcResizeOpLinearDownSampleTest_tf_crop_and_resize_without_extrapolation_int8`
- `ResizeOpTest.NhwcResizeOpLinearDownSampleTest_4DBilinear_pytorch_half_pixel_uint8`
- `ResizeOpTest.NhwcResizeOpLinearDownSampleTest_4DBilinear_pytorch_half_pixel_int8`
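
The shape change in item 2 can be illustrated with a small sketch. The function names are hypothetical, and flooring the product is an assumption taken from the ONNX Resize definition rather than from the DML EP source.

```python
import math


def resize_output_shape(input_dims, scales):
    """New behaviour (sketch): each output dim is input_dim * scale."""
    return [math.floor(d * s) for d, s in zip(input_dims, scales)]


def resize_output_shape_with_roi(input_dims, scales, roi_ranges):
    """Old behaviour (sketch): the ROI range also scaled each dim."""
    return [math.floor(d * r * s)
            for d, r, s in zip(input_dims, roi_ranges, scales)]


new_shape = resize_output_shape([1, 3, 4, 4], [1.0, 1.0, 2.0, 2.0])
old_shape = resize_output_shape_with_roi(
    [1, 3, 4, 4], [1.0, 1.0, 2.0, 2.0], [1.0, 1.0, 0.5, 0.5])
```

With a ROI covering half of each spatial axis, the old formula yields `[1, 3, 4, 4]` while the new one yields `[1, 3, 8, 8]`, which is the kind of disagreement that would trip a size assertion.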
Update resource creation flag to avoid D3D12 WARNING

### Description
Update the DML DX12 allocator to use D3D12_RESOURCE_STATE_COMMON to avoid
DX12 warning messages.



### Motivation and Context
When DirectML is created with the debug layer enabled, warnings are emitted when
ORT creates resources.

---------

Co-authored-by: Christian Larson <[email protected]>
### Description
[DirectML EP] Add DML EP registration for Col2Im operator

### Motivation and Context
Add Col2Im support for opset 18.
This operator is implemented as the DirectML Fold operator.
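
As a rough illustration of what a fold (Col2Im) computes, here is a minimal 1D sketch in pure Python. It is not the DirectML implementation; the function name, argument names, and the restriction to a single 1D image without a batch dimension are all simplifying assumptions.

```python
def col2im_1d(cols, channels, image_len, kernel, stride=1, dilation=1, pad=0):
    """Hypothetical 1D fold: scatter-add sliding-window columns into an image.

    cols has channels * kernel rows; each row has one entry per window
    position. Overlapping window contributions are summed.
    """
    positions = (image_len + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1
    assert len(cols) == channels * kernel
    assert all(len(row) == positions for row in cols)

    image = [[0] * image_len for _ in range(channels)]
    for c in range(channels):
        for j in range(kernel):
            row = cols[c * kernel + j]
            for l in range(positions):
                # Location in the image of kernel tap j at window l.
                pos = l * stride + j * dilation - pad
                if 0 <= pos < image_len:  # skip taps that fall in the padding
                    image[c][pos] += row[l]
    return image
```

For example, `col2im_1d([[1, 2], [3, 4]], channels=1, image_len=3, kernel=2)` returns `[[1, 5, 4]]`: the middle pixel is covered by both windows, so its two contributions are summed.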

---------

Co-authored-by: Sheil Kumar <[email protected]>
Co-authored-by: Dwayne Robinson <[email protected]>
Hide Col2Im registration behind DML_TARGET_VERSION 6300

Co-authored-by: Sheil Kumar <[email protected]>
…18866)

### Description
This fixes a bug in a fast path that was added for submission of
re-used command lists of fused graph kernels in the DML EP. The fix
resolves a D3D debug layer error.

### Motivation and Context
The fast path in DmlCommandRecorder::ExecuteCommandList enabled a
current non-reused command list, if empty, to be used for commands
following submission of the fused command list. The fix ensures the
associated command allocator is only re-used after the next fence value
is completed, which is higher due to submission of the other command
list.

The command recorder design was intended to support batching of provided
command list execution; however, it submits command lists immediately as
an implementation detail to maximize CPU/GPU parallelism. If that
heuristic were removed, it would expose additional issues in this same
fast path. Because of this, and because of the complexity and inefficiency
of the old batching mechanism, I also removed that mechanism.
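
The allocator-reuse rule described above can be modelled with a small simulation. This is a sketch under stated assumptions: `QueueModel` and `AllocatorPool` are hypothetical stand-ins for a D3D12 command queue with a fence and for a pool of command allocators; they are not the classes used in the DML EP.

```python
from collections import deque


class QueueModel:
    """Hypothetical model of a command queue with a monotonically rising fence."""

    def __init__(self):
        self.last_submitted = 0
        self.completed = 0

    def submit(self):
        # Each submission signals the next fence value.
        self.last_submitted += 1
        return self.last_submitted

    def gpu_progress(self, value):
        # The GPU has finished all work up to this fence value.
        self.completed = max(self.completed, value)


class AllocatorPool:
    """Hands out an allocator only once its recorded fence value has completed."""

    def __init__(self, queue):
        self.queue = queue
        self.in_flight = deque()  # (fence_value, allocator_name), oldest first
        self.created = 0

    def release(self, name, fence_value):
        self.in_flight.append((fence_value, name))

    def acquire(self):
        if self.in_flight and self.in_flight[0][0] <= self.queue.completed:
            return self.in_flight.popleft()[1]
        self.created += 1
        return f"allocator{self.created}"


queue = QueueModel()
pool = AllocatorPool(queue)
current = pool.acquire()        # allocator backing the current command list
fused_fence = queue.submit()    # re-used fused command list is submitted first
next_fence = queue.submit()     # commands that follow land on a higher fence value
pool.release(current, next_fence)

queue.gpu_progress(fused_fence)
while_busy = pool.acquire()     # fence 2 not complete yet: a new allocator is made
queue.gpu_progress(next_fence)
after_done = pool.acquire()     # now the original allocator may be re-used
```

Tagging the allocator with `fused_fence` instead of `next_fence` would allow it to be handed out while the GPU may still be reading from it, which is the class of error the debug layer reports.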
#18862)

### Description
Cleanup and rebase from [this
PR](#18629)



### Motivation and Context

---------

Co-authored-by: Christian Larson <[email protected]>
Co-authored-by: Christian Larson <[email protected]>
Co-authored-by: Jeff Bloomfield <[email protected]>
Co-authored-by: Anagha Rao <[email protected]>
#18915)

### Description
This limits the size of constant data nodes which the DML EP creates in
the DML graph following de-duplication of 1D quantization tensors. In
the process it reduces a check for the maximum size of the constant
node.

This is merged from: #18494

### Description
This enables QDQ transforms with the DML EP
@jeffbloo jeffbloo merged commit 55a6694 into main Jan 5, 2024
92 of 100 checks passed
@jeffbloo jeffbloo deleted the WindowsAI branch January 5, 2024 01:21
8 participants