
DMLEP QAttention update causal #19533

Merged: raoanag merged 6 commits into WindowsAI-dev from updateCausal on Mar 4, 2024

Conversation

@raoanag (Contributor) commented Feb 15, 2024

Description

Bug fix for `QAttentionTest.QAttentionPastState*` test failures.

The change generates a causal mask, an upper-triangular Boolean matrix, as input to the MHA mask. DML internally adds `maskFilterValue` to the "off" bits in the mask and sets the "on" bits to 0.
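For illustration, here is a minimal standalone sketch (hypothetical helper, not the DML EP implementation) of how such a causal mask translates into the additive bias described above: the "off" bits (future positions, the strict upper triangle) receive `maskFilterValue`, and the "on" bits stay 0.

```
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the DML EP code: builds the additive form of a
// causal mask for a sequence of length n. Element (i, j) is "off" when
// j > i (a future position), so the "off" bits form the strict upper
// triangle of an n x n matrix.
std::vector<float> MakeCausalAdditiveMask(std::size_t n, float maskFilterValue) {
    std::vector<float> mask(n * n, 0.0f);       // "on" bits -> 0
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            mask[i * n + j] = maskFilterValue;  // "off" bits -> maskFilterValue
        }
    }
    return mask;  // added to the attention logits before softmax
}
```

With a sufficiently negative `maskFilterValue`, the masked logits vanish after softmax, which is what makes the attention causal.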

```
Note: Google Test filter = *QAttention*
[==========] Running 14 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from CPU_U8S8_Precision_Tests
[ RUN      ] CPU_U8S8_Precision_Tests.QAttention
[       OK ] CPU_U8S8_Precision_Tests.QAttention (124 ms)
[----------] 1 test from CPU_U8S8_Precision_Tests (124 ms total)

[----------] 13 tests from QAttentionTest
[ RUN      ] QAttentionTest.QAttentionBatch1
[       OK ] QAttentionTest.QAttentionBatch1 (531 ms)
[ RUN      ] QAttentionTest.QAttentionBatch1_Float16
[       OK ] QAttentionTest.QAttentionBatch1_Float16 (0 ms)
[ RUN      ] QAttentionTest.QAttentionBatch2
[       OK ] QAttentionTest.QAttentionBatch2 (441 ms)
[ RUN      ] QAttentionTest.QAttentionMaskPartialSequence
[       OK ] QAttentionTest.QAttentionMaskPartialSequence (410 ms)
[ RUN      ] QAttentionTest.QAttentionMaskExceedSequence
[       OK ] QAttentionTest.QAttentionMaskExceedSequence (398 ms)
[ RUN      ] QAttentionTest.QAttentionNoMaskIndex
[       OK ] QAttentionTest.QAttentionNoMaskIndex (389 ms)
[ RUN      ] QAttentionTest.QAttentionUnidirectional_U8U8
[       OK ] QAttentionTest.QAttentionUnidirectional_U8U8 (11 ms)
[ RUN      ] QAttentionTest.QAttentionUnidirectional_U8S8
[       OK ] QAttentionTest.QAttentionUnidirectional_U8S8 (10 ms)
[ RUN      ] QAttentionTest.QAttentionUnidirectional_CUDA
[       OK ] QAttentionTest.QAttentionUnidirectional_CUDA (0 ms)
[ RUN      ] QAttentionTest.QAttentionPastState_u8u8
[       OK ] QAttentionTest.QAttentionPastState_u8u8 (2683 ms)
[ RUN      ] QAttentionTest.QAttentionPastState_u8s8
[       OK ] QAttentionTest.QAttentionPastState_u8s8 (2674 ms)
[ RUN      ] QAttentionTest.QAttentionPrunedModel
[       OK ] QAttentionTest.QAttentionPrunedModel (399 ms)
[ RUN      ] QAttentionTest.SharedPrepackedWeights
[       OK ] QAttentionTest.SharedPrepackedWeights (89 ms)
[----------] 13 tests from QAttentionTest (8047 ms total)

[----------] Global test environment tear-down
[==========] 14 tests from 2 test suites ran. (8175 ms total)
[  PASSED  ] 14 tests.
memleakdbg:
----- No memory leaks detected -----
```

Motivation and Context

@raoanag changed the base branch from main to WindowsAI-Old on Feb 15, 2024
@raoanag changed the base branch from WindowsAI-Old to WindowsAI-dev on Feb 23, 2024
@raoanag force-pushed the updateCausal branch 2 times, most recently from 3d81068 to 15dc524, on Feb 28, 2024
@raoanag changed the title from "Update causal" to "DMLEP QAttention update causal" on Feb 28, 2024
@raoanag (Contributor, Author) commented Feb 28, 2024

Windows CI GPU Pipeline

@raoanag marked this pull request as ready for review on Feb 28, 2024
@sumitsays (Contributor) commented
The default -10000.0 `MaskFilterValue` was originally taken from the CPU EP when it still used that value. Now that the CPU EP uses `std::numeric_limits<float>::lowest()`, could you update line 412 and verify the test passes? Ideally it should pass with `std::numeric_limits<float>::lowest()` as well; if so, please update it to keep the DML EP consistent with the CPU EP.

```
@@ -407,7 +416,7 @@ class DmlOperatorQAttention : public DmlOperator
     mhaOperatorDesc.RelativePositionBiasTensor = nullptr;
     mhaOperatorDesc.OutputTensor = &outputDescs[outputIndex];
     mhaOperatorDesc.Scale = kernelCreationContext.GetOptionalAttribute<float>(AttrName::Scale, gsl::narrow_cast<float>(1.0f / std::sqrt(headSize)));
-    mhaOperatorDesc.MaskFilterValue = kernelCreationContext.GetOptionalAttribute<float>(AttrName::MaskFilterValue, -10'000.0f);
+    mhaOperatorDesc.MaskFilterValue = std::numeric_limits<float>::lowest();
```
A contributor replied:
I think we should still give precedence to the user-provided value and just change the default value to `lowest()`:

```
mhaOperatorDesc.MaskFilterValue = kernelCreationContext.GetOptionalAttribute<float>(AttrName::MaskFilterValue, std::numeric_limits<float>::lowest());
```

Another contributor replied:
Did the CPU and CUDA EPs change the default to `lowest()`? The contrib ops page still documents -10000 as the default: https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QAttention

@raoanag (Contributor, Author) commented Feb 29, 2024
For unidirectional, that does not seem to be the case, per the CPU reference: https://github.com/microsoft/onnxruntime/blob/d5606cd7ee394ba9444ef509021720ebe63c9856/onnxruntime/contrib_ops/cpu/bert/attention_helper.h#L142C1-L149C6

I will use `std::numeric_limits<float>::lowest()` only when unidirectional is set.
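A sketch of the resulting default selection (hypothetical standalone form; the actual DML EP change may be structured differently): `lowest()` becomes the default only on the unidirectional path, while an explicitly provided `MaskFilterValue` attribute still takes precedence via `GetOptionalAttribute`.

```
#include <limits>

// Hypothetical sketch, not the exact DML EP change: pick the default
// MaskFilterValue, using lowest() only when unidirectional (causal)
// attention is requested.
float DefaultMaskFilterValue(bool unidirectional) {
    return unidirectional ? std::numeric_limits<float>::lowest() : -10'000.0f;
}

// Assumed call-site shape, mirroring the diff above:
//   mhaOperatorDesc.MaskFilterValue = kernelCreationContext.GetOptionalAttribute<float>(
//       AttrName::MaskFilterValue, DefaultMaskFilterValue(unidirectional));
```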

@raoanag merged commit 21ba803 into WindowsAI-dev on Mar 4, 2024
41 of 52 checks passed
@raoanag deleted the updateCausal branch on Mar 4, 2024
@raoanag restored the updateCausal branch on Mar 4, 2024
raoanag added a commit that referenced this pull request Mar 4, 2024
raoanag added a commit that referenced this pull request Mar 4, 2024
raoanag added a commit that referenced this pull request Mar 9, 2024