Add GQA support for ROCm #21032

cloudhan · 2024-06-13T09:36:09Z

depends on

cloudhan · 2024-06-28T07:36:30Z

The CI test revealed a failure like the following:

kw = {}

    @wraps(func)
    def standalone_func(*a, **kw):
>       return func(*(a + p.args), **p.kwargs, **kw)

.local/lib/python3.9/site-packages/parameterized/parameterized.py:620: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/onnxruntime_src/onnxruntime/test/python/transformers/test_flash_attn_rocm.py:58: in test_gqa_past_flash_attention
    parity_check_gqa_past(
/onnxruntime_src/onnxruntime/test/python/transformers/test_flash_attn_cuda.py:1702: in parity_check_gqa_past
    numpy.testing.assert_allclose(out, out_ref, rtol=rtol, atol=atol, equal_nan=True, err_msg=err_msg)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (<function assert_allclose.<locals>.compare at 0x7f28fe08e280>, array([[[[-8.6060e-03,  4.1046e-02, -2.5604e-02, ..., ...n, nan],
         [nan, nan, nan, ..., nan, nan, nan],
         [nan, nan, nan, ..., nan, nan, nan]]]], dtype=float16))
kwds = {'equal_nan': True, 'err_msg': ' with Config(batch_size=5, sequence_length=1, kv_sequence_length=2048, past_sequence_l...ue, rotary_interleaved=False, packed=True', 'header': 'Not equal to tolerance rtol=0.002, atol=0.002', 'verbose': True}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError: 
E           Not equal to tolerance rtol=0.002, atol=0.002
E            with Config(batch_size=5, sequence_length=1, kv_sequence_length=2048, past_sequence_length=227, num_heads=32, kv_num_heads=8, head_size=256, ep=rocm), causal=True, local=False, past_format=1, rotary=True, rotary_interleaved=False, packed=True
E           x and y nan location mismatch:
E            x: array([[[[-8.6060e-03,  4.1046e-02, -2.5604e-02, ..., -7.4829e-02,
E                      5.8060e-03, -2.0828e-03],
E                    [ 4.0207e-03,  7.6523e-03,  1.5244e-02, ..., -4.6326e-02,...
E            y: array([[[[nan, nan, nan, ..., nan, nan, nan],
E                    [nan, nan, nan, ..., nan, nan, nan],
E                    [nan, nan, nan, ..., nan, nan, nan],...

/opt/miniconda/envs/rocm-ci/lib/python3.9/contextlib.py:79: AssertionError

Other tests also showed some sparse 'inf' values. However, these appeared in the y value, i.e. the reference value. I reproduced many of these issues locally, and updating torch (along with torch triton) to 2.3.1 eliminated all of them.
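To make the diagnosis above concrete, here is a minimal sketch of how one might check which side of the comparison carries the non-finite values. The arrays are illustrative stand-ins, not real test data, and `count_non_finite` is a hypothetical helper that is not part of the test suite.

```python
import numpy as np


def count_non_finite(arr):
    """Return (nan_count, inf_count) for an array. Hypothetical helper."""
    arr = np.asarray(arr)
    return int(np.isnan(arr).sum()), int(np.isinf(arr).sum())


# Illustrative stand-ins for the tensors in the log above (assumed values).
out = np.array([-8.6060e-03, 4.1046e-02, -2.5604e-02], dtype=np.float16)
out_ref = np.full(3, np.nan, dtype=np.float16)

out_counts = count_non_finite(out)
ref_counts = count_non_finite(out_ref)

# equal_nan=True only treats NaNs as equal when they sit at the same positions,
# so a reference full of NaN still fails against a finite output.
try:
    np.testing.assert_allclose(out, out_ref, rtol=0.002, atol=0.002, equal_nan=True)
    mismatch = False
except AssertionError:
    mismatch = True

print("output (nan, inf):", out_counts)
print("reference (nan, inf):", ref_counts)
print("comparison failed:", mismatch)
```

If the counts are non-zero only for the reference, the failure points at the reference implementation or its dependencies rather than at the kernel under test, which matches the observation that updating torch resolved it.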

tianleiwu · 2024-07-01T17:22:41Z

LGTM except there is a build error:

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1423537&view=logs&j=7536d2cd-87d4-54fe-4891-bfbbf2741d83&t=66420422-c7d6-5f71-625c-4b7851c9b9ba&l=3997

CMakeFiles/onnxruntime_providers_rocm.dir/onnxruntime_src/onnxruntime/contrib_ops/rocm/bert/skip_layer_norm_impl.cu.o
/onnxruntime_src/onnxruntime/contrib_ops/rocm/bert/group_query_attention.cu:5:10: fatal error: 'ck_tile/core/numeric/integer.hpp' file not found
#include "ck_tile/core/numeric/integer.hpp"

cloudhan · 2024-07-03T02:03:15Z

@snnn This needs an es approval. Some packages in CI were updated because the reference impl produced some nan and inf values; see my previous comment.

The test_flash_attn_rocm.py from #21032 failed frequently. For example, I saw two failed jobs today:

E Max absolute difference: 0.002167
E Max absolute difference: 0.002686

Adjust the abs threshold from 0.002 to 0.005, and use the default relative tolerance rtol=0.001.
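The effect of the tolerance change can be sketched as follows. This assumes the comparison uses numpy.testing.assert_allclose as in the traceback earlier, and that the reference values are near zero so the absolute threshold dominates. The arrays are made up from the two reported differences, and `within_tolerance` is a hypothetical helper.

```python
import numpy as np


def within_tolerance(out, ref, rtol, atol):
    """Return True if assert_allclose accepts the pair. Hypothetical helper."""
    try:
        np.testing.assert_allclose(out, ref, rtol=rtol, atol=atol, equal_nan=True)
        return True
    except AssertionError:
        return False


# Assumed data: a zero reference plus the two reported max absolute differences.
ref = np.zeros(4, dtype=np.float32)
out = ref + np.array([0.0, 0.002167, 0.002686, 0.0], dtype=np.float32)

# assert_allclose passes when |out - ref| <= atol + rtol * |ref| elementwise.
old_ok = within_tolerance(out, ref, rtol=0.002, atol=0.002)
new_ok = within_tolerance(out, ref, rtol=0.001, atol=0.005)

print("old thresholds pass:", old_ok)
print("new thresholds pass:", new_ok)
```

With a reference close to zero the relative term contributes almost nothing, so raising the absolute threshold is what lets differences of this size through.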

cloudhan added 8 commits June 4, 2024 08:23

feat: init rocm gqa

c877adc

feat: extend strided copy to support runtime tok idx

f845099

more case

0ea3335

feat: local

99b2feb

feat: rotary

816249c

feat: allow rotary to read and write in a strided way, so that we don't need to explicit unpack the packed qkv tensor

6024dc9

fix: rotary for packed qkv

48092ee

remove debug print

de2f30a

github-advanced-security bot found potential problems Jun 13, 2024

onnxruntime/test/python/transformers/test_flash_attn_rocm.py Fixed

cloudhan force-pushed the guangyunhan/rocm-gqa branch 5 times, most recently from b6be9bd to 14d1a1a June 20, 2024 14:43

cloudhan added 3 commits June 21, 2024 00:59

workaround: add flash_attn test to ci

6c4e612

add gpu arch checking warning log

e9f6d13

fix: build without ck tile

2b0c46e

cloudhan force-pushed the guangyunhan/rocm-gqa branch from 14d1a1a to 2b0c46e June 21, 2024 00:59

cloudhan added 3 commits June 28, 2024 08:23

test: update ci pytorch and triton version to fix tests which have failed with nan and inf from reference values

6091a69

format

8ca0634

remove unused param is_input_bnsh_format from strided version LaunchRotaryEmbeddingKernel

e22dfb9

cloudhan marked this pull request as ready for review July 1, 2024 04:18

cloudhan requested a review from a team as a code owner July 1, 2024 04:18

cloudhan requested review from tianleiwu and yufenglee July 1, 2024 04:19

cloudhan added 2 commits July 1, 2024 10:32

make onnxruntime_USE_COMPOSABLE_KERNEL_CK_TILE depends on onnxruntime_USE_COMPOSABLE_KERNEL

789fee7

skip test_flash_attn_rocm on CUDA platform

b973217

github-advanced-security bot found potential problems Jul 1, 2024

onnxruntime/test/python/transformers/test_flash_attn_rocm.py Fixed

cloudhan added 2 commits July 2, 2024 00:46

fix lint and ci

c3c7089

fix typo

f4355d4

cloudhan force-pushed the guangyunhan/rocm-gqa branch from 1384ff4 to f4355d4 July 2, 2024 00:49

tianleiwu approved these changes Jul 2, 2024

mszhanyi approved these changes Jul 3, 2024

cloudhan merged commit f39ee14 into main Jul 3, 2024
92 of 100 checks passed

cloudhan deleted the guangyunhan/rocm-gqa branch July 3, 2024 06:55

tianleiwu mentioned this pull request Jul 16, 2024

[ROCM] adjust test_flash_attn_rocm test tolerance #21379

Merged
