Add Continuous Decoding support in GQA #21523

aciddelgado · 2024-07-26T17:02:29Z

Description

This PR adds support for Continuous Decoding for batch_size = 1 input. With this change, GQA can take input of arbitrary length, with seqlens_k set to total_sequence_length - 1 and the sequence length of qkv set to new_sequence_length.

This change does not affect the default behavior of GQA.
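
As a rough illustration of the length bookkeeping described above, the sketch below computes the values a caller would pass for one continuous-decoding step. It is not the operator's implementation. The names `GqaStepLengths`, `continuous_decoding_step`, and `causal_mask` are hypothetical, and the sketch assumes that total_sequence_length equals the past length plus new_sequence_length and that attention is causal over the new tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GqaStepLengths:
    """Hypothetical container for the lengths named in the description."""
    past_sequence_length: int
    new_sequence_length: int
    total_sequence_length: int
    seqlens_k: int


def continuous_decoding_step(past_sequence_length: int,
                             new_sequence_length: int) -> GqaStepLengths:
    """Compute lengths for one step with batch_size = 1.

    Assumption: total_sequence_length = past + new. The description
    states that seqlens_k is total_sequence_length - 1.
    """
    if past_sequence_length < 0 or new_sequence_length < 1:
        raise ValueError("need past >= 0 and at least one new token")
    total = past_sequence_length + new_sequence_length
    return GqaStepLengths(
        past_sequence_length=past_sequence_length,
        new_sequence_length=new_sequence_length,
        total_sequence_length=total,
        seqlens_k=total - 1,
    )


def causal_mask(past_sequence_length: int,
                new_sequence_length: int) -> List[List[bool]]:
    """Illustrative causal mask (an assumption, not taken from the PR).

    Row i is the i-th new token; column j is key position j in the
    concatenated past + new sequence. New token i may attend to every
    past position and to new tokens 0..i.
    """
    total = past_sequence_length + new_sequence_length
    return [
        [j <= past_sequence_length + i for j in range(total)]
        for i in range(new_sequence_length)
    ]


# Example: 8 tokens already in the KV cache, 3 new tokens in this step.
step = continuous_decoding_step(past_sequence_length=8, new_sequence_length=3)
mask = causal_mask(step.past_sequence_length, step.new_sequence_length)
```

With new_sequence_length = 1 this reduces to the usual token-generation case, and with an empty past it reduces to the prompt case, which is consistent with the statement that default behavior is unchanged.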

Motivation and Context

Prior to this change, it was impossible to support sequence_length > 1 inputs when past context was given. This use case is essential to making continuous decoding work, which is one of our current efforts in ORT-GenAI.

Results are validated with model-generate.py by using an int4 quantized model as the original model's assistant. The output sequence is the same, and increased tps is observed.

NOTE: Only MHA decoder-only models, batch size 1, CPU, and greedy (top-1) selection are supported in this initial version. GQA needs microsoft/onnxruntime#21523 to support seqlen > 1 in the token phase.

* Updated builder.py to produce an MHA graph that supports seqlen > 1 in the token phase.
* Introduced speculative decoding, currently through a separate Generator class. This can potentially be merged with the existing Generator at either the API level or the implementation level.
* Extended various components to support speculative search. Previously, most methods were hardcoded to assume seqlen == 1 for the token phase.
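
To show why the output sequence can stay the same under greedy selection, here is a minimal sketch of greedy speculative decoding with an assistant model. It is an illustration under stated assumptions, not the ORT-GenAI implementation: models are reduced to hypothetical callables that return the greedy next token, and `speculative_greedy_decode`, `toy_target`, and `toy_assistant` are invented names. A real implementation would verify the whole draft in one forward pass with seqlen > 1, which is the case this PR enables for GQA; the sketch calls the target once per position for clarity.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: token sequence -> greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int],
                  max_new_tokens: int) -> List[int]:
    """Plain greedy decoding, one token per step."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(model(seq))
    return seq


def speculative_greedy_decode(target: NextToken, assistant: NextToken,
                              prompt: List[int], max_new_tokens: int,
                              draft_len: int = 4) -> List[int]:
    """Greedy speculative decoding sketch (batch size 1)."""
    if draft_len < 1:
        raise ValueError("draft_len must be at least 1")
    seq = list(prompt)
    limit = len(seq) + max_new_tokens
    while len(seq) < limit:
        # 1. The assistant drafts up to draft_len tokens.
        draft: List[int] = []
        for _ in range(min(draft_len, limit - len(seq))):
            draft.append(assistant(seq + draft))
        # 2. The target checks each drafted position. Every appended
        #    token is the target's own greedy choice, so the result
        #    matches plain greedy decoding with the target.
        for proposed in draft:
            verified = target(seq)
            seq.append(verified)
            if verified != proposed:
                break  # discard the rest of the draft
    return seq


def toy_target(seq: List[int]) -> int:
    """Toy deterministic 'model' used only for this illustration."""
    return (31 * sum(seq) + len(seq)) % 50


def toy_assistant(seq: List[int]) -> int:
    """Toy assistant that follows the target except at every third length."""
    return toy_target(seq) if len(seq) % 3 else 0


plain = greedy_decode(toy_target, [1, 2, 3], 20)
speculative = speculative_greedy_decode(toy_target, toy_assistant, [1, 2, 3], 20)
```

Because only tokens confirmed by the target are kept, a weaker assistant changes how many drafted tokens are accepted per step, not the final sequence.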

tianleiwu · 2024-09-11T18:11:00Z

Please fix PREfast warnings.

tianleiwu · 2024-09-11T21:32:53Z

It seems that you also need to change this line:
https://github.com/microsoft/onnxruntime/blob/4d824045444756ba70223c32ae11693a252adde6/onnxruntime/contrib_ops/rocm/bert/group_query_attention.cu#L8C1-L8C64

tianleiwu · 2024-09-11T23:24:12Z

Python format check failed. Please run lintrunner.

gqa supports interactive

ee47ba4

aciddelgado requested review from tianleiwu and yufenglee July 26, 2024 17:02

github-advanced-security bot found potential problems Jul 26, 2024

onnxruntime/test/python/transformers/test_gqa_cpu.py Fixed

onnxruntime/test/python/transformers/test_gqa_cpu.py Fixed

github-advanced-security bot found potential problems Jul 26, 2024

onnxruntime/test/python/transformers/test_gqa_cpu.py Fixed

onnxruntime/test/python/transformers/test_flash_attn_cuda.py Fixed

onnxruntime/test/python/transformers/test_gqa_cpu.py Fixed

lint, clang, clean-up manually

d938816

github-advanced-security bot found potential problems Jul 26, 2024

aciddelgado added 4 commits July 29, 2024 09:33

Merge branch 'main' into aciddelgado/gqa_interactive

01cd33b

new idea, seqlense_q

fdc84b4

cpu update

1cddf5f

cpu almost works but segfaults on non-interactive prompt but we gotta switch up the plan again

8e3483e

github-advanced-security bot found potential problems Jul 31, 2024

onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc Fixed

onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h Fixed

onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h Fixed

single batch implementation unclean

60fe746

github-advanced-security bot found potential problems Aug 6, 2024

onnxruntime/test/python/transformers/test_flash_attn_cuda.py Fixed

onnxruntime/test/python/transformers/test_flash_attn_cuda.py Fixed

aciddelgado added 10 commits August 6, 2024 14:53

clean up code

dd3c4a6

clang lint

3565dc2

changes

bd83af7

trigger pipelines and whatnot

aaa9866

merge main

d4e72f8

pipeline

d8f37c0

pipelines

548ab9b

merge main

8358819

minor rotary test change

5ff6050

pls

255fa1c

yufenglee reviewed Sep 9, 2024

onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h Resolved

yufenglee reviewed Sep 9, 2024

onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h Resolved

fixes

11c4a0e

aciddelgado marked this pull request as ready for review September 9, 2024 23:23

yufenglee reviewed Sep 9, 2024

onnxruntime/contrib_ops/cpu/bert/attention_common.h Outdated, resolved

yufenglee reviewed Sep 10, 2024

onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu Outdated, resolved

yufenglee reviewed Sep 10, 2024

onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu Outdated, resolved

yufenglee reviewed Sep 10, 2024

onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu Resolved

docs

4a86c55

aciddelgado changed the title ~~Add Interactive Decoding support in GQA~~ Add Continuous Decoding support in GQA Sep 10, 2024

aciddelgado added 2 commits September 10, 2024 10:31

address comments

0be4962

docs

c0cb4c5

yufenglee reviewed Sep 11, 2024

onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu Outdated, resolved

yufenglee previously approved these changes Sep 11, 2024

tianleiwu reviewed Sep 11, 2024

docs/ContribOperators.md Outdated, resolved

tianleiwu reviewed Sep 11, 2024

onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h Outdated, resolved

tianleiwu reviewed Sep 11, 2024

onnxruntime/contrib_ops/cpu/bert/group_query_attention_helper.h Outdated, resolved

tianleiwu reviewed Sep 11, 2024

onnxruntime/contrib_ops/cuda/bert/group_query_attention_helper.h Outdated, resolved

comments

019b058

aciddelgado dismissed yufenglee’s stale review via 019b058 September 11, 2024 20:59

remove cuda helper

e46ac2d

rocm

a9e4c76

aciddelgado added 2 commits September 11, 2024 17:19

lint

d92e0f4

docs

f089af3

tianleiwu reviewed Sep 12, 2024

onnxruntime/core/graph/contrib_ops/bert_defs.cc Outdated, resolved

description

5e8d419

tianleiwu reviewed Sep 13, 2024

docs/ContribOperators.md Outdated, resolved

docs

36ca4d1

tianleiwu approved these changes Sep 13, 2024

aciddelgado merged commit 7e2c722 into main Sep 13, 2024
87 checks passed

aciddelgado deleted the aciddelgado/gqa_interactive branch September 13, 2024 20:21
