
Attention projections (QKV, O) disaggregation #1436

Merged: 28 commits into inference on Oct 9, 2024

Conversation

@yingchen21 (Collaborator) commented on Jul 10, 2024:

Description of changes:
This PR moves the QKV projection (and the output projection) out of the attention operator into separate dense layers, so that LoRA can be applied to the QKV and output projections.

It also adds support for the Llama 3, Llama 3.1, and Llama 3.2 models.
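For context, the wiring after this change looks roughly like the sketch below. This is an illustrative stub only (the Tensor type, dense, attention_core, attention_block, and the layer names are placeholders, not FlexFlow's actual API): the point is that the QKV and output projections become ordinary dense layers around a projection-free attention core, so LoRA can attach to them.

    // Illustrative sketch with stand-in types; not FlexFlow's real classes or signatures.
    #include <string>

    struct Tensor { int dim; };  // placeholder for a framework tensor handle

    // Placeholder dense layer; after this PR the QKV and output projections
    // are layers like this, so LoRA adapters can target them directly.
    Tensor dense(Tensor in, int out_dim, const std::string &name) {
      (void)in; (void)name;
      return Tensor{out_dim};
    }

    // Placeholder attention core that no longer owns any projection weights.
    Tensor attention_core(Tensor qkv, int num_heads) {
      (void)num_heads;
      return Tensor{qkv.dim / 3};
    }

    Tensor attention_block(Tensor x, int hidden_size, int num_heads) {
      Tensor qkv = dense(x, 3 * hidden_size, "attn_qkv_proj");  // potential LoRA target
      Tensor mha = attention_core(qkv, num_heads);              // fused attention kernel
      return dense(mha, hidden_size, "attn_o_proj");            // potential LoRA target
    }

    int main() {
      attention_block(Tensor{4096}, 4096, 32);
      return 0;
    }

In this layout the attention kernel consumes an already-projected QKV tensor and returns the pre-projection output, so the projections are visible to the rest of the stack (tensor parallelism, fusion, PEFT) as regular linear layers.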


@yingchen21 (Collaborator, Author) commented:

This PR has been implemented for IncMultiHeadSelfAttention, TreeIncMultiHeadSelfAttention, and SpecIncMultiHeadSelfAttention. The CUDA implementation is tested under TP=2 and TP=4, both with and without fusion.
The backward pass was tested on an earlier commit; due to an issue with the PEFT test script, the backward pass of the latest commit has not been tested yet.

@yingchen21 (Collaborator, Author) commented:

Rebased onto the peft branch and tested the forward pass.

@yingchen21 force-pushed the attn-qkv-proj branch 2 times, most recently from 6c4349d to 4acab6c on August 7, 2024.
@yingchen21 marked this pull request as ready for review on August 28, 2024.

@goliaro (Collaborator) left a review comment:

Great work! I left some comments. The code works locally on my machine, so I think we only need a bit of cleanup before we can merge. We also need to remove the unused functions from the attention files (.cu and .cpp) and the deprecated parameters (weight_ptr, bias_ptr). Once that is done, can you apply the disaggregation to the other models as well (OPT, MPT, Falcon, etc.)?

Resolved review threads:
  • inference/models/llama.cc (outdated)
  • inference/models/llama.cc
@@ -171,6 +188,23 @@ void LLAMA::create_llama_model(FFModel &ff,
}
}

Tensor mha_input = mha;
Reviewer comment (Collaborator):
Can we just reuse the same mha tensor as the input to the output projection?
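For illustration only, the suggestion amounts to dropping the intermediate alias; a minimal stub follows (hypothetical types, not the actual FlexFlow call):

    // Stand-in types for illustration; not FlexFlow's real API.
    struct Tensor { int dim; };
    Tensor dense(Tensor in, int out_dim) { (void)in; return Tensor{out_dim}; }

    Tensor output_projection(Tensor mha, int hidden_size) {
      // Before: Tensor mha_input = mha; return dense(mha_input, hidden_size);
      return dense(mha, hidden_size);  // after: feed mha directly into the output projection
    }

    int main() { output_projection(Tensor{4096}, 4096); return 0; }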

Other resolved review threads:
  • src/ops/inc_multihead_self_attention.cc (outdated)
  • src/ops/inc_multihead_self_attention.cc (outdated)
  • src/ops/inc_multihead_self_attention.cu (outdated)
  • src/ops/kernels/linear_kernels.cu (outdated)
  • src/ops/linear.cc (outdated)
  • src/runtime/request_manager.cc (outdated)
  • src/runtime/request_manager.cc (outdated)
@yingchen21 changed the base branch from peft to inference on September 11, 2024.
@goliaro mentioned this pull request on Sep 27, 2024.
@goliaro changed the title from "Attn qkv proj" to "Attention projections (QKV, O) disaggregation" on Oct 9, 2024.
@goliaro merged commit 96628b3 into inference on Oct 9, 2024. 39 checks passed.
@goliaro deleted the attn-qkv-proj branch on November 4, 2024.