Add GroupQueryAttention with KV-Cache #3425

Open

wants to merge 44 commits into base: develop
Conversation

@turneram (Contributor) commented Sep 6, 2024

No description provided.

turneram marked this pull request as ready for review September 11, 2024 19:10
@migraphx-bot (Collaborator)

| Test | Batch | Rate new (54ff0e) | Rate old (e230c0) | Diff | Compare |
| --- | --- | --- | --- | --- | --- |
| torchvision-resnet50 | 64 | 3,249.59 | 3,249.77 | -0.01% | |
| torchvision-resnet50_fp16 | 64 | 6,985.02 | 6,987.71 | -0.04% | |
| torchvision-densenet121 | 32 | 2,429.05 | 2,431.48 | -0.10% | |
| torchvision-densenet121_fp16 | 32 | 4,102.80 | 4,103.92 | -0.03% | |
| torchvision-inceptionv3 | 32 | 1,639.90 | 1,637.67 | 0.14% | |
| torchvision-inceptionv3_fp16 | 32 | 2,745.89 | 2,744.19 | 0.06% | |
| cadene-inceptionv4 | 16 | 779.12 | 779.19 | -0.01% | |
| cadene-resnext64x4 | 16 | 809.10 | 808.74 | 0.04% | |
| slim-mobilenet | 64 | 7,457.79 | 7,462.54 | -0.06% | |
| slim-nasnetalarge | 64 | 208.17 | 208.50 | -0.16% | |
| slim-resnet50v2 | 64 | 3,435.34 | 3,435.17 | 0.00% | |
| bert-mrpc-onnx | 8 | 1,147.62 | 1,150.08 | -0.21% | |
| bert-mrpc-tf | 1 | 308.71 | 314.23 | -1.76% | |
| pytorch-examples-wlang-gru | 1 | 396.79 | 420.51 | -5.64% | 🔴 |
| pytorch-examples-wlang-lstm | 1 | 381.23 | 495.59 | -23.08% | 🔴 |
| torchvision-resnet50_1 | 1 | 812.82 | 770.67 | 5.47% | 🔆 |
| cadene-dpn92_1 | 1 | 398.06 | 402.30 | -1.05% | |
| cadene-resnext101_1 | 1 | 381.02 | 381.59 | -0.15% | |
| onnx-taau-downsample | 1 | 343.77 | 343.63 | 0.04% | |
| dlrm-criteoterabyte | 1 | 35.04 | 35.05 | -0.05% | |
| dlrm-criteoterabyte_fp16 | 1 | 58.00 | 58.08 | -0.13% | |
| agentmodel | 1 | 8,074.94 | 8,076.83 | -0.02% | |
| unet_fp16 | 2 | 58.08 | 57.92 | 0.27% | |
| resnet50v1_fp16 | 1 | 931.57 | 935.45 | -0.42% | |
| resnet50v1_int8 | 1 | 944.13 | 956.44 | -1.29% | |
| bert_base_cased_fp16 | 64 | 1,154.61 | 1,153.21 | 0.12% | |
| bert_large_uncased_fp16 | 32 | 356.03 | 355.68 | 0.10% | |
| bert_large_fp16 | 1 | 211.76 | 211.87 | -0.05% | |
| distilgpt2_fp16 | 16 | 2,154.18 | 2,159.18 | -0.23% | |
| yolov5s | 1 | 531.07 | 533.70 | -0.49% | |
| tinyllama | 1 | 43.42 | 43.69 | -0.61% | |
| vicuna-fastchat | 1 | 177.91 | 172.04 | 3.41% | 🔆 |
| whisper-tiny-encoder | 1 | 418.10 | 417.90 | 0.05% | |
| whisper-tiny-decoder | 1 | 431.26 | 424.90 | 1.50% | |

This build is not recommended to merge 🔴

@migraphx-bot (Collaborator)


     ✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

     ✅ bert-mrpc-tf: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

     ✅ torchvision-resnet50_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-dpn92_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-resnext101_1: PASSED: MIGraphX meets tolerance

     ✅ dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance

     ✅ agentmodel: PASSED: MIGraphX meets tolerance

     ✅ unet: PASSED: MIGraphX meets tolerance

     ✅ resnet50v1: PASSED: MIGraphX meets tolerance

     ✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance

     🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output

     ✅ bert_large: PASSED: MIGraphX meets tolerance

     ✅ yolov5s: PASSED: MIGraphX meets tolerance

     ✅ tinyllama: PASSED: MIGraphX meets tolerance

     ✅ vicuna-fastchat: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance

     ✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance

}

template <class T, class U>
void apply_attention(T qkv,
Collaborator:

Need a description of all the arguments, including the fact that they're all iterators and therefore the contents are mutable (is that right?). Is output the only output? What gets populated, and what is the purpose of the content, i.e. is it a final result or an intermediate step? It appears this is only a helper function designed to keep the compute() method from getting too large, correct?

args[0] = args[0].reshape(shape{output_shape_0.type(),
{batch_size,
sequence_length,
static_cast<std::size_t>(num_heads + 2 * kv_num_heads),
Collaborator:

Description needed of what's being done with this dimension. It looks important.
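
One hedged reading of this reshape (not the author's description, and the trailing head_size dimension is assumed since the excerpt is truncated): with a packed QKV input, the hidden dimension holds num_heads query heads followed by kv_num_heads key heads and kv_num_heads value heads, so the reshape exposes that packing as an explicit head axis that can then be sliced into Q, K, and V. A minimal sketch:

#include <cstddef>
#include <vector>

// Illustrative only: the packed hidden dimension
//   (num_heads + 2 * kv_num_heads) * head_size
// is split into an explicit head axis plus a per-head width.
std::vector<std::size_t> packed_qkv_lens(std::size_t batch_size,
                                         std::size_t sequence_length,
                                         std::size_t num_heads,
                                         std::size_t kv_num_heads,
                                         std::size_t head_size)
{
    return {batch_size, sequence_length, num_heads + 2 * kv_num_heads, head_size};
}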

{
std::string name() const { return "instructions_tuple"; }

shape compute_shape(const std::vector<shape>& inputs) const { return shape(inputs); }
Collaborator:

Is this shape constructor deprecated?

@@ -64,6 +64,7 @@
#include <migraphx/op/gathernd.hpp>
#include <migraphx/op/get_tuple_elem.hpp>
#include <migraphx/op/greater.hpp>
#include <migraphx/op/group_query_attention.hpp>
Collaborator:

Is instructions_tuple a new op as well?

nextafterf(x, numeric_max<T>()) >= y;
}

template <class T>
Collaborator:

Suggested change:

-template <class T>
+/**
+ * Calculate softmax function in-place in array score.
+ */
+template <class T>

shape compute_shape(std::vector<shape> inputs) const
{
auto query_lens = inputs.front().lens();
std::vector<std::size_t> output_lens{query_lens.at(0), num_heads, query_lens.at(2), 4096};
Collaborator:

Do we want to keep this magic number?

Collaborator:

I'm curious where it came from.

Contributor (author):

Just an artifact of early hard-coding that I happened to miss because for llama2 it always ends up being 4096.
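
A minimal sketch of how the literal could be removed, assuming the intended value is the present KV-cache sequence length that the operator already carries as its present_kv_seqlen attribute (seen in the attribute parsing elsewhere in this PR); illustrative only, not the author's fix:

shape compute_shape(std::vector<shape> inputs) const
{
    auto query_lens = inputs.front().lens();
    // Derive the last dimension from the operator's attribute instead of hard-coding 4096.
    std::vector<std::size_t> output_lens{query_lens.at(0),
                                         static_cast<std::size_t>(num_heads),
                                         query_lens.at(2),
                                         static_cast<std::size_t>(present_kv_seqlen)};
    return {inputs.front().type(), output_lens};
}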

};
MIGRAPHX_REGISTER_OP(gpu_concat_past_present);

struct find_group_query_attention
Collaborator:

What is the effect of this matcher? Is it here because a group_query_attention by itself won't work?

dims=tsl_val.shape,
vals=tsl_val.astype(int))
cc_val = np.ones([4096, 64], dtype=np.float16)
cos_cache = helper.make_tensor(name="cos_cache",
Collaborator:

What sort of values would cos_cache and sin_cache hold in a realistic scenario?

Collaborator:

Partial answer: these are (I think) rotation matrices used in Rotary Position Embedding (RoPE), as described in "RoFormer: Enhanced Transformer with Rotary Position Embedding", one of several possible positional embedding schemes that can be used for attention models. Relative, as opposed to absolute, position embedding is a key feature of GQA.
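
For concreteness, a minimal sketch (not MIGraphX code) of the values a realistic cache would hold, following the RoFormer formulation theta_i = base^(-2i / rotary_dim) and cos_cache[m][i] = cos(m * theta_i). Under that reading the [4096, 64] shape in the test would be [max_sequence_length, rotary_dim / 2], and the all-ones data is presumably just a placeholder:

#include <cmath>
#include <cstddef>
#include <vector>

// Builds a cos cache of shape [max_seq_len, rotary_dim / 2], flattened row-major;
// the sin cache is identical with std::sin in place of std::cos.
std::vector<float> make_cos_cache(std::size_t max_seq_len, std::size_t rotary_dim)
{
    const float base = 10000.0f;
    std::vector<float> cache(max_seq_len * (rotary_dim / 2));
    for(std::size_t m = 0; m < max_seq_len; ++m)
    {
        for(std::size_t i = 0; i < rotary_dim / 2; ++i)
        {
            const float theta = std::pow(base, -2.0f * static_cast<float>(i) /
                                                   static_cast<float>(rotary_dim));
            cache[m * (rotary_dim / 2) + i] = std::cos(static_cast<float>(m) * theta);
        }
    }
    return cache;
}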

auto rotary_interleaved = v.at("rotary_interleaved").to<int>();
assert(v.contains("scale"));
auto scale = v.at("scale").to<float>();
assert(v.contains("present_kv_seqlen"));
Collaborator:

There is no need to assert when you are already calling at().

transposed_qkv = mpm.get_module().insert_instruction(
ins, make_op("transpose", {{"permutation", {0, 2, 1, 3}}}), transposed_qkv);
transposed_qkv =
mpm.get_module().insert_instruction(ins, make_op("contiguous"), transposed_qkv);
Collaborator:

Why is a contiguous inserted?

const int kv_num_heads = params.kv_num_heads;
const int packed_batch_stride = (num_heads + 2 * kv_num_heads) * sequence_length * head_size;
const int kv_num_heads_factor = num_heads / kv_num_heads;
const size_t q_input_chunk_length = static_cast<size_t>(sequence_length) * head_size; // S x H
Collaborator:

Why static_cast to size_t when you can just declare the original variable as size_t?

const int batch_size = params.batch_size;
const int sequence_length = params.sequence_length;
const int head_size = params.head_size;
const size_t present_buffer_sequence_length = params.seqlen_present_kv_cache;
Collaborator:

Use index_int instead of size_t.

int rotary_interleaved;
int past_present_share_buffer;

__host__ __device__ void print() const
Collaborator:

Use operator<< instead, similar to how the shape class is used:

template <class Stream>
friend constexpr const Stream& operator<<(const Stream& ss, const gqa_parameters& gp)
{
    ss << "scale: " << gp.scale << "\n";
    ...
    return ss;
}

This way we can print this using our print function.

};

template <class S, class... Ts>
__device__ gqa_parameters make_gqa_params(S s, Ts... ts)
Collaborator:

This function doesn't seem necessary since this class doesn't have any template parameters; you can just construct it directly.

Collaborator:

Actually, are these variables known at compile time? If that's the case, then we should make them all integral constants.
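
A generic illustration of the suggestion (plain C++, not MIGraphX's kernel-generation machinery; the struct and parameter names are made up): values fixed at compile time can travel as integral constants so the compiler can fold them, instead of being plain runtime ints.

#include <type_traits>

template <int NumHeads, int KvNumHeads>
struct gqa_compile_time_params
{
    using num_heads    = std::integral_constant<int, NumHeads>;
    using kv_num_heads = std::integral_constant<int, KvNumHeads>;
    // The grouping factor becomes a compile-time constant as well.
    static constexpr int kv_num_heads_factor = NumHeads / KvNumHeads;
};

// Usage: gqa_compile_time_params<32, 8>::kv_num_heads_factor == 4 at compile time.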

int kv_num_heads;
int local_window_size;
int rotary_interleaved;
int past_present_share_buffer;
Collaborator:

Please initialize all these variables.

float beta;

template <class C, class A, class B>
__device__ void compute(C cmat, const A amat, const B bmat, const std::size_t idx)
Collaborator:

Is it possible to pass 2D tensor_views to do the gemms instead?
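
A generic sketch of what passing 2D views could look like (plain C++, not MIGraphX's tensor_view API; the view2d type and gemm signature are made up for illustration): the strides travel with the data, so separate lda/ldb/ldc members are no longer needed.

#include <cstddef>

// Minimal row-major 2D view: stride is the leading dimension.
template <class T>
struct view2d
{
    T* data;
    std::size_t rows, cols, stride;
    T& operator()(std::size_t i, std::size_t j) const { return data[i * stride + j]; }
};

// C = alpha * A * B + beta * C, with every layout detail carried by the views.
template <class C, class A, class B>
void gemm(C c, A a, B b, float alpha, float beta)
{
    for(std::size_t i = 0; i < c.rows; ++i)
        for(std::size_t j = 0; j < c.cols; ++j)
        {
            float acc = 0.0f;
            for(std::size_t k = 0; k < a.cols; ++k)
                acc += a(i, k) * b(k, j);
            c(i, j) = alpha * acc + beta * c(i, j);
        }
}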

std::size_t _k;
std::size_t lda;
std::size_t ldb;
std::size_t ldc;
Collaborator:

Use index_int instead of std::size_t.


apply_map.emplace("gpu::compute_attention_probabilities", [=](instruction_ref ins) {
auto s = ins->get_shape().sub_shapes().front();
auto output = insert_allocation(ins, s);
Collaborator:

I don't think this allocation will be handled by memory coloring. Rather than use a tuple, you can just use identity operators to ensure the order when accessing the buffer.

{
for(int i = 0; i < d; i++)
{
y[i] = 1.0f / static_cast<float>(d);
Collaborator:

Odd, when can we ever get into this case? The sum of exponentials should never become negative and should never converge to zero, even if x[i] is negative, since e^x is never negative and you're doing things in place.

{
if(max < x[i])
max = x[i];
}
Collaborator:

I'm going to assume we have no way of knowing the order here, or of getting this input sorted, to avoid an extra O(d) of checks.

template <class T>
__device__ bool float_equal(T x, T y)
{
return isfinite(x) and isfinite(y) and nextafterf(x, numeric_lowest<T>()) <= y and
Collaborator:

I don't think you need the isfinite here: if you remove the max value, then when the max is, say, infinity, you get e^-inf, which is zero in this case.

@@ -4455,6 +4455,113 @@ def group_norm_invalid_bias_shape_test():
return group_norm_test([1, 4, 3, 3], [2], [3], [1, 4, 3, 3], 2)


@onnx_test()
def group_query_attention_test():
Collaborator:

Add another case without some of the parameters to test the defaults.

@TedThemistokleous (Collaborator) left a comment:

A few comments and questions.

I'm curious about the softmax step and the additional tests. Paul and Brian's comments seem to cover a few other things I was curious about.

Labels: Perf Improve, roadmap (Tasks to finish for a release)
Projects: None yet
6 participants