Add cache rotation inputs and CPU kernel implementation for cache rotation #27088
base: master
Conversation
Force-pushed from 2a172b2 to c071571
```cpp
pa_arguments.insert(pa_arguments.begin() + 13, v0::Constant::create(element::f32, Shape{0}, {}));
pa_arguments.insert(pa_arguments.begin() + 14, v0::Constant::create(element::i32, Shape{0}, {}));
```
If you make these inputs really optional, these two lines are not required.
Done
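For context, a minimal sketch of what "really optional" could mean on the op side, assuming validation branches on the actual input count instead of requiring padded placeholder constants (the function body here is illustrative, not the PR's exact code):

```cpp
// Hypothetical sketch: treat the trailing inputs as optional by checking
// the actual input count instead of appending empty placeholder constants.
void validate_and_infer_types() {
    const bool has_rotation_inputs = get_input_size() > 13;
    if (has_rotation_inputs) {
        // validate rotation_coefficients (input 13) and rotated_block_indices (input 14)
    }
    // ... remaining validation is unchanged ...
}
```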
src/core/src/op/paged_attention.cpp
Outdated
```cpp
get_input_partial_shape(13).rank().is_dynamic() ||
get_input_partial_shape(13).rank().get_length() == 0 ||
get_input_partial_shape(13).rank().get_length() == 1,
"Input `rotation_coefficients` should either have an empty shape or rank 1, but it has rank ",
```
"Input `rotation_coefficients` should either have an empty shape or rank 1, but it has rank ", | |
"Input `rotation_coefficients` should either have rank 1 or omitted, but it has rank ", |
"Empty" shape means [0]
here, which have rank 1.
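For illustration (not code from the PR), the distinction is:

```cpp
ov::Shape empty_shape{0};   // one dimension of extent 0 -> rank 1, zero elements
ov::Shape scalar_shape{};   // no dimensions at all      -> rank 0, one element
// ov::shape_size(empty_shape)  == 0, empty_shape.size()  == 1
// ov::shape_size(scalar_shape) == 1, scalar_shape.size() == 0
```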
Done
src/core/src/op/paged_attention.cpp
Outdated
```cpp
NODE_VALIDATION_CHECK(
    this,
    get_input_partial_shape(13).rank().is_dynamic() ||
    get_input_partial_shape(13).rank().get_length() == 0 ||
```
Suggested change (drop the rank-0 alternative):

```diff
- get_input_partial_shape(13).rank().get_length() == 0 ||
```
Done
src/core/src/op/paged_attention.cpp
Outdated
```cpp
get_input_partial_shape(14).rank().get_length() == 0 ||
get_input_partial_shape(14).rank().get_length() == 1,
"Input `rotated_block_indices` should either have an empty shape or rank 1 but it has rank ",
```
The same comments as for input 13 above apply here.
Done
```cpp
@@ -1576,6 +1591,11 @@ struct AttentionExecutor : public PagedAttentionExecutor {
    if (alibi_slopes) {
        alibi_slopes.assert_dims({H});
    }

    if (rotated_block_indices) {
        // Rotation, and cache eviction, is limited to cases when Q, K and V embedding sizes are equal, e.g. S == Sv
```
We already have cases where they are not equal: minicpm-3.
Removed - realized that we don't need that limitation for cache rotation since we only rotate the K values
```cpp
@@ -58,6 +59,10 @@ static void CreatePagedAttentionExtensionOp(ProgramBuilder& p, const std::shared
    OPENVINO_ASSERT(alibi_const != nullptr);
    prim.has_alibi = ov::shape_size(alibi_const->get_output_shape(0)) > 0;

    std::shared_ptr<ov::op::v0::Constant> rotation_coefficients_const = std::dynamic_pointer_cast<ov::op::v0::Constant>(op->get_input_node_shared_ptr(rotation_coefficients_idx));
    OPENVINO_ASSERT(rotation_coefficients_const != nullptr);
    prim.has_rotation_coefficients = ov::shape_size(alibi_const->get_output_shape(0)) > 0;
```
`alibi_const` shouldn't be used here -- bad copy&paste?
Fixed, thanks.
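Presumably the fix is along these lines (a sketch of the likely correction, not the exact committed code):

```cpp
// use the rotation coefficients constant, not alibi_const
prim.has_rotation_coefficients = ov::shape_size(rotation_coefficients_const->get_output_shape(0)) > 0;
```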
Force-pushed from d90e212 to ed46cfe
@luo-cheng2021 Please review CPU PA changes.
```cpp
CT cache_value_1 = *cache_value_1_ptr;

*cache_value_0_ptr = cache_value_0 * rotation_value_cos - cache_value_1 * rotation_value_sin;
*cache_value_1_ptr = cache_value_0 * rotation_value_sin + cache_value_1 * rotation_value_cos;
```
Is the algorithm the same as in the following code?
openvino/src/plugins/intel_cpu/src/nodes/rope.cpp, lines 158 to 161 (at c4d6d2b):

```cpp
auto src0 = src[i];
auto src1 = src[i + half_rotary_dims];
dst[i] = cos[i] * src0 - sin[i] * src1;
dst[i + half_rotary_dims] = cos[i + half_rotary_dims] * src1 + sin[i + half_rotary_dims] * src0;
```
If so, the following code can be used as reference:
openvino/src/plugins/intel_cpu/src/nodes/rope.cpp, lines 35 to 102 (at c4d6d2b):

```cpp
static std::shared_ptr<kernel::JitKernelBase> createJitKernel(const jit_rotary_compile_params& param, bool check_vec_size2 = false) {
    std::shared_ptr<kernel::JitKernelBase> res;
    MAYBE_UNUSED(param);
    MAYBE_UNUSED(check_vec_size2);
#if defined(OPENVINO_ARCH_X86_64)
    if (dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx512_core)) {
        bool flag = true;
        if (check_vec_size2) {
            auto vec_size = jit_rotary_kernel<dnnl::impl::cpu::x64::avx512_core>::vec_size;
            if (param.rotary_ndims % (vec_size * 2) != 0)
                flag = false;
        }
        if (flag)
            res = std::make_shared<jit_rotary_kernel<dnnl::impl::cpu::x64::avx512_core>>(param);
    } else if (dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx2)) {
        bool flag = true;
        if (check_vec_size2) {
            auto vec_size = jit_rotary_kernel<dnnl::impl::cpu::x64::avx2>::vec_size;
            if (param.rotary_ndims % (vec_size * 2) != 0)
                flag = false;
        }
        if (flag)
            res = std::make_shared<jit_rotary_kernel<dnnl::impl::cpu::x64::avx2>>(param);
    }
    if (res)
        res->create_kernel();
#endif  // OPENVINO_ARCH_X86_64
    return res;
}

static void execJitKernel(const std::shared_ptr<kernel::JitKernelBase>& ker, const void* src, void* dst, const float* cos, const float* sin) {
    MAYBE_UNUSED(ker);
    MAYBE_UNUSED(src);
    MAYBE_UNUSED(dst);
    MAYBE_UNUSED(cos);
    MAYBE_UNUSED(sin);
#if defined(OPENVINO_ARCH_X86_64)
    jit_rotary_call_args call_args;
    call_args.src = src;
    call_args.cos = cos;
    call_args.sin = sin;
    call_args.dst = dst;
    (*ker)(&call_args);
#endif  // OPENVINO_ARCH_X86_64
}

template <typename T>
struct RoPE::RoPEExecutorRotateHalf : public RoPE::Executor {
    const op::internal::RoPE::Config& m_config;
    std::shared_ptr<kernel::JitKernelBase> m_rotaryKernel;

    RoPEExecutorRotateHalf(const op::internal::RoPE::Config& config) : m_config(config) {
        jit_rotary_compile_params jcp;
        jcp.src_prc = precision_of<T>::value;
        jcp.dst_prc = precision_of<T>::value;
        jcp.rotary_ndims = config.rotary_ndims;
        jcp.interleave = false;
        m_rotaryKernel = createJitKernel(jcp);
    }
```
I have already written and tested my implementation; besides, the code you've sent probably cannot be reused without modifications or bulky instantiations.
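For reference, both snippets reduce to the same per-pair 2D rotation; a minimal scalar sketch of the `rotate_half` pairing (variable names here are illustrative, not from either file):

```cpp
// x-like values occupy the first half of the embedding vector, y-like the second.
// Each (x, y) pair is rotated by its per-position angle theta:
//   x' = x * cos(theta) - y * sin(theta)
//   y' = x * sin(theta) + y * cos(theta)
for (size_t i = 0; i < embedding_size / 2; ++i) {
    float x = values[i];
    float y = values[i + embedding_size / 2];
    values[i]                      = x * cos_coeffts[i] - y * sin_coeffts[i];
    values[i + embedding_size / 2] = x * sin_coeffts[i] + y * cos_coeffts[i];
}
```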
```cpp
template<class CT>
inline static void rotate_kv_cache_block_hw(CT* cache_block_ptr, float* block_rotation_coefficients_ptr, size_t num_heads, size_t block_size, size_t embedding_size) {
#if !defined(HAVE_AVX2) && !defined(HAVE_AVX512F)
```
It would be cleaner if `rotate_kv_cache_block_hw` and `rotate_kv_cache_block_sw` were merged, letting `rotate_kv_cache_chunk_xxx` handle the tails, as sketched below.
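A sketch of the suggested structure, with the tail folded into the chunk routine (only the function naming follows the discussion; signatures and loop bounds are assumed):

```cpp
template <class CT>
static void rotate_kv_cache_block(CT* cache_block_ptr, float* coeffts_ptr,
                                  size_t num_heads, size_t block_size, size_t embedding_size) {
    constexpr size_t vec_len = 16;  // e.g. f32 lanes in an AVX-512 register
    const size_t half = embedding_size / 2;
    size_t i = 0;
    // full-width vectorized chunks
    for (; i + vec_len <= half; i += vec_len)
        rotate_kv_cache_chunk_xxx(/* ..., */ i, vec_len, /*is_tail=*/false);
    // leftover elements ("tail") go through the same chunk routine
    if (i < half)
        rotate_kv_cache_chunk_xxx(/* ..., */ i, half - i, /*is_tail=*/true);
}
```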
I need HW and SW available as separate functions for testing purposes.
The test style the CPU plugin uses is layer/subgraph tests (see the SDPA samples) for coverage, since the CI infrastructure already covers avx2/avx512/amx; there is no need to create a new test structure, so we'd better create a subgraph test just like SDPA.
But the PagedAttention node is a little special: it has no reference implementation yet, which means there is no reference result to compare against; I think the reference will be coming.
Based on the current status, I suggest removing the new test module and filing a ticket to record the follow-up.
Correct me if I'm wrong. @dmitry-gorokhov @slyalin
I personally don't mind splitting the vector and scalar impls into separate functions. I would say it is even better from a code readability standpoint.
Regarding unit tests - that is actually good developer practice. The fact that we haven't implemented unit tests for intrinsics optimizations doesn't mean we should prohibit them altogether. That seems to be convenient for developer purposes, so I am happy to have such an infrastructure.
Force-pushed from afa851c to 8a355f3
Force-pushed from 8a355f3 to a33f255
Force-pushed from 138de47 to a0818c2
#include "common_test_utils/test_assertions.hpp" | ||
#include "common_test_utils/type_prop.hpp" | ||
#include "openvino/openvino.hpp" | ||
#include "openvino/opsets/opset13.hpp" |
Remove it and include the required operator headers directly instead.
Done
Still not changed but can be optionally done later.
Removed the opset13.hpp and replaced it with the direct header for Parameter
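For reference, the swap looks roughly like this (the direct Parameter header path is the standard OpenVINO one):

```cpp
// before: pulls in the entire opset
// #include "openvino/opsets/opset13.hpp"

// after: only the operator header the test actually uses
#include "openvino/op/parameter.hpp"
```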
Force-pushed from 79a4ab1 to 3b775c4
#include "common_test_utils/test_assertions.hpp" | ||
#include "common_test_utils/type_prop.hpp" | ||
#include "openvino/openvino.hpp" | ||
#include "openvino/opsets/opset13.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not changed but can be optionally done later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this file required? If yes, add a comment explaining why it should not be removed, because it currently has only an empty test.
Added the comment inside the file in question.
```cpp
float* current_rotation_coeffts_cos_ptr,
float* current_rotation_coeffts_sin_ptr,
size_t num_vectorized_elements_per_iteration,
size_t is_underutilizing) {
```
There is a common (in the industry) term, "tail", for processing the remaining elements. So I would propose renaming this to "is_tail" for better readability.
Done
```cpp
rotate_kv_cache_block_hw(cache_block_ptr, block_rotation_coefficients_ptr, num_heads, block_size, embedding_size);
#else
rotate_kv_cache_block_sw(cache_block_ptr, block_rotation_coefficients_ptr, num_heads, block_size, embedding_size);
```
Not sure the hw/sw suffix is the clearest naming. Usually devs prefer opt/ref or vector/scalar.
Done with opt/ref
```cpp
auto cache_values_x = _mm512_undefined_ps();
auto cache_values_y = _mm512_undefined_ps();

if (!is_underutilizing) {
```
The code sample @luo-cheng2021 is referring to is a common approach for vectorization w/o additional perf penalties.
@vshampor I would highly recommend trying the proposed way and comparing the performance (not necessarily in the bounds of this PR, this is rather a follow-up task). It should have some impact on loops with a small number of iterations.
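For illustration, the common masked-tail approach on AVX-512 looks roughly like this (assuming f32 values; a sketch, not this PR's code):

```cpp
#include <immintrin.h>

// Load/store up to 16 floats; lanes beyond `count` are masked out, so the
// tail needs no separate scalar loop or undefined-value registers.
static inline __m512 load_partial_f32(const float* ptr, size_t count) {
    const __mmask16 mask = static_cast<__mmask16>((1u << count) - 1u);
    return _mm512_maskz_loadu_ps(mask, ptr);  // masked-out lanes are zeroed
}

static inline void store_partial_f32(float* ptr, __m512 v, size_t count) {
    const __mmask16 mask = static_cast<__mmask16>((1u << count) - 1u);
    _mm512_mask_storeu_ps(ptr, mask, v);  // only the first `count` lanes written
}
```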
Seems like the level-zero submodule change should be reverted.
Done
Force-pushed from 3b775c4 to d097b9f
Force-pushed from f0d55a0 to e350b5b
Force-pushed from e350b5b to 6e806a9
Force-pushed from 6e806a9 to ffcc5ff
Force-pushed from 0e3ce47 to 1814957
Force-pushed from 1814957 to 2b54673
```cpp
for (size_t embedding_pair_idx = 0; embedding_pair_idx < embedding_size / 2; embedding_pair_idx++) {
    // NB: below is the llama-style rotation (x-like values are in the first half of the embedding vector,
    // y-like values are in the second half), which is different from the original RoFormer style (x- and y-
    // values are interleaved), but still preserves the relative positional encoding property
```
> but still preserves the relative positional encoding property

Do you mean it works correctly even if a model uses the interleaved mode?
The invariant only holds if we consider the use case starting from model-training time, i.e. models with the RoFormer-like "interleaved" RoPE are supposed to be trainable up to roughly the same accuracy as models with the llama-like "rotate_half" RoPE. At runtime the two are not interchangeable, of course.
Right now that is also a limitation of this cache rotation implementation - only the `rotate_half` version is supported, and for proper functioning across models we should implement the rest of the RoPE variants, as in `RoPEExecutorRotateHalf` (rope.cpp).
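To make the layout difference concrete (`d` is the rotary embedding size; illustration only):

```cpp
// rotate_half (llama-style): halves are paired across the vector
//   [x0, x1, ..., x_{d/2-1}, y0, y1, ..., y_{d/2-1}]   pair i = (v[i], v[i + d/2])
// interleaved (RoFormer-style): pairs are adjacent
//   [x0, y0, x1, y1, ..., x_{d/2-1}, y_{d/2-1}]        pair i = (v[2*i], v[2*i + 1])
```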
```cpp
auto rotation_coefficients = setName(std::make_shared<v0::Parameter>(element::f32, PartialShape{-1}),
                                     "rotation_coefficients." + std::to_string(layer_index - 1));
auto rotated_block_indices = setName(std::make_shared<v0::Parameter>(element::i32, PartialShape{-1}),
                                     "rotated_block_indices." + std::to_string(layer_index - 1));
```
Please specify an explicit layout for `rotation_coefficients`. There are more details in the spec presentation, but there is room for improvement in terms of clarity: what are these coefficients, and how are they packed? Is it really interleaved/non-interleaved agnostic? It is not obvious from the n-dim rotation math.
Expressed my concerns about inefficient coefficient packing for PA in the related GenAI PR: openvinotoolkit/openvino.genai#987.
Tickets: 153783