
Add cache rotation inputs and CPU kernel implementation for cache rotation #27088

Status: Open. Wants to merge 13 commits into base branch master from token_rotation.

Conversation

vshampor (Contributor):

Tickets: 153783

@github-actions github-actions bot added category: Core OpenVINO Core (aka ngraph) category: GPU OpenVINO GPU plugin category: CPU OpenVINO CPU plugin category: transformations OpenVINO Runtime library - Transformations category: CPP API OpenVINO CPP API bindings labels Oct 16, 2024
src/core/src/op/paged_attention.cpp (outdated, resolved)
Comment on lines 418 to 419
pa_arguments.insert(pa_arguments.begin() + 13, v0::Constant::create(element::f32, Shape{0}, {}));
pa_arguments.insert(pa_arguments.begin() + 14, v0::Constant::create(element::i32, Shape{0}, {}));
Contributor @slyalin (Oct 28, 2024):

If you make these inputs really optional, these two lines are not required.

Contributor Author:

Done

get_input_partial_shape(13).rank().is_dynamic() ||
get_input_partial_shape(13).rank().get_length() == 0 ||
get_input_partial_shape(13).rank().get_length() == 1,
"Input `rotation_coefficients` should either have an empty shape or rank 1, but it has rank ",
Contributor:

Suggested change:
- "Input `rotation_coefficients` should either have an empty shape or rank 1, but it has rank ",
+ "Input `rotation_coefficients` should either have rank 1 or be omitted, but it has rank ",

"Empty" shape means [0] here, which have rank 1.

Contributor Author:

Done

NODE_VALIDATION_CHECK(
this,
get_input_partial_shape(13).rank().is_dynamic() ||
get_input_partial_shape(13).rank().get_length() == 0 ||
Contributor:

Suggested change (remove this line):
- get_input_partial_shape(13).rank().get_length() == 0 ||

Contributor Author:

Done

Comment on lines 167 to 169
get_input_partial_shape(14).rank().get_length() == 0 ||
get_input_partial_shape(14).rank().get_length() == 1,
"Input `rotated_block_indices` should either have an empty shape or rank 1 but it has rank ",
Contributor:

The same comments as for input 13 above apply here.

Contributor Author:

Done

@@ -1576,6 +1591,11 @@ struct AttentionExecutor : public PagedAttentionExecutor {
if (alibi_slopes) {
alibi_slopes.assert_dims({H});
}

if (rotated_block_indices) {
// Rotation, and cache eviction, is limited to cases when Q, K and V embedding sizes are equal, e.g. S == Sv
Contributor @slyalin (Oct 28, 2024):

We already have cases where they are not: minicpm-3

Contributor Author:

Removed; realized that we don't need that limitation for cache rotation, since we only rotate the K values.

@@ -58,6 +59,10 @@ static void CreatePagedAttentionExtensionOp(ProgramBuilder& p, const std::shared
OPENVINO_ASSERT(alibi_const != nullptr);
prim.has_alibi = ov::shape_size(alibi_const->get_output_shape(0)) > 0;

std::shared_ptr<ov::op::v0::Constant> rotation_coefficients_const = std::dynamic_pointer_cast<ov::op::v0::Constant>(op->get_input_node_shared_ptr(rotation_coefficients_idx));
OPENVINO_ASSERT(rotation_coefficients_const != nullptr);
prim.has_rotation_coefficients = ov::shape_size(alibi_const->get_output_shape(0)) > 0;
Contributor:

alibi_const shouldn't be used here -- bad copy&paste?

Contributor Author:

Fixed, thanks.

@github-actions github-actions bot added the category: build OpenVINO cmake script / infra label Oct 30, 2024
@vshampor vshampor changed the title Add cache rotation inputs Add cache rotation inputs and CPU kernel implementation for cache rotation Nov 12, 2024
Contributor @dmitry-gorokhov:

@luo-cheng2021 Please review CPU PA changes.

CT cache_value_1 = *cache_value_1_ptr;

*cache_value_0_ptr = cache_value_0 * rotation_value_cos - cache_value_1 * rotation_value_sin;
*cache_value_1_ptr = cache_value_0 * rotation_value_sin + cache_value_1 * rotation_value_cos;
Contributor @luo-cheng2021:

Is the algorithm the same as the following code?

auto src0 = src[i];
auto src1 = src[i + half_rotary_dims];
dst[i] = cos[i] * src0 - sin[i] * src1;
dst[i + half_rotary_dims] = cos[i + half_rotary_dims] * src1 + sin[i + half_rotary_dims] * src0;

If so, the following code can be used as reference:
static std::shared_ptr<kernel::JitKernelBase> createJitKernel(const jit_rotary_compile_params& param, bool check_vec_size2 = false) {
std::shared_ptr<kernel::JitKernelBase> res;
MAYBE_UNUSED(param);
MAYBE_UNUSED(check_vec_size2);
#if defined(OPENVINO_ARCH_X86_64)
if (dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx512_core)) {
bool flag = true;
if (check_vec_size2) {
auto vec_size = jit_rotary_kernel<dnnl::impl::cpu::x64::avx512_core>::vec_size;
if (param.rotary_ndims % (vec_size * 2) != 0)
flag = false;
}
if (flag)
res = std::make_shared<jit_rotary_kernel<dnnl::impl::cpu::x64::avx512_core>>(param);
} else if (dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx2)) {
bool flag = true;
if (check_vec_size2) {
auto vec_size = jit_rotary_kernel<dnnl::impl::cpu::x64::avx2>::vec_size;
if (param.rotary_ndims % (vec_size * 2) != 0)
flag = false;
}
if (flag)
res = std::make_shared<jit_rotary_kernel<dnnl::impl::cpu::x64::avx2>>(param);
}
if (res)
res->create_kernel();
#endif // OPENVINO_ARCH_X86_64
return res;
}
static void execJitKernel(const std::shared_ptr<kernel::JitKernelBase>& ker, const void* src, void* dst, const float* cos, const float* sin) {
MAYBE_UNUSED(ker);
MAYBE_UNUSED(src);
MAYBE_UNUSED(dst);
MAYBE_UNUSED(cos);
MAYBE_UNUSED(sin);
#if defined(OPENVINO_ARCH_X86_64)
jit_rotary_call_args call_args;
call_args.src = src;
call_args.cos = cos;
call_args.sin = sin;
call_args.dst = dst;
(*ker)(&call_args);
#endif // OPENVINO_ARCH_X86_64
}
template <typename T>
struct RoPE::RoPEExecutorRotateHalf : public RoPE::Executor {
const op::internal::RoPE::Config& m_config;
std::shared_ptr<kernel::JitKernelBase> m_rotaryKernel;
RoPEExecutorRotateHalf(const op::internal::RoPE::Config& config) : m_config(config) {
jit_rotary_compile_params jcp;
jcp.src_prc = precision_of<T>::value;
jcp.dst_prc = precision_of<T>::value;
jcp.rotary_ndims = config.rotary_ndims;
jcp.interleave = false;
m_rotaryKernel = createJitKernel(jcp);
}

Contributor Author:

I have already written and tested my implementation; besides, the code you've sent probably cannot be reused without modifications or bulky instantiations.


template<class CT>
inline static void rotate_kv_cache_block_hw(CT* cache_block_ptr, float* block_rotation_coefficients_ptr, size_t num_heads, size_t block_size, size_t embedding_size) {
#if !defined(HAVE_AVX2) && !defined(HAVE_AVX512F)
Contributor:

It would be cleaner if rotate_kv_cache_block_hw and rotate_kv_cache_block_sw were merged, letting rotate_kv_cache_chunk_xxx handle the tails.

Contributor Author:

I need HW and SW available as separate functions for testing purposes.

Contributor @luo-cheng2021 (Nov 21, 2024):

The test style the CPU plugin uses is layer/subgraph tests (like the SDPA sample) for coverage, since the CI infrastructure covers avx2/avx512/amx; there is no need to create a new test structure, and we'd better create a subgraph test just like SDPA's.
But the PagedAttention node is a little special: there is no reference implementation now, which means no reference result to compare against; I think the reference will be coming.
Based on the current status, I suggest removing the new test module and filing a ticket to record the follow-up.
Correct me if I'm wrong. @dmitry-gorokhov @slyalin

Contributor @dmitry-gorokhov (Nov 27, 2024):

I personally don't mind splitting the vector and scalar impls into separate functions. I would say it is even better from a code readability standpoint.

Regarding unit tests - that is actually good developer practice. The fact that we haven't implemented unit tests for intrinsics optimizations doesn't mean we should prohibit them altogether. This seems to be convenient for developer purposes, so I am happy to have such an infrastructure.
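The kind of developer-level unit test being discussed can be sketched as follows: run the optimized and reference rotation paths on identical random inputs and compare elementwise. All names here are illustrative, not the PR's actual API; the chunk-plus-tail loop merely stands in for the real vectorized kernel:

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Scalar reference: rotate each (x, y) pair of a vector, rotate_half pairing.
void rotate_ref(float* v, const float* cs, const float* sn, size_t half) {
    for (size_t i = 0; i < half; ++i) {
        const float x = v[i], y = v[i + half];
        v[i] = x * cs[i] - y * sn[i];
        v[i + half] = x * sn[i] + y * cs[i];
    }
}

// "Optimized" stand-in: same math, processed in fixed-width chunks with a
// tail, mimicking the structure of a vectorized kernel.
void rotate_opt(float* v, const float* cs, const float* sn, size_t half) {
    constexpr size_t kChunk = 8;  // stand-in for the SIMD register width
    size_t i = 0;
    for (; i + kChunk <= half; i += kChunk)
        for (size_t j = i; j < i + kChunk; ++j) {
            const float x = v[j], y = v[j + half];
            v[j] = x * cs[j] - y * sn[j];
            v[j + half] = x * sn[j] + y * cs[j];
        }
    for (; i < half; ++i) {  // tail: remaining half % kChunk pairs
        const float x = v[i], y = v[i + half];
        v[i] = x * cs[i] - y * sn[i];
        v[i + half] = x * sn[i] + y * cs[i];
    }
}

// The test pattern: identical random inputs through both paths, then an
// elementwise comparison within a tolerance.
bool opt_matches_ref(size_t embedding_size, unsigned seed = 42) {
    const size_t half = embedding_size / 2;
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<float> a(embedding_size), cs(half), sn(half);
    for (auto& x : a) x = dist(rng);
    for (auto& x : cs) x = dist(rng);
    for (auto& x : sn) x = dist(rng);
    std::vector<float> b = a;
    rotate_ref(a.data(), cs.data(), sn.data(), half);
    rotate_opt(b.data(), cs.data(), sn.data(), half);
    for (size_t i = 0; i < embedding_size; ++i)
        if (std::fabs(a[i] - b[i]) > 1e-6f) return false;
    return true;
}
```

Choosing an embedding size that is not a multiple of the chunk width exercises the tail path as well as the main loop.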

@github-actions github-actions bot added category: Python API OpenVINO Python bindings category: TF FE OpenVINO TensorFlow FrontEnd category: PyTorch FE OpenVINO PyTorch Frontend category: JAX FE OpenVINO JAX FrontEnd labels Nov 18, 2024
@github-actions github-actions bot added category: CI OpenVINO public CI github_actions Pull requests that update GitHub Actions code category: NPU OpenVINO NPU plugin labels Nov 19, 2024
@vshampor vshampor marked this pull request as ready for review November 19, 2024 12:22
@vshampor vshampor requested review from a team as code owners November 19, 2024 12:22
#include "common_test_utils/test_assertions.hpp"
#include "common_test_utils/type_prop.hpp"
#include "openvino/openvino.hpp"
#include "openvino/opsets/opset13.hpp"
Contributor:

Remove it and include the required operator headers directly.

Contributor Author:

Done

Contributor:

Still not changed but can be optionally done later.

Contributor Author:

Removed the opset13.hpp and replaced it with the direct header for Parameter

src/core/tests/type_prop/paged_attention.cpp (3 outdated, resolved threads)
#include "common_test_utils/test_assertions.hpp"
#include "common_test_utils/type_prop.hpp"
#include "openvino/openvino.hpp"
#include "openvino/opsets/opset13.hpp"
Contributor:

Still not changed but can be optionally done later.

Contributor:

Is this file required?
If yes, add a comment explaining why, to prevent its removal, since it contains only an empty test.

Contributor Author:

Added the comment inside the file in question.

float* current_rotation_coeffts_cos_ptr,
float* current_rotation_coeffts_sin_ptr,
size_t num_vectorized_elements_per_iteration,
size_t is_underutilizing) {
Contributor:

There is a common industry term, "tail", for the processing of remaining elements, so I would propose renaming this to "is_tail" for better readability.

Contributor Author:

Done

Comment on lines 221 to 223
rotate_kv_cache_block_hw(cache_block_ptr, block_rotation_coefficients_ptr, num_heads, block_size, embedding_size);
#else
rotate_kv_cache_block_sw(cache_block_ptr, block_rotation_coefficients_ptr, num_heads, block_size, embedding_size);
Contributor:

Not sure the hw/sw suffix is the clearest naming. Devs usually prefer opt/ref or vector/scalar.

Contributor Author:

Done with opt/ref



auto cache_values_x = _mm512_undefined_ps();
auto cache_values_y = _mm512_undefined_ps();

if (!is_underutilizing) {
Contributor:

The code sample @luo-cheng2021 is referring to is a common approach for vectorization without additional perf penalties.
@vshampor I would highly recommend trying the proposed way and comparing the performance (not necessarily in the bounds of this PR; this is rather a follow-up task). It should have some impact on loops with a small number of iterations.

Contributor:

Seems like the level-zero submodule change should be reverted.

Contributor Author:

Done

@github-actions github-actions bot added category: dependency_changes Pull requests that update a dependency file and removed category: NPU OpenVINO NPU plugin labels Dec 2, 2024
@github-actions github-actions bot removed the category: dependency_changes Pull requests that update a dependency file label Dec 3, 2024
@vshampor vshampor force-pushed the token_rotation branch 2 times, most recently from 0e3ce47 to 1814957 Compare December 3, 2024 20:58
for (size_t embedding_pair_idx = 0; embedding_pair_idx < embedding_size / 2; embedding_pair_idx++) {
// NB: below is the llama-style rotation (x-like values are in the first half of the embedding vector,
// y-like values are in the second half), which is different from the original RoFormer style (x- and y-
// values are interleaved), but still preserves the relative positional encoding property
Contributor:

but still preserves the relative positional encoding property

Do you mean it works correctly even if a model uses interleaved mode?

Contributor Author:

The invariant only holds if we consider the use case starting from model-train time, i.e. models with RoFormer-like, "interleaved" RoPE are supposed to be able to get trained up to roughly the same accuracy as the models with Llama-like "rotate_half" RoPE. At runtime the two are not interchangeable, of course.

Right now that is also a limitation of this cache rotation implementation: only the rotate_half version is supported, and for proper functioning across models we should implement the rest of the RoPE variants (as in RoPEExecutorRotateHalf), with detection of the proper variant based on model metadata.
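The two layouts discussed in this thread can be sketched side by side (hypothetical helper names). Both apply the same per-pair 2D rotations, so either convention preserves the relative positional encoding property when used consistently from training onward; but their outputs are element permutations of each other, so they are not interchangeable at runtime:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Llama-style "rotate_half": element i pairs with element i + half.
std::vector<float> rope_rotate_half(std::vector<float> v,
                                    const std::vector<float>& theta) {
    const size_t half = v.size() / 2;
    for (size_t i = 0; i < half; ++i) {
        const float x = v[i], y = v[i + half];
        v[i] = x * std::cos(theta[i]) - y * std::sin(theta[i]);
        v[i + half] = x * std::sin(theta[i]) + y * std::cos(theta[i]);
    }
    return v;
}

// Original RoFormer-style "interleaved": element 2i pairs with element 2i + 1.
std::vector<float> rope_interleaved(std::vector<float> v,
                                    const std::vector<float>& theta) {
    for (size_t i = 0; i < v.size() / 2; ++i) {
        const float x = v[2 * i], y = v[2 * i + 1];
        v[2 * i] = x * std::cos(theta[i]) - y * std::sin(theta[i]);
        v[2 * i + 1] = x * std::sin(theta[i]) + y * std::cos(theta[i]);
    }
    return v;
}
```

Rotating the same input with the same angles through the two helpers yields the same values landing at different indices, which is why a cache-rotation kernel written for one convention cannot be applied to a model trained with the other.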

Comment on lines +403 to +406
auto rotation_coefficients = setName(std::make_shared<v0::Parameter>(element::f32, PartialShape{-1}),
"rotation_coefficients." + std::to_string(layer_index - 1));
auto rotated_block_indices = setName(std::make_shared<v0::Parameter>(element::i32, PartialShape{-1}),
"rotated_block_indices." + std::to_string(layer_index - 1));
Contributor @slyalin (Dec 4, 2024):

Please specify an explicit layout for rotation_coefficients. There are more details in the spec presentation, but there is room for improvement in terms of clarity: what are these coefficients, and how are they packed? Is it really agnostic to interleaved vs. non-interleaved? It is not obvious from the n-dim rotation math.

Contributor:

Expressed my concerns about inefficient coefficient packing for PA in the related GenAI PR: openvinotoolkit/openvino.genai#987.
