[ARM] [SDPA] SVE implementation of MHASingleToken for FP32 #27273

NishantPrabhuFujitsu · 2024-10-28T11:26:20Z

Details:

Adds SVE FP32 implementations for functions called during execution of MHASingleToken for SVE-128, SVE-256 and SVE-512 platforms.
SVE implementations are compiled only if runtime support for SVE is detected on the hardware; otherwise the code falls back to Neon.
Adds a new implementation of the exponential function exp_ps_<isa> that uses fewer FMA operations. It executes ~18% faster and has better output precision.

Note: I am aware of the Neon FP16 implementation of SDPA added recently. To accommodate this, the current SVE changes will be used only if the hardware does not have ARM FP16 support. I will follow up with SVE FP16 implementations soon.

[SVE] Benchmarking results

Below are the benchmarking results for the execution time of each ported function. Measurements were performed by running each function individually on dummy inputs (128 fp32 elements) for 1,000,000 iterations and computing the average time (in microseconds).
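The measurement procedure described above can be sketched roughly as follows. This is a minimal sketch and not the harness used for this PR: `average_microseconds` and `scale_in_place` are hypothetical names, and the stand-in kernel is a plain scalar loop rather than one of the ported functions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `fn` `iterations` times and returns the
// average wall-clock time per call in microseconds.
template <typename Fn>
double average_microseconds(Fn&& fn, std::size_t iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        fn();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / static_cast<double>(iterations);
}

// Hypothetical stand-in kernel: multiplies every element by `factor`
// and returns a checksum so the work has an observable result.
static float scale_in_place(std::vector<float>& data, float factor) {
    float checksum = 0.0f;
    for (float& v : data) {
        v *= factor;
        checksum += v;
    }
    return checksum;
}
```

A call such as `average_microseconds([&] { sink = scale_in_place(data, 1.0f); }, 1000000)` with a 128-element `data` vector and a `volatile float sink` would mirror the setup described above; writing the result to a volatile variable is one way to keep the compiler from removing the loop.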

Execution time of MHASingleToken as a whole was also measured for two LLMs, the results of which are shown below. For LlaMA-3-8B, the SVE-128 and SVE-512 systems at my disposal did not have enough memory, so only SVE-256 results are shown. While there is an improvement overall, these results could be contaminated with run-to-run variation due to the small execution time of the kernel.

Benchmarking details: A prompt length of 108 tokens was used; the total time for generating 50 tokens was measured and the average execution time was computed.

New exponential implementation

It is based on the discussion in these slides (https://www.slideshare.net/slideshow/hpc-phys20201203/239717194#23); they come from a past talk at Fujitsu, hence the document is in Japanese, sorry! The algorithm differs slightly from the current implementation: it uses the fexpa instruction available on ARM and requires only 3 Taylor expansion terms (2 FMA operations) to be precise to the 8th decimal place.
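To illustrate why so few Taylor terms can be enough, here is a portable scalar sketch of the general idea; it is not the code from this PR. It assumes a table-driven reduction with 64 steps per octave, which is my understanding of what fexpa provides for FP32, and it emulates that lookup with `std::exp2`. The function name `exp_sketch` and the constants are illustrative only.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar model of a table-assisted exponential.
// Identity used: exp(x) = 2^(k/64) * exp(r), where
//   k = round(x * 64 / ln2) and r = x - k * ln2 / 64, so |r| <= ln2 / 128.
static float exp_sketch(float x) {
    assert(std::isfinite(x));  // overflow/underflow handling is omitted here
    const float ln2 = 0.69314718056f;
    const float inv_step = 64.0f / ln2;  // table steps per unit of x
    const float step = ln2 / 64.0f;      // width of one table step

    const float k = std::nearbyint(x * inv_step);
    const float r = x - k * step;  // small remainder

    // Assumption: on SVE hardware this factor would come from fexpa;
    // here it is emulated with the standard library.
    const float scale = std::exp2(k / 64.0f);

    // Three Taylor terms, 1 + r + r^2/2, evaluated with two FMAs.
    const float poly = std::fma(r, std::fma(r, 0.5f, 1.0f), 1.0f);
    return scale * poly;
}
```

Because |r| stays below roughly 0.0055, the first dropped term r^3/6 is on the order of 1e-8, which is why the short polynomial suffices in this model. A production version would also need a more careful range reduction (for example splitting ln2 into high and low parts) and clamping for very large or very small inputs.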

Our benchmarking results showed this implementation to be 44%-58% faster than the existing Neon implementation. It is ~18% faster than an SVE port of the algorithm currently used in the Neon implementation.

In this PR, the new implementation is called by default. The SVE port of the existing Neon implementation has also been retained in case it is needed.

dmitry-gorokhov · 2024-10-29T07:37:20Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/common.hpp

@@ -246,6 +249,79 @@ static constexpr size_t vec_len_f16_neon = vec_len_neon / sizeof(ov::float16);
 #endif

 #ifdef OPENVINO_ARCH_ARM64
+#if defined(__ARM_FEATURE_SVE) && !defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)


Could you please clarify why we need !defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC) check here?

Let's use HAVE_SVE instead of __ARM_FEATURE_SVE

It was a hotfix I had added to silence some errors when testing out my changes initially. It is no longer needed, so I have removed it in the latest commit.

Updated in the latest commit.
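For readers unfamiliar with this kind of guard, the sketch below shows the general pattern of selecting an SVE path behind a build-system macro with a fallback otherwise. It is only an illustration: it assumes `HAVE_SVE` is defined by the build when SVE code generation is enabled, the function `sum_f32` is hypothetical, and the fallback here is a plain scalar loop standing in for the Neon path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(HAVE_SVE)
#    include <arm_sve.h>
#endif

// Hypothetical example kernel: sums n floats.
static float sum_f32(const float* src, std::size_t n) {
    assert(src != nullptr || n == 0);
#if defined(HAVE_SVE)
    // Vector-length-agnostic loop: svcntw() is the number of 32-bit lanes,
    // and the predicate masks off lanes past the end of the buffer.
    svfloat32_t acc = svdup_n_f32(0.0f);
    for (uint64_t i = 0; i < n; i += svcntw()) {
        const svbool_t pg = svwhilelt_b32(i, static_cast<uint64_t>(n));
        const svfloat32_t v = svld1_f32(pg, src + i);
        acc = svadd_f32_m(pg, acc, v);  // inactive lanes keep acc unchanged
    }
    return svaddv_f32(svptrue_b32(), acc);
#else
    // Stand-in for the non-SVE path (Neon in the actual plugin).
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += src[i];
    }
    return acc;
#endif
}
```

Keying the guard on a project-level macro rather than directly on a compiler-defined one keeps the decision in the build configuration, which appears to be the intent of the review suggestion above.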

dmitry-gorokhov · 2024-12-12T11:28:08Z

build_jenkins

dmitry-gorokhov · 2024-12-13T07:00:22Z

@NishantPrabhuFujitsu All tests pass now! The only remaining thing is to fix the code-style job: https://github.com/openvinotoolkit/openvino/actions/runs/12295407288/job/34312381372?pr=27273.
Here is some info on how to do it locally (apply clang-format): https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/coding_style.md

NishantPrabhuFujitsu · 2024-12-13T10:08:00Z

@dmitry-gorokhov I've followed the instructions in the documentation you shared and a lot of files had indentation-like changes in the end. I've not used this before so I'm not 100% sure if the changes have come in correctly (~70 files had changes). I'll wait to see if the pipelines are still succeeding.

dmitry-gorokhov · 2024-12-13T10:29:56Z

Yeah, that is expected. We haven't applied clang-format for ARM files yet.
Code style job passes now.

dmitry-gorokhov · 2024-12-13T10:30:18Z

build_jenkins

dmitry-gorokhov · 2024-12-13T11:17:56Z

src/plugins/intel_cpu/src/nodes/kernels/aarch64/jit_uni_eltwise_generic.cpp

+              OV_CASE(Algorithm::EltwiseIsInf, jit_is_inf_emitter),
+              OV_CASE(Algorithm::EltwiseIsNaN, jit_is_nan_emitter),
+              OV_CASE(Algorithm::EltwiseLessEqual, jit_less_equal_emitter),
+              OV_CASE(Algorithm::EltwiseLogicalAnd, jit_logical_and_emitter),


For some reason EltwiseLogicalOr was deleted from this list, which causes tests to fail. Please restore it.

Not sure what caused this. It's not the style check because after adding it back and running the style check, it didn't get removed. However I did notice it disappear momentarily and reappear when I was rebasing my branch after syncing master. Anyway, it should be there this time.

By the way, this indeed is happening due to clang-format. When my colleague built #27841 with clang_format_fix_all, EltwiseOr disappeared from both places. Not sure about the reason for this behavior though.

dmitry-gorokhov · 2024-12-13T11:18:10Z

src/plugins/intel_cpu/src/nodes/kernels/aarch64/jit_uni_eltwise_generic.cpp

+        OV_CASE(Algorithm::EltwiseIsInf, ov::intel_cpu::aarch64::jit_is_inf_emitter),
+        OV_CASE(Algorithm::EltwiseLessEqual, ov::intel_cpu::aarch64::jit_less_equal_emitter),
+        OV_CASE(Algorithm::EltwiseLogicalAnd, ov::intel_cpu::aarch64::jit_logical_and_emitter),
+        OV_CASE(Algorithm::EltwiseLogicalNot, ov::intel_cpu::aarch64::jit_logical_not_emitter),


For some reason EltwiseLogicalOr was deleted from this list, which causes tests to fail. Please restore it.

Resolved in latest commit.

dmitry-gorokhov · 2024-12-16T04:55:56Z

build_jenkins

NishantPrabhuFujitsu · 2024-12-16T05:38:27Z

https://github.com/openvinotoolkit/openvino/actions/runs/12317566674/job/34451409053?pr=27273#step:12:104

Any idea why this is happening? I didn't make any changes to os_flags.cmake in the last few commits. Same error in one of the required checks too.

dmitry-gorokhov · 2024-12-16T05:48:04Z

Looks like a CI issue. I restarted the job.

NishantPrabhuFujitsu · 2024-12-17T03:45:59Z

There seems to be some issue with a unit test in PaddlePaddle front end (https://github.com/openvinotoolkit/openvino/actions/runs/12355815264/job/34483519304#step:17:1410) and building OpenVINO tokenizers (https://github.com/openvinotoolkit/openvino/actions/runs/12355815241/job/34488653609#step:10:138). The second one doesn't look like a code issue to me, but any idea about the first one?

### Details

In continuation with #27273, adds SVE FP16 implementations for functions called during execution of MHASingleToken for SVE-128, SVE-256 and SVE-512 platforms. SVE implementations are compiled only if runtime support for SVE is detected on the hardware, otherwise it falls back to Neon.

### Benchmarking results

Below are the benchmarking results of execution time of each ported function. Measurements were performed by running each function individually on dummy inputs (128 fp16 elements) for 1,000,000 iterations and computing average time (in micro-seconds).

![image](https://github.com/user-attachments/assets/85efd5d9-da91-4d46-a1c3-82a440d17470)

NishantPrabhuFujitsu requested review from a team as code owners October 28, 2024 11:26

NishantPrabhuFujitsu requested review from ilya-lavrenov and removed request for a team October 28, 2024 11:26

github-actions bot added the category: CPU (OpenVINO CPU plugin), category: dependency_changes (Pull requests that update a dependency file), no-match-files and category: NPU (OpenVINO NPU plugin) labels Oct 28, 2024

sys-openvino-ci added the ExternalPR (External contributor) label Oct 28, 2024

dmitry-gorokhov self-assigned this Oct 28, 2024

github-actions bot added the category: build (OpenVINO cmake script / infra) label Oct 28, 2024

ilya-lavrenov added the platform: arm (OpenVINO on ARM / ARM64) label Oct 28, 2024

ilya-lavrenov reviewed Oct 28, 2024

src/plugins/intel_cpu/CMakeLists.txt (outdated, resolved)

src/plugins/intel_cpu/CMakeLists.txt (resolved)

dmitry-gorokhov reviewed Oct 29, 2024

NishantPrabhuFujitsu requested review from a team as code owners October 29, 2024 11:12

github-actions bot added the category: inference (OpenVINO Runtime library - Inference), category: GPU (OpenVINO GPU plugin) and category: Python API (OpenVINO Python bindings) labels and removed the no-match-files label Oct 29, 2024

ilya-lavrenov reviewed Oct 29, 2024

src/inference/src/system_conf.cpp (outdated, resolved)

ilya-lavrenov reviewed Oct 29, 2024

src/plugins/intel_cpu/CMakeLists.txt (outdated, resolved)

ilya-lavrenov reviewed Oct 29, 2024

cmake/developer_package/compile_flags/os_flags.cmake (outdated, resolved)

NishantPrabhuFujitsu force-pushed the mha-single-token-arm-sve-f32 branch from be60fbc to a81bf1e Compare December 12, 2024 11:24

NishantPrabhuFujitsu force-pushed the mha-single-token-arm-sve-f32 branch from 9dca018 to 1f13550 Compare December 13, 2024 10:04

dmitry-gorokhov reviewed Dec 13, 2024

NishantPrabhuFujitsu added 6 commits December 13, 2024 20:02

6182953 [CPU][ARM] Adds SVE F32 implementation for MHASingleToken sub-functions
0ef10c2 fixes dtype conflict in comparison
30a2a56 removed SVE check for C; cmake cleanup
eee70bc arch list fix in intel_cpu cmakelists
a0c705f clang code style fixes
1407cfb added elementwise logical or

NishantPrabhuFujitsu force-pushed the mha-single-token-arm-sve-f32 branch from 13d31e1 to 1407cfb Compare December 13, 2024 14:34

dmitry-gorokhov enabled auto-merge December 16, 2024 09:25

dmitry-gorokhov added this pull request to the merge queue Dec 16, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 16, 2024

dmitry-gorokhov added this pull request to the merge queue Dec 17, 2024

Merged via the queue into openvinotoolkit:master with commit b543d0b Dec 17, 2024
187 checks passed

dmitry-gorokhov mentioned this pull request Dec 17, 2024

Aarch64 paged attention enablement #27841

Merged

NishantPrabhuFujitsu mentioned this pull request Dec 23, 2024

[CPU] [ARM] SVE FP16 functions for MHASingleToken kernel #28182

Merged
