-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ARM] [SDPA] SVE implementation of MHASingleToken for FP32 #27273
[ARM] [SDPA] SVE implementation of MHASingleToken for FP32 #27273
Conversation
@@ -246,6 +249,79 @@ static constexpr size_t vec_len_f16_neon = vec_len_neon / sizeof(ov::float16); | |||
#endif | |||
|
|||
#ifdef OPENVINO_ARCH_ARM64 | |||
#if defined(__ARM_FEATURE_SVE) && !defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Could you please clarify why we need
!defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
check here? - Lets use
HAVE_SVE
instead of__ARM_FEATURE_SVE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It was a hotfix I had added to silence some errors when testing out my changes initially. They are no longer needed, so I have removed them in the latest commit.
- Updated in the latest commit.
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/mha_single_token.cpp
Outdated
Show resolved
Hide resolved
be60fbc
to
a81bf1e
Compare
build_jenkins |
@NishantPrabhuFujitsu Now all tests pass well! The only remaining thing is to fix code-style job: https://github.com/openvinotoolkit/openvino/actions/runs/12295407288/job/34312381372?pr=27273. |
9dca018
to
1f13550
Compare
@dmitry-gorokhov I've followed the instructions in the documentation you shared and a lot of files had indentation-like changes in the end. I've not used this before so I'm not 100% sure if the changes have come in correctly (~70 files had changes). I'll wait to see if the pipelines are still succeeding. |
Yeah, that is expected. We haven't applied clang-format for ARM files yet. |
build_jenkins |
OV_CASE(Algorithm::EltwiseIsInf, jit_is_inf_emitter), | ||
OV_CASE(Algorithm::EltwiseIsNaN, jit_is_nan_emitter), | ||
OV_CASE(Algorithm::EltwiseLessEqual, jit_less_equal_emitter), | ||
OV_CASE(Algorithm::EltwiseLogicalAnd, jit_logical_and_emitter), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason EltwiseLogicalOr is deleted from this list. It leads to failed tests. Please return it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what caused this. It's not the style check because after adding it back and running the style check, it didn't get removed. However I did notice it disappear momentarily and reappear when I was rebasing my branch after syncing master. Anyway, it should be there this time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, this indeed is happening due to clang-format
. When my colleague built #27841 with clang_format_fix_all
, EltwiseOr
disappeared from both places. Not sure about the reason for this behavior though.
OV_CASE(Algorithm::EltwiseIsInf, ov::intel_cpu::aarch64::jit_is_inf_emitter), | ||
OV_CASE(Algorithm::EltwiseLessEqual, ov::intel_cpu::aarch64::jit_less_equal_emitter), | ||
OV_CASE(Algorithm::EltwiseLogicalAnd, ov::intel_cpu::aarch64::jit_logical_and_emitter), | ||
OV_CASE(Algorithm::EltwiseLogicalNot, ov::intel_cpu::aarch64::jit_logical_not_emitter), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason EltwiseLogicalOr is deleted from this list. It leads to failed tests. Please return it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved in latest commit.
13d31e1
to
1407cfb
Compare
build_jenkins |
Any idea why this is happening? I didn't make any changes to |
Looks like CI issue. I restarted the job. |
There seems to be some issue with a unit test in PaddlePaddle front end (https://github.com/openvinotoolkit/openvino/actions/runs/12355815264/job/34483519304#step:17:1410) and building OpenVINO tokenizers (https://github.com/openvinotoolkit/openvino/actions/runs/12355815241/job/34488653609#step:10:138). The second one doesn't look like a code issue to me, but any idea about the first one? |
…oolkit#27273) ### Details: - Adds SVE FP32 implementations for functions called during execution of `MHASingleToken` for SVE-128, SVE-256 and SVE-512 platforms. - SVE implementations are compiled only if runtime support for SVE is detected on the hardware, otherwise it falls back to Neon. - Adds a new implementation for exponential function `exp_ps_<isa>` using fewer FMA operations. Executes ~18% faster and has better output precision. **Note:** I am aware of the Neon FP16 implementation of SDPA added recently. To accommodate for this, the current SVE changes will be used only if the hardware does not have ARM FP16 support. I will follow up with SVE FP16 implementations soon. ### [SVE] Benchmarking results Below are the benchmarking results of execution time of each ported function. Measurements were performed by running each function individually on dummy inputs (128 fp32 elements) for 1,000,000 iterations and computing average time (in micro-seconds). ![image](https://github.com/user-attachments/assets/3f82238f-af7e-4b68-b4b1-259cf389e41a) Execution time of `MHASingleToken` as a whole was also measured for two LLMs, the results of which are shown below. For LlaMA-3-8B, the SVE-128 and SVE-512 systems at my disposal did not have enough memory, so only SVE-256 results are shown. While there is an improvement overall, these results could be contaminated with run-to-run variation due to the small execution time of the kernel. **Benchmarking details:** Prompt length of 108 tokens was used; total time for generating 50 tokens was measured and average execution time was computed. ![image](https://github.com/user-attachments/assets/893c1c46-085f-46af-ab5a-2c1481c75f68) ### New exponential implementation It is based on the discussion in [these slides](https://www.slideshare.net/slideshow/hpc-phys20201203/239717194#23) (this is based on a past talk in Fujitsu hence the document is in Japanese, sorry!). The algorithm followed is slightly different from the current implementation, in that it uses `fexpa` instruction available on ARM and requires only 3 Taylor expansion terms (2 FMA operations) to be precise until the 8th decimal place. Our benchmarking results showed this implementation to be 44%-58% faster than the existing Neon implementation. It is ~18% faster than the SVE implementation of the current algorithm in Neon. ![image](https://github.com/user-attachments/assets/117df21d-3977-499c-8ab8-8f4346286113) In this PR, the new implementation is called by default. The SVE port of the existing Neon implementation has also been retained, if needed.
### Details In continuation with #27273, adds SVE FP16 implementations for functions called during execution of MHASingleToken for SVE-128, SVE-256 and SVE-512 platforms. SVE implementations are compiled only if runtime support for SVE is detected on the hardware, otherwise it falls back to Neon. ### Benchmarking results Below are the benchmarking results of execution time of each ported function. Measurements were performed by running each function individually on dummy inputs (128 fp16 elements) for 1,000,000 iterations and computing average time (in micro-seconds). ![image](https://github.com/user-attachments/assets/85efd5d9-da91-4d46-a1c3-82a440d17470)
Details:
MHASingleToken
for SVE-128, SVE-256 and SVE-512 platforms.exp_ps_<isa>
using fewer FMA operations. Executes ~18% faster and has better output precision.Note: I am aware of the Neon FP16 implementation of SDPA added recently. To accommodate for this, the current SVE changes will be used only if the hardware does not have ARM FP16 support. I will follow up with SVE FP16 implementations soon.
[SVE] Benchmarking results
Below are the benchmarking results of execution time of each ported function. Measurements were performed by running each function individually on dummy inputs (128 fp32 elements) for 1,000,000 iterations and computing average time (in micro-seconds).
Execution time of
MHASingleToken
as a whole was also measured for two LLMs, the results of which are shown below. For LlaMA-3-8B, the SVE-128 and SVE-512 systems at my disposal did not have enough memory, so only SVE-256 results are shown. While there is an improvement overall, these results could be contaminated with run-to-run variation due to the small execution time of the kernel.Benchmarking details: Prompt length of 108 tokens was used; total time for generating 50 tokens was measured and average execution time was computed.
New exponential implementation
It is based on the discussion in these slides (this is based on a past talk in Fujitsu hence the document is in Japanese, sorry!). The algorithm followed is slightly different from the current implementation, in that it uses
fexpa
instruction available on ARM and requires only 3 Taylor expansion terms (2 FMA operations) to be precise until the 8th decimal place.Our benchmarking results showed this implementation to be 44%-58% faster than the existing Neon implementation. It is ~18% faster than the SVE implementation of the current algorithm in Neon.
In this PR, the new implementation is called by default. The SVE port of the existing Neon implementation has also been retained, if needed.