Armv8-A Row-major Kernel Improvements #698

xrq-phys · 2022-12-16T17:05:20Z

Status
This is an 8x6 row-major kernel for ARMv8-A, so its internal structure is basically the same as that of the current 6x8 column-preferring one.

Updates

- Instead of clearing the C-microtile registers at the beginning of the assembly, execute the first k-loop iteration using fmul instead of fmla. The code path within the assembly is arranged so that this (basically) introduces no additional branching cost.
- Scatter the prefetching code for C into the microkernel loops.
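The first update can be illustrated in plain C. This is only a scalar sketch under stated assumptions, not the assembly from this PR: the function name `ukr_first_k_mul` is hypothetical, the packed panels are assumed to hold 8 values of A and 6 values of B per k iteration, alpha and beta handling is omitted (C is simply overwritten), and `k >= 1` is assumed.

```c
#include <stddef.h>

enum { MR = 8, NR = 6 }; /* micro-tile shape taken from the PR description */

/* Hypothetical scalar sketch: C (MR x NR, row-major, leading dimension ldc)
 * is overwritten with A_packed * B_packed. Assumes k >= 1. */
void ukr_first_k_mul(size_t k, const double *a, const double *b,
                     double *c, size_t ldc)
{
    double acc[MR][NR];

    /* First k iteration: a plain multiply initialises every accumulator,
     * so no separate zeroing pass is needed (the fmul case). */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            acc[i][j] = a[i] * b[j];

    /* Remaining k iterations: multiply-accumulate (the fmla case). */
    for (size_t p = 1; p < k; ++p) {
        const double *ap = a + p * MR; /* p-th packed column of A */
        const double *bp = b + p * NR; /* p-th packed row of B */
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += ap[i] * bp[j];
    }

    /* Write the accumulated micro-tile back to C. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * ldc + j] = acc[i][j];
}
```

Peeling the first iteration out of the loop is what removes the zeroing instructions; the loop body itself stays branch-free.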

Restrictions
This kernel assumes hardware prefetching for packed A/B blocks (so as not to bother the pipeline with additional instructions or the DMA with additional loads).
Older chips like ThunderX2 may not perform well with it since they may have no hardware prefetching at all, while newer ones like Amazon's C6g tend to be happier with it.
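The division of labour described here (hardware prefetching for the packed A/B panels, software prefetching only for C, spread across the loop) might look roughly like the following scalar sketch. It is not the PR's assembly: `ukr_scattered_c_prefetch` is a hypothetical name, `__builtin_prefetch` is a GCC/Clang builtin used as a stand-in for the prefetch instructions, and the sketch accumulates into C with alpha and beta fixed to 1.

```c
#include <stddef.h>

enum { MR = 8, NR = 6 }; /* micro-tile shape taken from the PR description */

/* Hypothetical scalar sketch: C += A_packed * B_packed, with one C row
 * prefetched per early k iteration instead of MR prefetches issued back
 * to back before the loop. */
void ukr_scattered_c_prefetch(size_t k, const double *a, const double *b,
                              double *c, size_t ldc)
{
    double acc[MR][NR] = {{0.0}};

    for (size_t p = 0; p < k; ++p) {
        /* Scattered C prefetch: row p of the micro-tile, for writing.
         * No prefetch is issued for a or b; the sketch assumes the
         * hardware prefetcher covers the packed panels. */
        if (p < MR)
            __builtin_prefetch(c + p * ldc, 1, 3);

        const double *ap = a + p * MR;
        const double *bp = b + p * NR;
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += ap[i] * bp[j];
    }

    /* By now the C rows touched above should already be on their way
     * into cache. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}
```

When `k` is smaller than `MR`, some rows are simply never prefetched in this sketch; results are unaffected because a prefetch is only a hint. An unrolled assembly loop would place the prefetches at fixed positions rather than behind a branch.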

This update also contains changes that are somewhat prerequisite for my gemmsup+packm work here, which I'd also like to merge later as a BLIS sandbox.


fgvanzee · 2022-12-26T00:47:58Z

Thanks @xrq-phys! I've asked Jeff to take a look at the new kernel for feedback. I think he and his application could stand to benefit from this, given the inherent advantage row-preferring kernels have with left-sided trsm (which is the only trsm code path that BLIS implements).

Happy holidays! 🎄 🎁 🍾

GodTamIt · 2023-10-03T10:52:44Z

Hi there, I know this is a bit old, but I came across this change from this paper.

I was just wondering what the status was for having this (and other changes) merged upstream and/or if there was a plan to do so?

fgvanzee · 2023-10-03T18:54:27Z

Hey @GodTamIt, thanks for your inquiry. I guess we're still waiting on @jdiamondGitHub to look over this PR. I'll reach out to him separately as well.

Details:
- Integrated changes from PR #698 to enable testing in the context of the 'stable' branch. These changes add row-preferential sgemm and dgemm microkernels for the armv8a kernel set.
- Updated the 'altra' subconfig to easily switch between the previous (column-preferential) ukernel and the aforementioned row-pref ukernel.

xrq-phys added 5 commits November 21, 2022 02:39

Arm NEON Improve C-Prefetching for DGEMM

e8068f6

- Only DGEMM at this moment.
- Prefetch whole lines.
- Scatter prefetching insts.

Arm NEON Init. Opt. For DGEMM

ad73717

Instead of clearing C rows, Deploy first-k FMUL so that instructions are saved.

Arm NEON DGEMM Change Regs IO

0ddde0f

Instead of loading from stack, directly pass regs in. Arm64 has 30 regs for use. This may or may not speed up a tiny bit.

Fix Init. Bug

04b5b71

Forget to commit header for ad73717.

Armv8-A Port Row-maj DGEMM Uker Changes to SGEMM

47c63c1

- Init k-loop clears C.
- Scattered C preloading.

fgvanzee requested a review from jdiamondGitHub December 26, 2022 00:44
