Armv8-A Row-major Kernel Improvements #698
- Only DGEMM at this moment.
- Prefetch whole lines.
- Scatter prefetching insts.
Instead of clearing C rows, use FMUL for the first k iteration so that instructions are saved.
Instead of loading from stack, directly pass regs in. Arm64 has 30 regs for use. This may or may not speed up a tiny bit.
Forgot to commit the header for ad73717.
- Init k-loop clears C.
- Scattered C preloading.
Thanks @xrq-phys! I've asked Jeff to take a look at the new kernel for feedback. I think he and his application could stand to benefit from this, given the inherent advantage row-preferring kernels have with left-sided
Happy holidays! 🎄 🎁 🍾
Hi there, I know this is a bit old, but I came across this change from this paper. I was just wondering what the status is for having this (and other changes) merged upstream, and whether there is a plan to do so?
Hey @GodTamIt, thanks for your inquiry. I guess we're still waiting on @jdiamondGitHub to look over this PR. I'll reach out to him separately as well.
Details:
- Integrated changes from PR #698 to enable testing in the context of the 'stable' branch. These changes add row-preferential sgemm and dgemm microkernels for the armv8a kernel set.
- Updated the 'altra' subconfig to easily switch between the previous (column-preferential) ukernel and the aforementioned row-pref ukernel.
Status
This is an 8x6 row-major kernel for ARMv8-A, so its internal structure is basically the same as the current 6x8 column-preferring one.
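For readers unfamiliar with the layout, here is a minimal scalar sketch in C of what an 8x6 row-preferring microkernel computes. It is not the assembly from this PR. The function name `dgemm_ref_8x6`, the argument order, and the omission of `alpha`/`beta` scaling are all assumptions made for illustration. It also peels the first `k` iteration into a plain multiply, mirroring the `fmul`-instead-of-clearing idea mentioned in the commit notes.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 8, NR = 6 };  /* microtile shape: 8 rows by 6 columns */

/* Hypothetical reference kernel: C(8x6) := A(8xk) * B(kx6).
 * a    : packed A micro-panel, MR contiguous elements per k step
 * b    : packed B micro-panel, NR contiguous elements per k step
 * c    : output tile stored row-major
 * rs_c : distance in elements between consecutive rows of c
 * alpha/beta scaling is left out to keep the sketch short. */
static void dgemm_ref_8x6(size_t k, const double *a, const double *b,
                          double *c, size_t rs_c)
{
    double acc[MR][NR];

    assert(k > 0);  /* the first iteration is peeled, so k must be >= 1 */

    /* First k iteration: plain multiply (the "fmul" step). This writes
     * every accumulator, so no separate zeroing pass is needed. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            acc[i][j] = a[i] * b[j];

    /* Remaining iterations: multiply-accumulate (the "fmla" step). */
    for (size_t p = 1; p < k; ++p) {
        const double *ap = a + p * MR;
        const double *bp = b + p * NR;
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += ap[i] * bp[j];
    }

    /* Row-preferring store: each row of the tile is written contiguously. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * rs_c + j] = acc[i][j];
}
```

In the real kernel the accumulators live in vector registers and the inner loops are unrolled by hand. The sketch only shows the data layout and the peeled first iteration.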
Updates
- The first iteration of the `k`-loop uses `fmul` instead of `fmla`. The codepath within the assembly is arranged so that this introduces (basically) no additional branching cost.
Restrictions
This kernel assumes hardware prefetching for packed A/B blocks (so as not to bother the pipeline with additional instructions or the DMA with additional loads).
Older chips like ThunderX2 may not perform well with it since they may have no hardware prefetching at all, while newer ones like Amazon's C6g tend to be happier with it.
This update also contains changes that are something of a prerequisite for my `gemmsup+packm` work here, which I'd also like to merge later as a BLIS sandbox.