forked from bluss/matrixmultiply
[pull] master from bluss:master #3
Open
pull wants to merge 117 commits into mesalock-linux:master from bluss:master
Conversation
Co-authored-by: Geordon Worley <[email protected]>
no_std support
Add fast test switch so we can skip big matrices in cross tests
Add GitHub Actions to replace Travis
This binary makes it easy to run custom benchmarks of bigger matrices with custom size, layout and threading.
Add benchmark runner as an "example" binary
For std and threading we can use the thread_local!() macro, but for no_std we'll need to use a stack array instead.
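A minimal sketch of that split, assuming a cargo feature named `std` and an illustrative buffer size; `with_pack_cache` is a hypothetical name, not this crate's API:

```rust
#[cfg(feature = "std")]
fn with_pack_cache<R>(f: impl FnOnce(&mut [u8]) -> R) -> R {
    use std::cell::RefCell;
    thread_local! {
        // One long-lived packing buffer per thread.
        static PACK_CACHE: RefCell<Vec<u8>> = RefCell::new(vec![0; 4096]);
    }
    PACK_CACHE.with(|cache| f(&mut cache.borrow_mut()))
}

#[cfg(not(feature = "std"))]
fn with_pack_cache<R>(f: impl FnOnce(&mut [u8]) -> R) -> R {
    // No thread-local storage without std: fall back to a fresh
    // stack array on every call.
    let mut buf = [0u8; 4096];
    f(&mut buf)
}
```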
When we should use threads depends a lot on hardware, so even having a heuristic is risky, but we'll add one and can improve it later.
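For illustration, a heuristic of that shape might look like the following; the threshold and work estimate are assumptions, not the crate's tuned values:

```rust
/// Pick a thread count from a rough work estimate for C = A B.
fn threads_for(m: usize, k: usize, n: usize, max_threads: usize) -> usize {
    let work = 2 * m * k * n; // approximate multiply-add count
    // Below some cutoff, threading overhead outweighs any gain.
    if work < 1 << 20 { 1 } else { max_threads }
}
```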
On non-x86, this macro can be unused.
These are used all the time for profiling (and only affect development, so they might as well be enabled).
This is a performance fix: using one Lazy/OnceCell instead of two separate ones saves a little time (just a few ns), which was visible in the benchmark for (too) small matrices.
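The shape of that fix, sketched with invented field names; the point is a single initialization check on the hot path instead of two:

```rust
use once_cell::sync::OnceCell;

// Both lazily-computed values live behind one cell, so callers pay
// one "initialized yet?" check rather than one per cell.
struct Config { kernel_id: u8, pack_align: usize }

static CONFIG: OnceCell<Config> = OnceCell::new();

fn config() -> &'static Config {
    CONFIG.get_or_init(|| Config {
        kernel_id: 0,   // placeholder for runtime CPU detection
        pack_align: 32, // placeholder
    })
}
```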
Use repr(align(x)) so we don't have to oversize and manually align the mask buffer. Also use an UnsafeCell to remove the (very small, few-ns) overhead of borrowing the RefCell. (Its borrowing was pointless anyway, since we held the raw pointer much longer than the RefCell "borrow".)
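A sketch of that combination, with illustrative type and field names:

```rust
use core::cell::UnsafeCell;

// The type carries its own alignment, so no oversizing or manual
// pointer fixup is needed for the mask buffer.
#[repr(align(32))]
struct MaskBuffer {
    mask: UnsafeCell<[u8; 32]>,
}

fn mask_ptr(buf: &MaskBuffer) -> *mut u8 {
    // Getting the pointer is safe; *dereferencing* it requires the
    // exclusive access that RefCell previously checked at runtime.
    buf.mask.get() as *mut u8
}
```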
Remove the bias factor for matrix size for aarch64 when computing the max number of threads for a given matrix.
This function is a special case and should never be inlined, so put it out to the side.
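In Rust that is one attribute; a minimal illustration on a placeholder function:

```rust
#[inline(never)] // keep this rare special case out of callers' hot loops
fn handle_special_case() {
    // ... cold fallback path ...
}
```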
Using a reference type (such as a slice) for either pack or a in the packing function makes rustc emit a noalias annotation for that pointer, which helps the optimizer in some cases. What we want is for the compiler to see that the pointers pack and a, and pointers derived from them, can never alias; then it has more freedom to rewrite the operations in the packing loops. The pack buffer is contiguous, so it's the only choice for passing one of the two arguments as a slice. Shown to slightly speed up the layout_f32 benchmark for sgemm (not dgemm) on M1. A way to get the same effect without a slice, like a 'restrict' keyword, would be good for this crate.
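A sketch of that signature shape (names and stride handling are illustrative, not the crate's exact packing routine):

```rust
/// Pack a rows x cols block of the strided source `a` into the
/// contiguous buffer `pack`. Passing `pack` as `&mut [T]` lets rustc
/// mark it noalias; the strided `a` must stay a raw pointer.
unsafe fn pack_block<T: Copy>(pack: &mut [T], a: *const T,
                              rsa: isize, csa: isize,
                              rows: usize, cols: usize) {
    let mut i = 0;
    for col in 0..cols {
        for row in 0..rows {
            pack[i] = *a.offset(row as isize * rsa + col as isize * csa);
            i += 1;
        }
    }
}
```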
Check the Rust version in the build script and conditionally use the new kernels. One could choose to raise the MSRV to 1.61 specifically for aarch64 here, but being unobtrusive seems like the right thing to do, and more version-sensitive features are likely coming because of the subject matter (SIMD, asm). Autocfg seems like one of the best choices for the version check; it's already used by the num crates.
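A build.rs sketch along those lines (the cfg name is illustrative; autocfg goes in [build-dependencies]):

```rust
// build.rs
fn main() {
    let ac = autocfg::new();
    // Enable the new AArch64 kernels only on rustc >= 1.61, instead of
    // raising the crate-wide MSRV.
    if ac.probe_rustc_version(1, 61) {
        println!("cargo:rustc-cfg=has_aarch64_simd");
    }
    autocfg::rerun_path("build.rs");
}
```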
Arm64/AArch64 Neon kernels
For complex, we'll want to use a different packing function. Add packing into the GemmKernel interface so that kernels can request a different packing function. The standard packing function is unchanged but gets its own module in the code.
Use a different pack function for complex microkernels which puts real and imaginary parts in separate rows. This enables much better autovectorization for the fallback kernels.
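A sketch of that split layout, assuming complex values stored as [re, im] pairs; names and the row width mr are illustrative:

```rust
/// Pack `mr` complex numbers as one row of real parts followed by one
/// row of imaginary parts, so the fallback kernel's inner loops see
/// plain f32 lanes that autovectorize well.
fn pack_complex_split(pack: &mut [f32], a: &[[f32; 2]], mr: usize) {
    for (i, z) in a.iter().take(mr).enumerate() {
        pack[i] = z[0];      // real row
        pack[mr + i] = z[1]; // imaginary row
    }
}
```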
Custom kernel sizes for FMA and AVX2 are a win for performance, and AVX2 does better than FMA here, so both can be worthwhile.
Because we detect target features to select the kernel, and the kernel can select its own packing functions, we can now specialize the packing functions per target. As matrices get larger, packing performance matters much less, but for small matrix products it contributes more to the runtime. The default packing also already has a special case for contiguous matrices, which happens in C = A B when A is column major and B is row major. The specialization in this commit helps the most outside this special case.
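How that coupling can look on x86-64, as a hedged sketch (all function names are placeholders for the crate's real kernels and packers):

```rust
#[cfg(target_arch = "x86_64")]
mod dispatch {
    pub type Kernel = fn();
    pub type Pack = fn();

    // Placeholder kernels and packing functions.
    fn kernel_avx2() {}
    fn kernel_fma() {}
    fn kernel_fallback() {}
    fn pack_avx2() {}
    fn pack_default() {}

    /// The detected feature picks the microkernel, and the kernel
    /// brings its matching packing function along.
    pub fn select_sgemm() -> (Kernel, Pack) {
        if std::is_x86_feature_detected!("avx2") {
            (kernel_avx2, pack_avx2)
        } else if std::is_x86_feature_detected!("fma") {
            (kernel_fma, pack_default)
        } else {
            (kernel_fallback, pack_default)
        }
    }
}
```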
avx2 + fma with f32::mul_add is a success in autovectorization, while fma alone with f32::mul_add is not (!). For this reason, only call f32::mul_add when we opt in to this combination.
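One way such an opt-in can be spelled, assuming compile-time target features; the gate's exact shape here is a guess, not the crate's code:

```rust
#[inline(always)]
fn fmadd(a: f32, b: f32, c: f32) -> f32 {
    // Only use mul_add when avx2 and fma are both enabled at compile
    // time; fma alone did not autovectorize well.
    #[cfg(all(target_feature = "avx2", target_feature = "fma"))]
    { a.mul_add(b, c) }
    #[cfg(not(all(target_feature = "avx2", target_feature = "fma")))]
    { a * b + c }
}
```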
Remove flags that Miri now uses by default.
cgemm was not tested as no_std in CI.
For Bazel compatibility. Fixes #78
In layout benchmarks, which are used to check packing and kernel sensitivity to memory layout, test some non-contiguous layouts.
A user showed that in certain configurations on macOS, the TLS allocation can even be 8-byte aligned.
Cargo cross does not support this old Rust version anymore; increase the cross version.
Completely distrust repr(align()) on macOS and always manually ensure basic alignment.
We requested 32-byte alignment for s390x, but thread-local storage does not supply it. Lower the requested alignment to 16 in general, to avoid having this problem pop up on other platforms too.
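For reference, the kind of manual fixup the macOS commit implies, as a minimal sketch (the caller oversizes the buffer by align - 1 bytes):

```rust
/// Return the first `align`-aligned address inside `buf`.
/// `align` must be a power of two.
fn aligned_ptr(buf: &mut [u8], align: usize) -> *mut u8 {
    debug_assert!(align.is_power_of_two());
    let addr = buf.as_mut_ptr() as usize;
    let offset = addr.wrapping_neg() & (align - 1); // bytes to next boundary
    unsafe { buf.as_mut_ptr().add(offset) }
}
```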
See Commits and Changes for more details.
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )