[pull] master from bluss:master #3

Open · wants to merge 117 commits into master (base) from bluss:master

Conversation

@pull pull[bot] commented Dec 7, 2020

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

@pull pull[bot] added the ⤵️ pull and merge-conflict (Resolve conflicts manually) labels Dec 7, 2020
Add fast test switch so we can skip big matrices in cross tests
Add GitHub Actions to replace Travis
Add benchmark runner as an "example" binary

This binary makes it easy to run custom benchmarks of bigger matrices
with custom size, layout and threading.
For std and threading we can use the thread_local!() macro, but for
no-std we'll need to use a stack array instead.
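A minimal sketch of that split, assuming a "std" cargo feature and an illustrative buffer size (with_pack_buffer, PACK_BUFFER and BUFFER_LEN are names made up for this example):

```rust
use std::cell::RefCell;

const BUFFER_LEN: usize = 4096; // illustrative size

#[cfg(feature = "std")]
thread_local! {
    // With std, each thread lazily gets its own heap-allocated buffer.
    static PACK_BUFFER: RefCell<Vec<u8>> = RefCell::new(vec![0; BUFFER_LEN]);
}

#[cfg(feature = "std")]
fn with_pack_buffer<R>(f: impl FnOnce(&mut [u8]) -> R) -> R {
    PACK_BUFFER.with(|buf| f(&mut buf.borrow_mut()))
}

#[cfg(not(feature = "std"))]
fn with_pack_buffer<R>(f: impl FnOnce(&mut [u8]) -> R) -> R {
    // No thread_local! without std: use a stack array instead.
    let mut buf = [0u8; BUFFER_LEN];
    f(&mut buf)
}
```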
When we should use threads depends a lot on hardware, so even having
a heuristic is risky, but we'll add one and can improve it later.
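For a feel of what such a heuristic could look like (a deliberately rough sketch, not the crate's actual rule; the constant is an assumed tuning value and max_threads is assumed to be at least 1):

```rust
// Only split across threads when the total work is large enough to
// pay for the thread dispatch overhead.
fn threads_to_use(m: usize, k: usize, n: usize, max_threads: usize) -> usize {
    const MIN_WORK_PER_THREAD: usize = 1 << 16; // assumed tuning constant
    let work = m * k * n; // rough proxy for multiply-add count
    (work / MIN_WORK_PER_THREAD).clamp(1, max_threads)
}
```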
On non-x86, this macro can be unused.
These are used all the time for profiling (and they only affect
development, so they might as well be enabled).
This is a performance fix: using one Lazy/OnceCell instead of two
separate ones saves a little time (just a few ns), which was visible in
the benchmark for (too) small matrices.
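The shape of the fix, sketched with std's OnceLock for self-containedness (the commit itself refers to Lazy/OnceCell; the struct and field names here are hypothetical):

```rust
use std::sync::OnceLock;

// Two values that previously lived in two separate cells, each paying
// its own "already initialized?" check on the hot path.
struct Inited {
    kernel_id: u8,
    mask_align: usize,
}

// One cell: a single initialization check per access instead of two.
static INITED: OnceLock<Inited> = OnceLock::new();

fn inited() -> &'static Inited {
    INITED.get_or_init(|| Inited { kernel_id: 0, mask_align: 32 })
}
```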
Use repr(align(x)) so we don't have to oversize and manually align the
mask buffer. Also use an UnsafeCell to remove the (very small, a few ns)
overhead of borrowing the RefCell. (Its borrowing was pointless anyway,
since we held the raw pointer much longer than the RefCell borrow.)
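Roughly, the pattern looks like this (type name, buffer size and alignment value are illustrative, not the crate's actual definitions):

```rust
use std::cell::UnsafeCell;

// repr(align(32)) makes the whole buffer 32-byte aligned, so there is
// no need to oversize it and round the start pointer up by hand.
#[repr(align(32))]
struct MaskBuffer {
    data: UnsafeCell<[u8; 256]>,
}

impl MaskBuffer {
    // UnsafeCell hands out a raw pointer with no runtime borrow
    // bookkeeping; the caller is responsible for exclusive access,
    // which is exactly the situation the RefCell could not express.
    fn as_mut_ptr(&self) -> *mut u8 {
        self.data.get().cast::<u8>()
    }
}
```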
bluss and others added 30 commits April 21, 2023 22:19
Remove the bias factor for matrix size for aarch64 in computing max number
of threads for a given matrix.
This function is a special case and should never be inlined, so put it out
to the side.
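In Rust terms that is the #[inline(never)] attribute; a minimal sketch with a made-up slow path:

```rust
// Keep the rarely taken special case out of line so the common path
// stays small and inlinable.
#[inline(never)]
fn handle_partial_tile(dst: &mut [f32], src: &[f32]) {
    let n = src.len().min(dst.len());
    dst[..n].copy_from_slice(&src[..n]);
}
```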
Using a reference type (such as a slice) for either pack or a in the
packing function makes rustc emit a noalias annotation for that pointer,
and that helps the optimizer in some cases.

What we want is for the compiler to see that the pointers pack and a,
and pointers derived from them, can never alias; then it has more
freedom to rewrite the operations in the packing loops. The pack buffer
is contiguous, so it's the only choice for passing one of the two
arguments as a slice.

Shown to slightly speed up the layout_f32 benchmark for sgemm, not dgemm, on
M1.

A way to get the same effect without a slice, such as a 'restrict'
keyword, would be good for this crate.
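A sketch of the signature this describes (not the crate's actual code; strides and loop order are simplified):

```rust
// `pack: &mut [T]` is the reference type that earns the noalias
// annotation; `a` stays a raw pointer because it is strided.
unsafe fn pack_into<T: Copy>(
    pack: &mut [T],
    a: *const T,
    rsa: isize, // row stride of a
    csa: isize, // column stride of a
    rows: usize,
    cols: usize,
) {
    let mut idx = 0;
    for j in 0..cols {
        for i in 0..rows {
            // The compiler now knows writes through `pack` cannot
            // clobber reads through `a`, so it can vectorize freely.
            pack[idx] = *a.offset(i as isize * rsa + j as isize * csa);
            idx += 1;
        }
    }
}
```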
Check rust version in build script and conditionally use new kernels.

One could choose to just raise the MSRV specifically for aarch64 to 1.61
here, but being unobtrusive should be the right thing to do. There are
likely more version-sensitive features coming because of the subject
(SIMD, asm).

Autocfg seems like one of the best choices for the version check; it's
already used by the num crates.
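The build.rs probe could look like this (a sketch along the commit's lines; the cfg name has_aarch64_kernels is made up):

```rust
// build.rs
fn main() {
    // autocfg probes the actual rustc that will compile the crate.
    let ac = autocfg::new();
    if ac.probe_rustc_version(1, 61) {
        // Emits cargo:rustc-cfg=has_aarch64_kernels
        autocfg::emit("has_aarch64_kernels");
    }
    autocfg::rerun_path("build.rs");
}
```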
Arm64/AArch64 Neon kernels
For complex, we'll want to use a different packing function.
Add packing into the GemmKernel interface so that kernels can request a
different packing function. The standard packing function is unchanged but
gets its own module in the code.
Use a different pack function for complex microkernels which puts real
and imag parts in separate rows. This enables much better
autovectorization for the fallback kernels.
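The idea, in an illustrative form (the function name and the (f32, f32) complex representation are assumptions for this sketch):

```rust
// Instead of interleaving re,im,re,im in the pack buffer, put all real
// parts in one row and all imaginary parts in the next. The fallback
// kernels then operate on two plain f32 rows, which autovectorizes
// much better than the interleaved layout.
fn pack_complex_column(pack: &mut [f32], column: &[(f32, f32)]) {
    let n = column.len();
    for (i, &(re, im)) in column.iter().enumerate() {
        pack[i] = re;     // row of real parts
        pack[n + i] = im; // row of imaginary parts
    }
}
```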
Custom sizes for Fma and Avx2 are a win for performance, and Avx2 does
better than Fma here, so both can be worthwhile.
Because we detect target features to select the kernel, and the kernel
can select its own packing functions, we can now specialize the packing
functions per target.

As matrices get larger, the packing performance matters much less, but
for small matrix products it contributes more to the runtime.

The default packing also already has a special case for contiguous
matrices, which happens when, in C = A B, A is column major and B is row
major. The specialization in this commit helps the most outside this
special case.
avx2, fma and f32::mul_add together are a success for autovectorization,
while just fma with f32::mul_add is not (!).

For this reason, only call f32::mul_add when we opt in to it.
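The opt-in can be expressed with a cfg on the target features, along these lines (a sketch, not the crate's code; the function name is made up):

```rust
// Use f32::mul_add only when both avx2 and fma are enabled at compile
// time; fma alone pessimized autovectorization.
#[inline(always)]
fn madd(a: f32, b: f32, c: f32) -> f32 {
    #[cfg(all(target_feature = "avx2", target_feature = "fma"))]
    {
        a.mul_add(b, c)
    }
    #[cfg(not(all(target_feature = "avx2", target_feature = "fma")))]
    {
        a * b + c
    }
}
```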
Remove flags that are now used by default by miri.
cgemm was not tested as no-std in CI
For Bazel compatibility. Fixes #78
In the layout benchmarks, which are used to check packing and kernel
sensitivity to memory layout, test some non-contiguous layouts.
A user showed that in certain configurations on macOS, the TLS
allocation can be as little as 8-byte aligned.
Cargo cross does not support this old Rust version anymore; increase the
cross versions.
Completely distrust repr(align()) on macOS and always manually ensure
basic alignment.
We requested 32-byte alignment for s390x, but thread-local storage does
not supply it. Lower the requested alignment to 16 in general to avoid
having this problem pop up on other platforms too.
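The manual fallback amounts to oversizing the buffer and rounding the start pointer up; a sketch under those assumptions (the caller is assumed to have oversized the buffer by at least ALIGN - 1 bytes):

```rust
const ALIGN: usize = 16; // the lowered, generally safe request

// Round the start of `buf` up to the next ALIGN boundary instead of
// trusting repr(align()) in thread-local storage.
fn aligned_start(buf: &mut [u8]) -> *mut u8 {
    let addr = buf.as_mut_ptr() as usize;
    let offset = addr.wrapping_neg() % ALIGN; // bytes to the boundary
    debug_assert!(buf.len() >= offset);
    unsafe { buf.as_mut_ptr().add(offset) }
}
```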