[pull] master from bluss:master #3

Co-authored-by: Geordon Worley <[email protected]>

no_std support

Add fast test switch so we can skip big matrices in cross tests

Add github actions to replace travis

This binary makes it easy to run custom benchmarks of bigger matrices with custom size, layout and threading.

…ling

Add benchmark runner as an "example" binary

…ads)

…bled

For std and threading we can use the thread_local!() macro, but for no-std we'll need to use a stack array instead.

When threads pay off depends a lot on the hardware, so even having a heuristic is risky, but we'll add one and can improve it later.
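A minimal sketch of what a size-based thread heuristic can look like. The threshold `WORK_PER_THREAD` and the function name are hypothetical and are not taken from the crate; the real cutoff would have to be tuned per machine.

```rust
/// Hypothetical amount of work (m * n * k) that justifies one more thread.
/// This number is an assumption for illustration, not a measured value.
const WORK_PER_THREAD: usize = 1 << 20;

/// Return how many threads to use for an m x k by k x n product,
/// never more than `available` and never fewer than one.
fn max_threads_for(m: usize, n: usize, k: usize, available: usize) -> usize {
    // Saturate instead of overflowing for very large dimensions.
    let work = m.saturating_mul(n).saturating_mul(k);
    let wanted = work / WORK_PER_THREAD;
    wanted.clamp(1, available.max(1))
}
```

Small products stay single-threaded under this sketch, and large ones are capped by the number of available threads.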

On non-x86, this macro can be unused.

These are used all the time for profiling, and they only affect development, so they might as well be enabled.

This is a performance fix, using one Lazy/OnceCell instead of two separate ones saves a little time - it's just a few ns - which was visible in the benchmark for (too) small matrices.

Use repr(align(x)) so we don't have to oversize and manually align the mask buffer. Also use an UnsafeCell to remove (the very small, few ns) overhead of borrowing the RefCell. (Its borrowing was pointless anyway, since we held the raw pointer much longer than RefCell "borrow".)
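A rough sketch of the idea, assuming a thread-local scratch buffer. The type name `MaskBuffer`, the size 256 and the alignment 32 are illustrative assumptions, not the crate's actual definitions.

```rust
use std::cell::UnsafeCell;

/// Hypothetical buffer size, chosen for illustration only.
const MASK_BUF_SIZE: usize = 256;

/// The alignment is requested on the type, so the buffer does not
/// need to be oversized and aligned by hand.
#[repr(align(32))]
struct MaskBuffer {
    buffer: [u8; MASK_BUF_SIZE],
}

thread_local! {
    // UnsafeCell instead of RefCell: no borrow flag to check or update.
    static MASK_BUF: UnsafeCell<MaskBuffer> =
        UnsafeCell::new(MaskBuffer { buffer: [0; MASK_BUF_SIZE] });
}

/// Run `f` with a raw pointer to this thread's scratch buffer.
/// The caller is responsible for not creating overlapping accesses.
fn with_mask_buffer<R>(f: impl FnOnce(*mut u8) -> R) -> R {
    MASK_BUF.with(|cell| {
        // Take the field address without creating a reference.
        let ptr = unsafe { std::ptr::addr_of_mut!((*cell.get()).buffer) } as *mut u8;
        f(ptr)
    })
}
```

Since the raw pointer outlives any `RefCell` guard anyway, the `UnsafeCell` version states the actual contract more directly.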

Read the kernel parameters and pass to a non-generic function - we don't need to duplicate it for each kernel implementation.
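The pattern can be sketched as a thin generic wrapper around a non-generic worker. The trait below is a trimmed-down, hypothetical stand-in for the kernel trait, and `panel_counts` is an invented example function, not one from the crate.

```rust
/// Hypothetical, trimmed-down stand-in for the crate's kernel trait.
trait GemmKernel {
    const MR: usize;
    const NR: usize;
}

struct Kernel8x4;
impl GemmKernel for Kernel8x4 {
    const MR: usize = 8;
    const NR: usize = 4;
}

struct Kernel4x2;
impl GemmKernel for Kernel4x2 {
    const MR: usize = 4;
    const NR: usize = 2;
}

/// Generic wrapper: instantiated per kernel, but contains almost no code.
fn panel_counts<K: GemmKernel>(m: usize, n: usize) -> (usize, usize) {
    panel_counts_impl(K::MR, K::NR, m, n)
}

/// Non-generic worker: compiled once and shared by every kernel.
fn panel_counts_impl(mr: usize, nr: usize, m: usize, n: usize) -> (usize, usize) {
    // Number of row panels and column panels, rounding up.
    ((m + mr - 1) / mr, (n + nr - 1) / nr)
}
```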

Follow the community standards of not having version updates as breaking changes. We still want to be careful.

Add threading support

This fixes an older experiment, which used &T to get a dereferenceable unaliased pointer, by using &[T] instead. That is the only correct way to do it, since more than one element is accessed through the pointer.

Align mask buffer pointer manually

No impact on functionality.

Be a bit more eager to use threads (there is a heuristic vs matrix size).

Use the feature name "cgemm" for cgemm/zgemm methods; start them off by adding fallback implementations using 4x2 kernels. CGemmOptions is added as a placeholder; it can later include options for conjugating either operand (transpose is not required, since the strides already provide that freedom). Also update the benchmark. Complex uses the representation [f64; 2] here, which is layout-compatible in memory with C and with num_complex.

Combine the cgemm, zgemm kernels. Break out components and compute the real/imag parts separately. This has especially large gains for FMA. A separate approach was also tried, using (P + Qi)(R + Si) and factoring into three separate multiplications (instead of four), and this was a benefit for cgemm, but not as big a benefit as using the simpler kernel in this PR, compiled using FMA.
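For reference, the two ways of forming a complex product mentioned above can be sketched on the [f64; 2] representation. This is a scalar illustration only, not the kernel code; the function names are made up.

```rust
/// Complex number as [re, im].
type C64 = [f64; 2];

/// Four multiplications: real and imaginary parts computed separately.
fn cmul4(a: C64, b: C64) -> C64 {
    [a[0] * b[0] - a[1] * b[1], a[0] * b[1] + a[1] * b[0]]
}

/// Three multiplications: (P + Qi)(R + Si) with shared partial products.
fn cmul3(a: C64, b: C64) -> C64 {
    let [p, q] = a;
    let [r, s] = b;
    let k1 = r * (p + q);
    let k2 = p * (s - r);
    let k3 = q * (r + s);
    // k1 - k3 = pr - qs (real part), k1 + k2 = ps + qr (imaginary part)
    [k1 - k3, k1 + k2]
}
```

The four-multiplication form maps directly onto multiply-add pairs, which is one plausible reason it does well when compiled with FMA.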

Compiling them separately gives some gains - especially for C32 on my configuration, a doubling in performance.

The estimate here is gflop = 2 MNK for floats (common estimate) and we use gflop = 8 MNK for complex.
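As a small sketch, the estimate can be written out like this (the function names are illustrative, not the benchmark's own):

```rust
/// Estimated floating point operations for an (m x k) by (k x n) product:
/// 2 * M * N * K for real types, 8 * M * N * K for complex types.
fn flop_estimate(m: usize, n: usize, k: usize, complex: bool) -> f64 {
    let factor = if complex { 8.0 } else { 2.0 };
    factor * m as f64 * n as f64 * k as f64
}

/// Convert a flop count and an elapsed time in seconds to Gflop/s.
fn gflops(flops: f64, seconds: f64) -> f64 {
    flops / seconds / 1e9
}
```

The factor 8 follows from one complex multiply-add costing four real multiplications and four real additions.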

Print f32 etc for float and c32 etc for complex

Not a good one, but a better one than there was. So that one additional argument can be added.

This is useful when taking data - an extra column with more information

include!() was chosen for sharing this code. An internal crate could have been used, but it has the downside that the internal crate doesn't become part of the source package published on crates.io, which is not ideal (just a minor thing, but still). In the spirit of open source, the source package should contain the preferred setup for working with the project, and that includes the tests. With this solution, the tests are still buildable from the published package.

Reduce duplication, now that we have 4 gemm kernel files. The ensurefeature does not need to be duplicated.

Use complex scalars for alpha and beta to cover this better in the testsuite.

Increases test coverage by testing identity multiply on both sides.

Add experimental support for complex: cgemm/zgemm

This introduces compile-time tweak variables like this:

- MATMUL_DGEMM_NC
- MATMUL_DGEMM_MC
- MATMUL_DGEMM_KC

etc. for each kernel. These allow setting these size parameters at compile time - they should ideally be optimized per kernel *and microarch*. Combine these parameters with the benchmark in ./examples/benchmark.rs and its csv output option - this allows optimizing performance depending on these parameters.

Using DutchGhost's const parsing code from https://gist.github.com/DutchGhost/d8604a3c796479777fe9f5e25d855cfd which has been very useful.

Co-authored-by: DutchGhost <[email protected]>
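A minimal sketch of reading such a variable at compile time. This is not DutchGhost's code or the crate's; the helper names and the default value 256 are assumptions for illustration, and it only handles plain decimal digits.

```rust
/// Parse a decimal string in a const context. Panics (at compile time,
/// when used in a const) on any non-digit byte.
const fn parse_usize(s: &str) -> usize {
    let bytes = s.as_bytes();
    let mut value = 0usize;
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        assert!(b.is_ascii_digit(), "expected a decimal digit");
        value = value * 10 + (b - b'0') as usize;
        i += 1;
    }
    value
}

/// Use the environment value if it was set at compile time, else the default.
const fn env_or(value: Option<&str>, default: usize) -> usize {
    match value {
        Some(s) => parse_usize(s),
        None => default,
    }
}

/// 256 is a placeholder default, not the crate's tuned value.
const DGEMM_KC: usize = env_or(option_env!("MATMUL_DGEMM_KC"), 256);
```

With this shape, something like `MATMUL_DGEMM_KC=128 cargo build` would bake the value into the binary, and an invalid value fails the build rather than the run.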

This script can vary over most parameters (threads, nc, kc, mc, types, sizes) to create benchmarks.

- Threading is supported in miri, except that num_cpus::get_physical needs the -Zmiri-disable-isolation flag.
- Miri is unfortunately extremely slow at running the full unoptimized gemm loop, so any non-trivial matrix sizes are skipped in tests (which is a shame, since there are more branches to cover for larger sizes).

There was an overflow in the pow10 table on 32-bit arches, try to fix this.

Just to make sure it continues to work

The new matrix comparison checks equality within a tolerance. As written, the testsuite would end up with only exact integer floats, or additions that lose precision in the expected way. KC impacts loop 4, and we do one update (+=) to C per iteration of loop 4, so changing KC to a smaller value, or testing larger matrices, can be visible in the precision or rounding error of the result. Thus, with varying KC, we must have a relative tolerance for equality.
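One common way to write such a comparison is sketched below. The function names and the choice of scaling by the larger magnitude are assumptions, not necessarily what the testsuite does.

```rust
/// True if `a` and `b` agree within `rel_tol` relative to the larger magnitude.
fn approx_eq(a: f64, b: f64, rel_tol: f64) -> bool {
    if a == b {
        return true; // covers exact matches, including both being zero
    }
    let scale = a.abs().max(b.abs());
    (a - b).abs() <= rel_tol * scale
}

/// Element-wise comparison of two matrices stored as flat slices.
fn all_close(x: &[f64], y: &[f64], rel_tol: f64) -> bool {
    x.len() == y.len() && x.iter().zip(y).all(|(&a, &b)| approx_eq(a, b, rel_tol))
}
```

A purely relative test is strict near zero, so a real testsuite may also want a small absolute tolerance.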

Since these are now compile-time configurable, we need some limits on them. We are about 99% sure we have correct results even if we vary these parameters wildly. But it should be clear they are configurable for parameter exploration and optimization.

Add a "definitely non-Send/Sync" field to Ptr so that the explicit Send/Sync impls become unambiguous to the compiler. This fixes the warning from rust-lang/rust#93367

Before this PR, running `MIRIFLAGS="-Zmiri-tag-raw-pointers" cargo miri test` caused Miri to report undefined behavior in the `test_dgemm` test. This PR fixes the underlying issue – Miri doesn't like us using a reference to an element to access other elements.

* Updated comment in function kernel_x86_avx to reflect the actual procedure where permutations of a and b are generated. Also updated the possible alternative selections mentioned in the comment for the operation `let b_3210 = _mm256_permute2f128_pd(b_1032, b_1032, 0x03);`. According to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=6418,5227&techs=AVX,AVX2&cats=Swizzle and subsequent testing, alternatives for the selection are equivalent for equal source vectors if the second bit in the nibbles is switched.
* Removed redundant comment: removed two lines of comments, which basically repeated the code below. Also changed a hexadecimal to a binary value for a mask value in a SIMD intrinsic to increase readability.

Revisit the previous fix. It's clear that the Ptr code will not change meaning with the coming breaking change, because it's only ever used with Ptr<*const T> and Ptr<*mut T> as parameterizations. For this reason, it's more correct IMO to accept the code meaning change, which doesn't change the meaning of the crate, by silencing the warning and not making any unnecessary and ugly changes to the struct fields.

As a second fix for the TLS alignment problem on macos, request only 16-byte alignment on macos, because we don't get more. There's a new debug assertion in std which trips on access otherwise, even if it does not really affect us.

The Range::next() method showed up uninlined in the benchmark, and this was a measurable (~10%) improvement.

Remove the bias factor for matrix size for aarch64 in computing max number of threads for a given matrix.

This function is a special case and should never be inlined, so put it out to the side.

Using a reference type (such as a slice) for either pack or a in the packing function makes rustc emit a noalias annotation for that pointer, and that helps the optimizer in some cases. What we want is that the compiler sees that the pointers pack and a and pointers derived from them, can never alias, then it has more freedom to rewrite the operations in the packing loops. The pack buffer is contiguous so it's the only choice for passing one of the two arguments as a slice. Shown to slightly speed up the layout_f32 benchmark for sgemm, not dgemm, on M1. A way to get the same effect without a slice would be good for this crate, like a 'restrict' keyword.
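A simplified sketch of the signature shape described above: the contiguous destination is a slice, the strided source stays a raw pointer. The function name and loop order are illustrative and much simpler than the real packing code.

```rust
/// Copy a `rows` x `cols` block from a strided matrix into a contiguous
/// buffer, column by column.
///
/// Safety: `a` must be valid for reads at every offset
/// `i * rsa + j * csa` with `i < rows` and `j < cols`.
unsafe fn pack_panel(
    pack: &mut [f32], // reference type, so the compiler can assume it is unaliased
    a: *const f32,
    rows: usize,
    cols: usize,
    rsa: isize, // row stride of a
    csa: isize, // column stride of a
) {
    assert!(pack.len() >= rows * cols);
    let mut p = 0;
    for j in 0..cols {
        for i in 0..rows {
            pack[p] = unsafe { *a.offset(i as isize * rsa + j as isize * csa) };
            p += 1;
        }
    }
}
```

Because `pack` is `&mut [f32]`, writes through it cannot overlap reads through `a` as far as the optimizer is concerned, which is the effect the commit is after.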

Check rust version in build script and conditionally use new kernels. One could choose to just raise msrv specifically for aarch64 to 1.61 here, but being unobtrusive should be the right thing to do. There are likely more version sensitive features coming because of the subject (simd, asm). Autocfg seems like one of the best choices for version check. It's already used by num crates.

Arm64/AArch64 Neon kernels

For complex, we'll want to use a different packing function. Add packing into the GemmKernel interface so that kernels can request a different packing function. The standard packing function is unchanged but gets its own module in the code.

Use a different pack function for complex microkernels which puts real and imag parts in separate rows. This enables much better autovectorization for the fallback kernels.
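The layout idea can be illustrated on a single strip of complex values. This is only a sketch of splitting interleaved [re, im] pairs into a real row followed by an imaginary row; the actual panel layout in the crate may differ, and the function name is made up.

```rust
/// Split interleaved complex values into one row of real parts
/// followed by one row of imaginary parts.
fn pack_complex_split(src: &[[f32; 2]], pack: &mut [f32]) {
    let n = src.len();
    assert!(pack.len() >= 2 * n);
    let (re, im) = pack.split_at_mut(n);
    for (i, z) in src.iter().enumerate() {
        re[i] = z[0];
        im[i] = z[1];
    }
}
```

With the parts in separate rows, a kernel loop works on plain runs of f32 instead of shuffling interleaved pairs, which is friendlier to the autovectorizer.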

Custom sizes for Fma and Avx2 are a win for performance, and Avx2 does better than fma here, so both can be worthwhile.

Because we detect target features to select kernel, and the kernel can select its own packing functions, we can now specialize the packing functions per target. As matrices get larger, the packing performance matters much less, but for small matrix products it contributes more to the runtime. The default packing also already has a special case for contiguous matrices, which happens when in C = A B, A is column major and B is row major. The specialization in this commit helps the most outside this special case.

avx2, fma and f32::mul_add is a success in autovectorization, while just fma with f32::mul_add is not (!). For this reason, only call f32::mul_add when we opt in to this.
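One way to make the mul_add call opt-in is a compile-time flag on the inner operation. The const generic parameter below is an assumption made for illustration; the crate may select this differently (for example per kernel type).

```rust
/// Multiply-add that only uses `f32::mul_add` when `FMA` is true.
#[inline(always)]
fn madd<const FMA: bool>(a: f32, b: f32, c: f32) -> f32 {
    if FMA {
        a.mul_add(b, c) // fused: single rounding
    } else {
        a * b + c // separate multiply and add
    }
}

/// Dot product built on `madd`, so the same loop serves both variants.
fn dot<const FMA: bool>(x: &[f32], y: &[f32]) -> f32 {
    x.iter()
        .zip(y)
        .fold(0.0, |acc, (&a, &b)| madd::<FMA>(a, b, acc))
}
```

Without hardware FMA enabled for the target, `mul_add` may fall back to a slow library call, which is a good reason to keep it behind an explicit switch.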

Remove flags that are now used by default by miri.

cgemm was not tested as no_std in CI.

For Bazel compatibility. Fixes #78

In the layout benchmarks, which are used to check packing and kernel sensitivity to memory layout, test some non-contiguous layouts.

A user showed that in certain configurations on macos, the TLS allocation can even be 8-byte aligned.

Cargo cross does not support this old rust version anymore, increase cross versions.

Completely distrust repr(align()) on macos and always manually ensure basic alignment.
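Manual alignment usually means oversizing the buffer and rounding the pointer up. A minimal sketch, with a made-up function name:

```rust
/// Round `ptr` up to the next multiple of `align` (a power of two).
/// The buffer behind `ptr` must be oversized by at least `align - 1` bytes.
fn align_ptr(ptr: *mut u8, align: usize) -> *mut u8 {
    assert!(align.is_power_of_two());
    let addr = ptr as usize;
    // Bytes needed to reach the next aligned address (0 if already aligned).
    let padding = addr.wrapping_neg() & (align - 1);
    ptr.wrapping_add(padding)
}
```

This works whatever alignment the allocation (or thread-local storage) actually delivers, at the cost of up to `align - 1` wasted bytes.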

Requested 32-alignment for s390x but thread local storage does not supply it. Lower requested align to 16 in general to avoid having this problem pop up on other platforms too.

Commits on Jan 8, 2021

0.3.0

bluss committed Jan 8, 2021

Commit bad7c38

Commits on Nov 7, 2021

kernel: Use pub(crate)

bluss committed Nov 7, 2021

Commit 7b1979a

Commits on Nov 8, 2021

FIX: Typo in cfg(feature) in tests

bluss committed Nov 8, 2021

Commit 1f6a175

Commits on Nov 17, 2021

test: Run CI on macos too

bluss committed Nov 17, 2021

Commit 88a3c91

Commits on Nov 20, 2021

0.3.2

bluss committed Nov 20, 2021

Commit 38d8f1a

Commits on Apr 28, 2023

0.3.4

bluss committed Apr 28, 2023

Commit 5e0aea7


Commits on Dec 6, 2020

Commits on Dec 7, 2020

Commits on Dec 28, 2020

Commits on Jan 1, 2021

Commits on Jan 4, 2021

Commits on Jan 5, 2021

Commits on Jan 7, 2021

Commits on Jan 8, 2021

Commits on Feb 7, 2021

Commits on Apr 8, 2021

Commits on Apr 9, 2021

Commits on Nov 7, 2021

Commits on Nov 8, 2021

Commits on Nov 11, 2021

Commits on Nov 13, 2021

Commits on Nov 14, 2021

Commits on Nov 15, 2021

Commits on Nov 16, 2021

Commits on Nov 17, 2021

Commits on Nov 20, 2021

Commits on May 1, 2022

Commits on May 2, 2022

Commits on May 3, 2022

Commits on Apr 17, 2023

Commits on Apr 20, 2023

Commits on Apr 21, 2023

Commits on Apr 25, 2023

Commits on Apr 26, 2023

Commits on Apr 28, 2023

Commits on Apr 30, 2023

Commits on May 2, 2023

Commits on May 6, 2023

Commits on Sep 20, 2023

Commits on Mar 9, 2024

Commits on Jul 27, 2024
