[pull] master from bluss:master #3
base: master
Commits on Dec 6, 2020
Co-authored-by: Geordon Worley <[email protected]>
Commit 5d7ae23
Commits on Dec 7, 2020
Commit 97921e9
Commit 6243d28
Commit 79b57a3
Commits on Dec 28, 2020
TEST: Add github actions to replace travis
Add fast test switch so we can skip big matrices in cross tests
Commit 48fdf21
Merge pull request #53 from bluss/gh-actions
Add github actions to replace travis
Commit 11ec355
Commit a713a6b
TEST: Add benchmark runner as example binary
This binary makes it easy to run custom benchmarks of bigger matrices with custom size, layout and threading.
Commit b81d267
Commit fc30d9d
Commit e926f0a
Merge pull request #54 from bluss/benchmark
Add benchmark runner as an "example" binary
Commit 319e49e
Commit a3fd081
Commit cb0ca4b
Commit eb5582b
Commits on Jan 1, 2021
Commit 01e8ba2
Commit 860ec38
Commit 8b39aae
Commit a761cfc
Commit e2040fc
Commit 72d036f
FIX: Only use thread local if have std
For std and threading we can use the thread_local!() macro, but for no-std we'll need to use a stack array instead.
Commit 55ffa7f
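The std / no-std split described in this commit can be sketched roughly as follows. This is only an illustration: the names `SCRATCH_LEN`, `with_scratch_tls` and `with_scratch_stack` are hypothetical and not taken from the crate, and a real build would pick one of the two variants with a `cfg` on the std feature instead of exposing both.

```rust
use std::cell::RefCell;

/// Hypothetical scratch-buffer length, chosen only for this sketch.
const SCRATCH_LEN: usize = 32;

thread_local! {
    // One buffer per thread, reused across calls (needs std).
    static SCRATCH: RefCell<[f64; SCRATCH_LEN]> = RefCell::new([0.0; SCRATCH_LEN]);
}

/// std variant: lend out the thread-local buffer for the duration of `f`.
pub fn with_scratch_tls<R>(f: impl FnOnce(&mut [f64]) -> R) -> R {
    SCRATCH.with(|buf| {
        let mut guard = buf.borrow_mut();
        f(&mut guard[..])
    })
}

/// no-std variant: a fresh, zeroed array on the stack for every call.
pub fn with_scratch_stack<R>(f: impl FnOnce(&mut [f64]) -> R) -> R {
    let mut buf = [0.0f64; SCRATCH_LEN];
    f(&mut buf[..])
}
```

The thread-local buffer keeps its contents between calls on the same thread, while the stack buffer starts out zeroed each time.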
Commit a0343ff
FIX: Add heuristic to avoid using threads for small matrices
It depends a lot on hardware, when we should use threads, so even having a heuristic is risky, but we'll add one and can improve it later.
Commit 6b3158c
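The commit does not spell out the heuristic, so the following is only a guess at its general shape: cap the thread count by the amount of multiply-add work in the product. The constant `WORK_PER_THREAD` and the function `threads_for` are assumptions made for this sketch, not the crate's values.

```rust
/// Hypothetical amount of multiply-add work that justifies one more thread.
const WORK_PER_THREAD: usize = 64 * 64 * 64;

/// Pick a thread count for an (m x k) * (k x n) product.
/// Small products stay single-threaded; large ones use up to `max_threads`.
pub fn threads_for(m: usize, n: usize, k: usize, max_threads: usize) -> usize {
    let work = m.saturating_mul(n).saturating_mul(k);
    (work / WORK_PER_THREAD).clamp(1, max_threads.max(1))
}
```

As the commit message warns, the right threshold depends a lot on hardware, so a constant like this would have to be tuned with benchmarks.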
MAINT: Disable warning for unused macro
On non-x86, this macro can be unused.
Commit 9879d9e
MAINT: Enable debug info in release/bench mode
Debug info is used all the time for profiling and only affects development, so it might as well be enabled.
Commit e941ba3
Commit 04264b0
FIX: Put threadpool and nthreads into one combined Lazy
This is a performance fix, using one Lazy/OnceCell instead of two separate ones saves a little time - it's just a few ns - which was visible in the benchmark for (too) small matrices.
Commit 33951b5
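A minimal sketch of the "one combined lazy value" idea, using `std::sync::OnceLock` from the standard library in place of the Lazy/OnceCell the commit mentions. The `ThreadEnv` struct and its `workers` field are hypothetical stand-ins for the real thread pool.

```rust
use std::sync::OnceLock;

/// Stand-in for the combined state (thread count plus pool).
pub struct ThreadEnv {
    pub nthreads: usize,
    /// Hypothetical placeholder for a pool handle: one id per worker.
    pub workers: Vec<usize>,
}

// A single cell: one initialization check per access instead of two.
static THREAD_ENV: OnceLock<ThreadEnv> = OnceLock::new();

pub fn thread_env() -> &'static ThreadEnv {
    THREAD_ENV.get_or_init(|| {
        let nthreads = std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        ThreadEnv { nthreads, workers: (0..nthreads).collect() }
    })
}
```

Every caller gets the same `&'static ThreadEnv`, so the count and the pool are always read together from one cell.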
Commits on Jan 4, 2021
FIX: Use an UnsafeCell for the kernel mask buffer and align it with repr
Use repr(align(x)) so we don't have to oversize and manually align the mask buffer. Also use an UnsafeCell to remove (the very small, few ns) overhead of borrowing the RefCell. (Its borrowing was pointless anyway, since we held the raw pointer much longer than RefCell "borrow".)
Commit 2ddd0ba
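The combination of `repr(align(..))` and `UnsafeCell` might look roughly like this. The type name `MaskBuffer`, the 32-byte alignment and the 256-byte size are assumptions for illustration only.

```rust
use std::cell::UnsafeCell;

/// Hypothetical buffer size for this sketch.
const MASK_BUF_BYTES: usize = 256;

/// The alignment is requested on the type, so no oversizing or manual
/// pointer adjustment is needed for values placed by the compiler.
#[repr(align(32))]
pub struct MaskBuffer {
    buffer: UnsafeCell<[u8; MASK_BUF_BYTES]>,
}

impl MaskBuffer {
    pub const fn new() -> Self {
        MaskBuffer { buffer: UnsafeCell::new([0; MASK_BUF_BYTES]) }
    }

    /// Raw pointer to the start of the buffer; no runtime borrow tracking.
    /// The caller is responsible for not creating conflicting accesses.
    pub fn as_mut_ptr(&self) -> *mut u8 {
        self.buffer.get() as *mut u8
    }
}
```

Unlike `RefCell`, `UnsafeCell` performs no borrow-flag check at runtime, which is where the few nanoseconds mentioned in the commit come from.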
FIX: Split LoopThreadConfig::new into one non-generic part
Read the kernel parameters and pass to a non-generic function - we don't need to duplicate it for each kernel implementation.
Commit 5f9b4cd
Commit 9c58f3c
Commits on Jan 5, 2021
MAINT: Set MSRV to Rust 1.41.1 and update Rust version policy
Follow the community standards of not having version updates as breaking changes. We still want to be careful.
Commit 612781d
Commit 5e4f356
Commit 4104d26
Commit 2dfe4f0
Commit f10f3ae
Commits on Jan 7, 2021
Commit 80c5e2c
Commit a507ff5
Commit b7f46bc
Commits on Jan 8, 2021
Commit bad7c38
Commits on Feb 7, 2021
FIX: Use &[T], not &T for the mask buffer
This fixes an older experiment, which used &T to get a dereferenceable unaliased pointer, to instead use &[T]. That is the only correct way to do it, since more than one element is accessed through the pointer.
Commit 9e4a11f
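To illustrate why the slice matters, here is a small sketch (the function `sum_via_ptr` is hypothetical). A pointer obtained from `&[T]` may be used to reach every element of the slice, whereas a pointer obtained from a `&T` to the first element is only valid for that one element.

```rust
/// Sum all elements through a raw pointer derived from the whole slice.
pub fn sum_via_ptr(buf: &[f64]) -> f64 {
    // `as_ptr` on the slice yields a pointer valid for `buf.len()` elements.
    let p = buf.as_ptr();
    let mut sum = 0.0;
    for i in 0..buf.len() {
        // In bounds: i < buf.len(), and `p` covers the full slice.
        sum += unsafe { *p.add(i) };
    }
    sum
}
```

Writing `let p = &buf[0] as *const f64;` and then reading `p.add(1)` is the pattern this commit moves away from.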
Commits on Apr 8, 2021
Commit d5c994e
Merge pull request #56 from bluss/align-manually
Align mask buffer pointer manually
Commit 77dd2b1
Commit d0e1c54
Commits on Apr 9, 2021
Commit 5d20d85
Commits on Nov 7, 2021
Commit 7b1979a
Commits on Nov 8, 2021
Commit 1f6a175
Commits on Nov 11, 2021
threading: Tweak the threading factor
Be a bit more eager to use threads (there is a heuristic vs matrix size).
Commit 8b75092
Commit 5907bf0
complex: Add support for complex
Use the feature name "cgemm" for cgemm/zgemm methods; start them off by adding fallback implementations using 4x2 kernels. CGemmOptions added as a placeholder - can later include options for conjugating either operand (transpose not required - the strides provide that freedom already). Also update the benchmark. Complex is using the representation [f64; 2] here which is representation compatible in memory with C and with num_complex.
Commit ab05f63
Commit 55888a2
complex: compute cgemm, zgemm in real parts
Combine the cgemm, zgemm kernels. Break out components and compute the real/imag parts separately. This has especially large gains for FMA. A separate approach was also tried, using (P + Qi)(R + Si) and factoring into three separate multiplications (instead of four), and this was a benefit for cgemm, but not as big a benefit as using the simpler kernel in this PR, compiled using FMA.
Commit 64f909f
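The two ways of multiplying (P + Qi)(R + Si) mentioned above can be written out as a small sketch. It uses the `[f64; 2]` representation named in the earlier commit; the function names `mul4` and `mul3` are made up for this example and are not kernel code from the crate.

```rust
/// Complex number as [re, im], matching the [f64; 2] representation.
pub type C64 = [f64; 2];

/// Straightforward form: four multiplications, real and imag parts separate.
pub fn mul4(a: C64, b: C64) -> C64 {
    let [p, q] = a;
    let [r, s] = b;
    [p * r - q * s, p * s + q * r]
}

/// Three-multiplication form:
///   k1 = R(P + Q), k2 = P(S - R), k3 = Q(R + S)
///   re = k1 - k3 = PR - QS,  im = k1 + k2 = PS + QR
pub fn mul3(a: C64, b: C64) -> C64 {
    let [p, q] = a;
    let [r, s] = b;
    let k1 = r * (p + q);
    let k2 = p * (s - r);
    let k3 = q * (r + s);
    [k1 - k3, k1 + k2]
}
```

The three-multiplication form trades one multiplication for extra additions, and the four-multiplication form maps directly onto fused multiply-add, which fits the commit's observation that the simpler kernel compiled with FMA gave the larger benefit.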
complex: Compile fallback kernels using fma too
Compiling them separately gives some gains - especially for C32 on my configuration, a doubling in performance.
Commit e04e4ba
Commit 7535de1
complex: Use a different flop factor for complex
The estimate here is gflop = 2 MNK for floats (common estimate) and we use gflop = 8 MNK for complex.
Commit 5fe43c8
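As a worked example of the two estimates, the sketch below computes the operation count in units of 10^9. The function name `gflop` and its signature are assumptions for illustration. The factor 8 for complex follows from one complex multiply-add costing 4 real multiplications and 4 real additions.

```rust
/// Estimated floating point operations (in units of 1e9) for an
/// (m x k) * (k x n) product: 2*M*N*K for real, 8*M*N*K for complex.
pub fn gflop(m: usize, n: usize, k: usize, complex: bool) -> f64 {
    let factor = if complex { 8.0 } else { 2.0 };
    factor * (m as f64) * (n as f64) * (k as f64) / 1e9
}
```

Dividing this number by the measured time in seconds gives the gflop/s figure a benchmark would report.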
complex: Print nicer type name for complex
Print f32 etc for float and c32 etc for complex
Commit 6a813a5
benchmark: Make a better argument parser
Not a good one, but a better one than there was. So that one additional argument can be added.
Commit f1b04ea
benchmark: Allow passing --extra-column for an extra column in csv
This is useful when taking data - an extra column with more information
Commit d904257
Commit 2e8abde
tests: Combine repeated generic code in tests and benchmark
include!() was chosen for sharing this code. An internal crate could have been used, but it has the downside that the internal crate doesn't become part of the source package published on crates.io, which is not ideal (just a minor thing, but still). In the spirit of open source, the source package should contain the preferred setup for working with the project, and that includes the tests. With this solution, the tests are still buildable from the published package.
Commit 8492b13
test: Move common test_a_kernel function into kernel
Reduce duplication, now that we have 4 gemm kernel files. The ensurefeature does not need to be duplicated.
Commit 5ce52bd
Commit d5b13c8
test: Use complex scalars to test alpha/beta
Use complex scalars for alpha and beta to cover this better in the testsuite.
Commit 60fc628
test: Use both A I == A and I B == B in test_a_kernel
Increases test coverage by testing identity multiply on both sides.
Commit 9ce5e23
Commits on Nov 13, 2021
Merge pull request #58 from bluss/complex
Add experimental support for complex: cgemm/zgemm
Commit ab4a538
Allow tweaking size parameters at compile time
This introduces compile-time tweak variables like this:
- MATMUL_DGEMM_NC
- MATMUL_DGEMM_MC
- MATMUL_DGEMM_KC
etc. for each kernel. These allow setting these size parameters at compile time - they should ideally be optimized per kernel *and microarch*.
Combine these parameters with the benchmark in ./examples/benchmark.rs and its csv output option - this allows optimizing performance depending on these parameters.
Using DutchGhost's const parsing code from https://gist.github.com/DutchGhost/d8604a3c796479777fe9f5e25d855cfd which has been very useful.
Co-authored-by: DutchGhost <[email protected]>
Commit efe70f2
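A rough sketch of how an environment variable can become a compile-time constant. This is not the gist's code: `parse_usize`, `env_or` and the default value 256 are assumptions written for this example, and only the variable name MATMUL_DGEMM_KC comes from the commit message.

```rust
/// Parse a decimal string at compile time; `None` on empty input,
/// non-digit characters, or overflow of `usize` (checked on every step).
pub const fn parse_usize(s: &str) -> Option<usize> {
    let bytes = s.as_bytes();
    if bytes.is_empty() {
        return None;
    }
    let mut value: usize = 0;
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if b < b'0' || b > b'9' {
            return None;
        }
        value = match value.checked_mul(10) {
            Some(v) => match v.checked_add((b - b'0') as usize) {
                Some(v) => v,
                None => return None,
            },
            None => return None,
        };
        i += 1;
    }
    Some(value)
}

/// Use the parsed value if present and valid, otherwise the default.
pub const fn env_or(value: Option<&str>, default: usize) -> usize {
    match value {
        Some(s) => match parse_usize(s) {
            Some(v) => v,
            None => default,
        },
        None => default,
    }
}

/// Read at compile time; 256 is a hypothetical default for this sketch.
pub const DGEMM_KC: usize = env_or(option_env!("MATMUL_DGEMM_KC"), 256);
```

With something like this in place, building with `MATMUL_DGEMM_KC=128 cargo build --release` would bake 128 into the constant. Checked arithmetic also keeps the parser well-behaved on 32-bit targets, where `usize` overflows much sooner.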
This script can vary over most parameters (threads, nc, kc, mc, types, sizes) to create benchmarks.
Commit de0075d
- Threading is supported in miri, except the num_cpus::get_physical needs the -Zmiri-disable-isolation flag
- Miri is extremely slow at running the full unoptimized gemm loop unfortunately, so any non-trivial matrix sizes are skipped in tests (which is a shame, there are more branches to cover for larger sizes).
Commit 805221d
Commits on Nov 14, 2021
Commit 6dc6a76
constconf: Fix usize parsing on 32-bit arch
There was an overflow in the pow10 table on 32-bit arches, try to fix this.
Commit c9447f3
test: Run the benchmark loop script in ci
Just to make sure it continues to work
Commit 58623fb
Commits on Nov 15, 2021
test: Factor out common matrix compare
The new matrix compare looks at equality within a tolerance. As written, the testsuite would end up with only exact integer floats, or additions where it loses precision in the expected way. Changing KC to a smaller value, or testing larger matrices, impacts loop 4, and we do one update (+=) to C per iteration of loop 4, so this can be visible in the precision or rounding error of the result. Thus, with varying KC, we must have a relative tolerance for equality.
Commit aa0ce95
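A relative-tolerance comparison of the kind described here could look like the sketch below. The helper names `approx_eq` and `matrices_approx_eq` and the way the tolerance is scaled are assumptions, not the testsuite's actual code.

```rust
/// True if `a` and `b` differ by at most `rel_tol` times the larger magnitude.
pub fn approx_eq(a: f64, b: f64, rel_tol: f64) -> bool {
    let scale = a.abs().max(b.abs());
    (a - b).abs() <= rel_tol * scale
}

/// Element-wise comparison of two matrices stored as flat slices.
pub fn matrices_approx_eq(x: &[f64], y: &[f64], rel_tol: f64) -> bool {
    x.len() == y.len() && x.iter().zip(y).all(|(&a, &b)| approx_eq(a, b, rel_tol))
}
```

Scaling by the larger magnitude means the allowed error grows with the size of the entries, which matches rounding error accumulating differently when C is updated in a different number of steps.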
Commits on Nov 16, 2021
constconf: Add assertions for MC, KC, NC parameters
Since these are now compile-time configurable, we need some limits on them. We are about 99% sure we have correct results even if we vary these parameters wildly. But it should be clear they are configurable for parameter exploration and optimization.
Commit 510b9dc
Commit ecb8630
Commit c2562ae
Commits on Nov 17, 2021
Commit 88a3c91
Commits on Nov 20, 2021
Commit 38d8f1a
Commits on May 1, 2022
ptr: Fix Send/Sync impls for future compat warning
Add a "definitely non-Send/Sync" field to Ptr so that the explicit Send/Sync impls become unambiguous to the compiler. This fixes the warning from rust-lang/rust#93367
Commit 4c3950d
Fix Miri error with -Zmiri-tag-raw-pointers
Before this PR, running `MIRIFLAGS="-Zmiri-tag-raw-pointers" cargo miri test` caused Miri to report undefined behavior in the `test_dgemm` test. This PR fixes the underlying issue – Miri doesn't like us using a reference to an element to access other elements.
Commit f8f9d21
Commit c2cb362
Commits on May 2, 2022
Updated comment in kernel_x86_avx
* Updated comment in function kernel_x86_avx to reflect the actual procedure where permutations of a and b are generated. Also updated possible alternative selections mentioned in the comment for the operation `let b_3210 = _mm256_permute2f128_pd(b_1032, b_1032, 0x03);`. According to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=6418,5227&techs=AVX,AVX2&cats=Swizzle and subsequent testing, alternatives for the selection are equivalent for equal source vectors if the second bit in the nibbles is switched.
* Removed redundant comment: removed two lines of comments, which basically repeated the code below. Also changed a hexadecimal to a binary value for a mask value in a SIMD intrinsic to increase readability.
Commit 4ef1bd9
Commits on May 3, 2022
ptr: Silence suspicious Send/Sync impls warning
Revisit the previous fix. It's clear that the Ptr code will not change meaning with the coming breaking change, because it's only ever used with Ptr<*const T> and Ptr<*mut T> as parameterizations. For this reason, it's more correct IMO to accept the code meaning change, which doesn't change the meaning of the crate, by silencing the warning and not making any unnecessary and ugly changes to the struct fields.
Commit 4f841fa
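For context, the pattern under discussion is a generic wrapper with Send/Sync implemented only for its raw-pointer parameterizations. The sketch below is a simplified guess at that shape; the tuple-struct layout and the helper `assert_send_sync` are assumptions for illustration.

```rust
/// Wrapper that lets raw pointers cross thread boundaries.
#[derive(Copy, Clone)]
pub struct Ptr<T>(pub T);

// Send/Sync are only provided for the two pointer parameterizations.
// Safety is the responsibility of the code that dereferences the pointers.
unsafe impl<T> Send for Ptr<*const T> {}
unsafe impl<T> Sync for Ptr<*const T> {}
unsafe impl<T> Send for Ptr<*mut T> {}
unsafe impl<T> Sync for Ptr<*mut T> {}

/// Compile-time check helper: only accepts types that are Send + Sync.
pub fn assert_send_sync<T: Send + Sync>() {}
```

Because the wrapper is only ever used as `Ptr<*const T>` or `Ptr<*mut T>`, whether other parameterizations are Send or Sync makes no difference to the crate, which is the argument the commit makes for silencing the warning.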
Commits on Apr 17, 2023
gemm: request only 16-byte alignment on macos
As a second fix for the TLS on macos alignment problem, request only 16-byte alignment on macos, because we don't get more. There's a new debug assertion in std which trips on accessing otherwise, even if it does not really affect us.
Commit 1433d63
Commits on Apr 20, 2023
Commit 1f8d3c7
The Range::next() method showed up uninlined in the benchmark, and this was a measurable (~10%) improvement.
Commit 15da77c
Commit 5bf5c7c
Commits on Apr 21, 2023
Commit 34e740e
Commit fac92b6
Commit d6f7a34
threading: Remove bias for aarch64
Remove the bias factor for matrix size for aarch64 in computing max number of threads for a given matrix.
Commit fe2b237
Commits on Apr 25, 2023
This function is a special case and should never be inlined, so put it out to the side.
Commit c5d1930
gemm: Use slice for packing buffer
Using a reference type (such as a slice) for either pack or a in the packing function makes rustc emit a noalias annotation for that pointer, and that helps the optimizer in some cases. What we want is that the compiler sees that the pointers pack and a and pointers derived from them, can never alias, then it has more freedom to rewrite the operations in the packing loops. The pack buffer is contiguous so it's the only choice for passing one of the two arguments as a slice. Shown to slightly speed up the layout_f32 benchmark for sgemm, not dgemm, on M1. A way to get the same effect without a slice would be good for this crate, like a 'restrict' keyword.
Commit 0fea705
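The signature change described in this commit can be illustrated with a simplified packing routine: the destination is a `&mut [f64]` slice, while the strided source stays a raw pointer. The function `pack_panel` and its parameter names are hypothetical, and the real packing code is more involved.

```rust
/// Copy a `rows x cols` strided matrix starting at `a` into `pack`,
/// column by column. `rsa`/`csa` are the row and column strides of `a`.
///
/// Safety: every `a.offset(i * rsa + j * csa)` for i < rows, j < cols
/// must be valid for reads.
pub unsafe fn pack_panel(
    pack: &mut [f64],
    a: *const f64,
    rows: usize,
    cols: usize,
    rsa: isize,
    csa: isize,
) {
    assert!(pack.len() >= rows * cols);
    let mut p = 0;
    for j in 0..cols {
        for i in 0..rows {
            // `pack` is a unique reference, so the compiler may assume
            // that writes through it never alias reads through `a`.
            pack[p] = unsafe { *a.offset(i as isize * rsa + j as isize * csa) };
            p += 1;
        }
    }
}
```

For example, packing a 2 x 3 row-major matrix uses `rsa = 3` and `csa = 1`, and yields the elements in column-major order.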
Use build script to preserve MSRV on aarch64
Check rust version in build script and conditionally use new kernels. One could choose to just raise msrv specifically for aarch64 to 1.61 here, but being unobtrusive should be the right thing to do. There are likely more version sensitive features coming because of the subject (simd, asm). Autocfg seems like one of the best choices for version check. It's already used by num crates.
Commit 058d3ef
Commits on Apr 26, 2023
Commit 35c258d
Commits on Apr 28, 2023
Commit 5e0aea7
Commits on Apr 30, 2023
gemm: Allow custom packing functions
For complex, we'll want to use a different packing function. Add packing into the GemmKernel interface so that kernels can request a different packing function. The standard packing function is unchanged but gets its own module in the code.
Commit b85cfa1
complex: pack real and imag separately
Use a different pack function for complex microkernels which puts real and imag parts in separate rows. This enables much better autovectorization for the fallback kernels.
Commit 9896879
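One way to picture the split layout: for every group of `mr` complex values, first all real parts are stored, then all imaginary parts. The function `pack_split`, the zero padding of a short final group, and the `Vec` output are assumptions made for this sketch, not the crate's packing code.

```rust
/// Complex number as [re, im].
pub type C64 = [f64; 2];

/// Pack `src` in groups of `mr`: a row of `mr` real parts followed by a
/// row of `mr` imaginary parts. A short final group is zero-padded.
pub fn pack_split(src: &[C64], mr: usize) -> Vec<f64> {
    assert!(mr > 0);
    let groups = (src.len() + mr - 1) / mr;
    let mut out = Vec::with_capacity(2 * mr * groups);
    for chunk in src.chunks(mr) {
        for part in 0..2 {
            // part 0 = real row, part 1 = imaginary row
            out.extend(chunk.iter().map(|z| z[part]));
            out.extend(std::iter::repeat(0.0).take(mr - chunk.len()));
        }
    }
    out
}
```

With real and imaginary parts in separate contiguous rows, the inner kernel loops operate on plain runs of `f64`, which is the kind of access pattern autovectorizers handle well.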
cgemm: Setup Avx2 and Fma autovectorized kernels
Custom sizes for Fma and Avx2 are a win for performance, and Avx2 does better than fma here, so both can be worthwhile.
Commit 6f86fd9
x86-64: Specialize pack function for avx2
Because we detect target features to select kernel, and the kernel can select its own packing functions, we can now specialize the packing functions per target. As matrices get larger, the packing performance matters much less, but for small matrix products it contributes more to the runtime. The default packing also already has a special case for contiguous matrices, which happens when in C = A B, A is column major and B is row major. The specialization in this commit helps the most outside this special case.
Commit 2c536f2
avx2, fma and f32::mul_add is a success in autovectorization, while just fma with f32::mul_add is not (!). For this reason, only call f32::mul_add when we opt in to this.
Commit 18bd827
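The commit does not say how the opt-in is expressed. One possible shape, shown purely as an assumption, is a const generic flag that selects between `f32::mul_add` and a separate multiply and add; the function name `madd` is hypothetical.

```rust
/// Multiply-add with an explicit opt-in for the fused form.
/// `FMA = true` calls `f32::mul_add`; `false` uses a plain `a * b + c`.
#[inline(always)]
pub fn madd<const FMA: bool>(a: f32, b: f32, c: f32) -> f32 {
    if FMA {
        a.mul_add(b, c)
    } else {
        a * b + c
    }
}
```

A kernel compiled for avx2 plus fma would then call `madd::<true>`, while other kernels keep `madd::<false>`. The branch is on a constant, so it is resolved at compile time.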
Commit e6d04e1
Commit e84562d
Remove flags that are now used by default by miri.
Commit 84c0baa
Commit 258a69f
Commit 145f9e8
Commit d88b19e
Commits on May 2, 2023
Commit 496f08a
Commit 836e5ae
Commits on May 6, 2023
bench: Add non-contiguous layouts
In layout benchmarks, which are used to check packing and kernel sensitivity to memory layout, test some non-contiguous layouts.
Commit d6aef69
Commits on Sep 20, 2023
gemm: request 8-byte buffer alignment on macos
A user showed that in certain configurations on macos, the TLS allocation can even be 8-byte aligned.
Commit c6f86de
Cargo cross does not support this old rust version anymore, increase cross versions.
Commit 86f4432
gemm: Ensure alignment without repr(align()) on macos
Completely distrust repr(align()) on macos and always manually ensure basic alignment.
Commit 7753f81
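Manually ensuring alignment usually means oversizing the buffer and rounding the start pointer up to the next multiple of the alignment. The helper below is a generic sketch of that technique; the name `align_up` is hypothetical and the crate's own code may differ.

```rust
/// Round `ptr` up to the next address that is a multiple of `align`.
/// `align` must be a power of two; the caller must have reserved at
/// least `align - 1` extra bytes so the result stays inside the buffer.
pub fn align_up(ptr: *mut u8, align: usize) -> *mut u8 {
    debug_assert!(align.is_power_of_two());
    let addr = ptr as usize;
    let offset = (align - addr % align) % align;
    ptr.wrapping_add(offset)
}
```

For a 64-byte payload with 32-byte alignment, a buffer of 64 + 31 bytes is enough regardless of where the allocation actually starts.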
Commit e8caf74
Commits on Mar 9, 2024
Commit a0bf1bb
Commit 29f3d1c
Commit c7ab1ac
Commits on Jul 27, 2024
Fix alignment in s390x and cross test
Requested 32-alignment for s390x but thread local storage does not supply it. Lower requested align to 16 in general to avoid having this problem pop up on other platforms too.
Commit 77ed4e0
Commit bb3dd0b