[pull] master from bluss:master #3

Open
wants to merge 117 commits into
base: master

Commits on Dec 6, 2020

  1. add no_std support

    Co-authored-by: Geordon Worley <[email protected]>
    jturner314 and vadixidav committed Dec 6, 2020
    5d7ae23

Commits on Dec 7, 2020

  1. Merge pull request #51 from vadixidav/no_std

    no_std support
    bluss authored Dec 7, 2020
    97921e9
  2. 6243d28
  3. 0.2.4

    bluss committed Dec 7, 2020
    79b57a3

Commits on Dec 28, 2020

  1. TEST: Add github actions to replace travis

    Add fast test switch so we can skip big matrices in cross tests
    bluss committed Dec 28, 2020
    48fdf21
  2. Merge pull request #53 from bluss/gh-actions

    Add github actions to replace travis
    bluss authored Dec 28, 2020
    11ec355
  3. a713a6b
  4. TEST: Add benchmark runner as example binary

    This binary makes it easy to run custom benchmarks of bigger matrices
    with custom size, layout and threading.
    bluss committed Dec 28, 2020
    b81d267
  5. fc30d9d
  6. e926f0a
  7. Merge pull request #54 from bluss/benchmark

    Add benchmark runner as an "example" binary
    bluss authored Dec 28, 2020
    319e49e
  8. a3fd081
  9. cb0ca4b
  10. eb5582b

Commits on Jan 1, 2021

  1. 01e8ba2
  2. 860ec38
  3. 8b39aae
  4. TEST: Test threading feature

    bluss committed Jan 1, 2021
    a761cfc
  5. TEST: Test from 1.42

    bluss committed Jan 1, 2021
    e2040fc
  6. 72d036f
  7. FIX: Only use thread local if have std

    For std and threading we can use the thread_local!() macro, but for
    no-std we'll need to use a stack array instead.
    bluss committed Jan 1, 2021
    55ffa7f
  8. a0343ff
  9. FIX: Add heuristic to avoid using threads for small matrices

    It depends a lot on hardware, when we should use threads, so even having
    a heuristic is risky, but we'll add one and can improve it later.
    bluss committed Jan 1, 2021
    6b3158c
  10. MAINT: Disable warning for unused macro

    On non-x86, this macro can be unused.
    bluss committed Jan 1, 2021
    9879d9e
  11. MAINT: Enable debug info in release/bench mode

    These are used all the time for profiling (and only affects development,
    so they might as well be enabled.)
    bluss committed Jan 1, 2021
    e941ba3
  12. 04264b0
  13. FIX: Put threadpool and nthreads into one combined Lazy

    This is a performance fix, using one Lazy/OnceCell instead of two
    separate ones saves a little time - it's just a few ns - which was
    visible in the benchmark for (too) small matrices.
    bluss committed Jan 1, 2021
    33951b5

Commits on Jan 4, 2021

  1. FIX: Use an UnsafeCell for the kernel mask buffer and align it with repr

    Use repr(align(x)) so we don't have to oversize and manually align the mask
    buffer. Also use an UnsafeCell to remove (the very small, few ns)
    overhead of borrowing the RefCell. (Its borrowing was pointless anyway,
    since we held the raw pointer much longer than RefCell "borrow".)
    bluss committed Jan 4, 2021
    2ddd0ba
  2. FIX: Split LoopThreadConfig::new into one non-generic part

    Read the kernel parameters and pass to a non-generic function - we don't
    need to duplicate it for each kernel implementation.
    bluss committed Jan 4, 2021
    5f9b4cd
  3. 9c58f3c

Commits on Jan 5, 2021

  1. MAINT: Set MSRV to Rust 1.41.1 and update Rust version policy

    Follow the community standards of not having version updates as breaking
    changes. We still want to be careful.
    bluss committed Jan 5, 2021
    612781d
  2. 5e4f356
  3. 4104d26
  4. 2dfe4f0
  5. Merge pull request #52 from bluss/threading

    Add threading support
    bluss authored Jan 5, 2021
    f10f3ae

Commits on Jan 7, 2021

  1. 80c5e2c
  2. a507ff5
  3. b7f46bc

Commits on Jan 8, 2021

  1. 0.3.0

    bluss committed Jan 8, 2021
    bad7c38

Commits on Feb 7, 2021

  1. FIX: Use &[T], not &T for the mask buffer

    This fixes an older experiment - using &T to get a dereferenceable
    unaliased pointer - to instead using &[T], which is the only correct way
    to do it since more than one element is accessed from the pointer.
    bluss committed Feb 7, 2021
    9e4a11f
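
The `&[T]` versus `&T` point in the commit above can be illustrated with a small sketch. The function name is hypothetical and this is not the crate's code; it only shows that a raw pointer derived from the whole slice may be used to read every element, whereas a pointer derived from a reference to a single element should only be used for that element.

```rust
/// Sum all elements through a raw pointer derived from the whole slice.
/// Because the pointer comes from `&[f32]` (not from `&buf[0]`), it is
/// valid for reads of all `buf.len()` elements.
fn sum_via_slice_pointer(buf: &[f32]) -> f32 {
    let ptr: *const f32 = buf.as_ptr();
    let mut sum = 0.0;
    for i in 0..buf.len() {
        // SAFETY: `i < buf.len()` and `ptr` was derived from the full slice.
        sum += unsafe { *ptr.add(i) };
    }
    sum
}
```

Calling `sum_via_slice_pointer(&[1.0, 2.0, 3.0, 4.0])` reads four elements through one pointer, which is the access pattern the commit message describes for the mask buffer.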

Commits on Apr 8, 2021

  1. d5c994e
  2. Merge pull request #56 from bluss/align-manually

    Align mask buffer pointer manually
    bluss authored Apr 8, 2021
    77dd2b1
  3. 0.3.1

    bluss committed Apr 8, 2021
    d0e1c54

Commits on Apr 9, 2021

  1. FIX: Kernel size in assertion

    No impact on functionality.
    bluss committed Apr 9, 2021
    5d20d85

Commits on Nov 7, 2021

  1. kernel: Use pub(crate)

    bluss committed Nov 7, 2021
    7b1979a

Commits on Nov 8, 2021

  1. 1f6a175

Commits on Nov 11, 2021

  1. threading: Tweak the threading factor

    Be a bit more eager to use threads (there is a heuristic vs matrix
    size).
    bluss committed Nov 11, 2021
    8b75092
  2. 5907bf0
  3. complex: Add support for complex

    Use the feature name "cgemm" for cgemm/zgemm methods; start them
    off by adding fallback implementations using 4x2 kernels.
    
    CGemmOptions added as a placeholder - can later include options for
    conjugating either operand (transpose not required - the strides provide
    that freedom already).
    
    Also update the benchmark.
    
    Complex is using the representation [f64; 2] here which is
    representation compatible in memory with C and with num_complex.
    bluss committed Nov 11, 2021
    ab05f63
  4. 55888a2
  5. complex: compute cgemm, zgemm in real parts

    Combine the cgemm, zgemm kernels.  Break out components
    and compute the real/imag parts separately. This has especially large
    gains for FMA.
    
    A separate approach was also tried, using (P + Qi)(R + Si) and
    factoring into three separate multiplications (instead of four),
    and this was a benefit for cgemm, but not as big a benefit as using
    the simpler kernel in this PR, compiled using FMA.
    bluss committed Nov 11, 2021
    64f909f
  6. complex: Compile fallback kernels using fma too

    Compiling them separately gives some gains - especially for C32 on my
    configuration, a doubling in performance.
    bluss committed Nov 11, 2021
    e04e4ba
  7. 7535de1
  8. complex: Use a different flop factor for complex

    The estimate here is gflop = 2 MNK for floats (common estimate)
     and we use gflop = 8 MNK for complex.
    bluss committed Nov 11, 2021
    5fe43c8
  9. complex: Print nicer type name for complex

    Print f32 etc for float and c32 etc for complex
    bluss committed Nov 11, 2021
    6a813a5
  10. benchmark: Make a better argument parser

    Not a good one, but a better one than there was.
    So that one additional argument can be added.
    bluss committed Nov 11, 2021
    f1b04ea
  11. benchmark: Allow passing --extra-column for an extra column in csv

    This is useful when taking data - an extra column with more information
    bluss committed Nov 11, 2021
    d904257
  12. 2e8abde
  13. tests: Combine repeated generic code in tests and benchmark

    include!() was chosen for sharing this code. An internal crate could
    have been used, but it has the downside that the internal crate doesn't
    become part of the source package published on crates.io, which is
    not ideal (just a minor thing, but still).
    
    In the spirit of open source, the source package should contain the
    preferred setup for working with the project, and that includes the
    tests. With this solution, the tests are still buildable from the
    published package.
    bluss committed Nov 11, 2021
    8492b13
  14. test: Move common test_a_kernel function into kernel

    Reduce duplication, now that we have 4 gemm kernel files.
    The ensurefeature does not need to be duplicated.
    bluss committed Nov 11, 2021
    5ce52bd
  15. d5b13c8
  16. test: Use complex scalars to test alpha/beta

    Use complex scalars for alpha and beta to cover this better in the
    testsuite.
    bluss committed Nov 11, 2021
    60fc628
  17. test: Use both A I == A and I B == B in test_a_kernel

    Increases test coverage by testing identity multiply on both sides.
    bluss committed Nov 11, 2021
    9ce5e23
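
As a rough illustration of the "compute in real parts" idea and the 8 MNK flop factor from the commits above, here is a minimal sketch. It uses the `[f64; 2]` representation named in the commit message; the function names are hypothetical and this is not the crate's kernel. Each complex multiply-add costs four real multiplications and four real additions or subtractions, which is where the factor 8 (versus 2 for real floats) comes from.

```rust
/// Complex number stored as [re, im].
type C64 = [f64; 2];

/// acc += a * b, written out in real parts:
/// 4 real multiplications + 4 real additions/subtractions = 8 flops.
fn cmul_add(acc: &mut C64, a: C64, b: C64) {
    acc[0] += a[0] * b[0] - a[1] * b[1];
    acc[1] += a[0] * b[1] + a[1] * b[0];
}

/// Flop estimate for an (M x K) by (K x N) product:
/// 2 MNK for real scalars, 8 MNK for complex scalars.
fn flop_estimate(m: usize, k: usize, n: usize, complex: bool) -> usize {
    let factor = if complex { 8 } else { 2 };
    factor * m * n * k
}
```

Written this way, each line has the shape of a multiply followed by an add, which may help explain why the commit message reports large gains when compiling with FMA enabled.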

Commits on Nov 13, 2021

  1. Merge pull request #58 from bluss/complex

    Add experimental support for complex: cgemm/zgemm
    bluss authored Nov 13, 2021
    ab4a538
  2. Allow tweaking size parameters at compile time

    This introduces compile-time tweak variables like this:
    
    - MATMUL_DGEMM_NC
    - MATMUL_DGEMM_MC
    - MATMUL_DGEMM_KC
    
    etc for each kernel. These allow setting these size parameters at
    compile time - they should ideally be optimized per kernel *and
    microarch*.
    
    Combine these parameters with the benchmark in ./examples/benchmark.rs
    and its csv output option - this allows optimizing performance depending
    on these parameters.
    
    Using DutchGhost's const parsing code from
    
    https://gist.github.com/DutchGhost/d8604a3c796479777fe9f5e25d855cfd
    
    which has been very useful.
    
    Co-authored-by: DutchGhost <[email protected]>
    bluss and DutchGhost committed Nov 13, 2021
    efe70f2
  3. test: Add benchmarking script

    This script can vary over most parameters (threads, nc, kc, mc, types,
    sizes) to create benchmarks.
    bluss committed Nov 13, 2021
    de0075d
  4. test: Run miri

    - Threading is supported in miri, except that num_cpus::get_physical
      needs the -Zmiri-disable-isolation flag
    
    - Unfortunately, Miri is extremely slow at running the full unoptimized
      gemm loop, so any non-trivial matrix sizes are skipped in tests
      (which is a shame, there are more branches to cover for larger sizes).
    bluss committed Nov 13, 2021
    805221d
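
The compile-time tweak variables described above (such as MATMUL_DGEMM_KC) could be read along the following lines. This is a simplified sketch, not DutchGhost's parsing code and not the crate's implementation: `parse_usize`, `DEFAULT_KC` and its value are hypothetical, and the only assumption taken from the commit message is the environment variable name. It relies on the standard `option_env!` macro and on loops and assertions in `const fn`, which need a reasonably recent compiler.

```rust
/// Parse a decimal string into a usize during constant evaluation.
/// A non-digit character or an overflow becomes a compile-time error
/// when this is used to initialize a const. An empty string parses as 0.
const fn parse_usize(s: &str) -> usize {
    let bytes = s.as_bytes();
    let mut value = 0usize;
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        assert!(b >= b'0' && b <= b'9');
        value = value * 10 + (b - b'0') as usize;
        i += 1;
    }
    value
}

/// Hypothetical default, used when the variable is not set at build time.
const DEFAULT_KC: usize = 256;

/// Read the tweak variable from the build environment, if present.
#[allow(dead_code)]
const DGEMM_KC: usize = match option_env!("MATMUL_DGEMM_KC") {
    Some(s) => parse_usize(s),
    None => DEFAULT_KC,
};
```

With a setup like this, a build such as `MATMUL_DGEMM_KC=384 cargo build` would bake the value into the binary, which fits the parameter-exploration workflow the commit message pairs with the benchmark's csv output.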

Commits on Nov 14, 2021

  1. Fix crates.io badge

    atouchet authored and bluss committed Nov 14, 2021
    6dc6a76
  2. constconf: Fix usize parsing on 32-bit arch

    There was an overflow in the pow10 table on 32-bit arches, try to fix
    this.
    bluss committed Nov 14, 2021
    c9447f3
  3. test: Run the benchmark loop script in ci

    Just to make sure it continues to work
    bluss committed Nov 14, 2021
    58623fb

Commits on Nov 15, 2021

  1. test: Factor out common matrix compare

    The new matrix compare looks at equality within a tolerance.
    
    As written, the test suite would end up with only exact integer floats,
    or additions where it loses precision in the expected way.
    
    Changing KC to a smaller value, or testing larger matrices, impacts
    loop 4: we do one update (+=) to C per iteration of loop 4, and this
    can be visible in the precision or rounding error of the result. Thus,
    with varying KC, we must have a relative tolerance for equality.
    bluss committed Nov 15, 2021
    aa0ce95
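
A comparison "within a relative tolerance", as the commit above describes, might look like the following sketch. The function names and the choice of scaling by the larger magnitude are assumptions for illustration; the test suite's actual comparison may differ.

```rust
/// True if `a` and `b` agree within a relative tolerance,
/// scaled by the larger of the two magnitudes.
fn approx_eq(a: f64, b: f64, rel_tol: f64) -> bool {
    let scale = a.abs().max(b.abs());
    (a - b).abs() <= rel_tol * scale
}

/// Element-wise comparison of two matrices stored as flat slices.
fn matrices_approx_eq(x: &[f64], y: &[f64], rel_tol: f64) -> bool {
    x.len() == y.len() && x.iter().zip(y).all(|(&a, &b)| approx_eq(a, b, rel_tol))
}
```

A relative (rather than absolute) tolerance keeps the check meaningful when entries of C grow with K, since the rounding error of accumulating in KC-sized chunks scales with the magnitude of the result.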

Commits on Nov 16, 2021

  1. constconf: Add assertions for MC, KC, NC parameters

    Since these are now compile-time configurable, we need some limits on
    them. We are about 99% sure we have correct results even if we vary
    these parameters wildly. But it should be clear they are configurable
    for parameter exploration and optimization.
    bluss committed Nov 16, 2021
    510b9dc
  2. ecb8630
  3. c2562ae

Commits on Nov 17, 2021

  1. test: Run CI on macos too

    bluss committed Nov 17, 2021
    88a3c91

Commits on Nov 20, 2021

  1. 0.3.2

    bluss committed Nov 20, 2021
    38d8f1a

Commits on May 1, 2022

  1. ptr: Fix Send/Sync impls for future compat warning

    Add a "definitely non-Send/Sync" field to Ptr so that the explicit
    Send/Sync impls become unambiguous to the compiler.
    
    This fixes the warning from rust-lang/rust#93367
    bluss committed May 1, 2022
    4c3950d
  2. Fix Miri error with -Zmiri-tag-raw-pointers

    Before this PR, running `MIRIFLAGS="-Zmiri-tag-raw-pointers" cargo
    miri test` caused Miri to report undefined behavior in the
    `test_dgemm` test. This PR fixes the underlying issue – Miri doesn't
    like us using a reference to an element to access other elements.
    jturner314 authored and bluss committed May 1, 2022
    f8f9d21
  3. Add more checks to MIRIFLAGS for CI

    jturner314 authored and bluss committed May 1, 2022
    c2cb362

Commits on May 2, 2022

  1. Updated comment in kernel_x86_avx

    * Updated comment in function kernel_x86_avx to
    reflect actual procedure where permutations of a and b are generated.
    Also updated possible alternative selections mentioned in the comment
    for the operation:
    `let b_3210 = _mm256_permute2f128_pd(b_1032, b_1032, 0x03);`
    According to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=6418,5227&techs=AVX,AVX2&cats=Swizzle
    and subsequent testing
    alternatives for the selection are equivalent for equal source vectors
    if the second bit in the nibbles is switched.
    
    * Removed redundant comment
    
    Removed two lines of comments, which basically repeated the code below. Also changed a hexadecimal to a binary value for a mask value in a SIMD intrinsic to increase readability.
    Tastaturtaste authored May 2, 2022
    4ef1bd9

Commits on May 3, 2022

  1. ptr: Silence suspicious Send/Sync impls warning

    Revisit the previous fix. It's clear that the Ptr code will not change
    meaning with the coming breaking change, because it's only ever used
    with Ptr<*const T> and Ptr<*mut T> as parameterizations.
    
    For this reason, it's more correct IMO to accept the code meaning
    change, which doesn't change the meaning of the crate, by silencing the
    warning and not making any unnecessary and ugly changes to the struct
    fields.
    bluss committed May 3, 2022
    4f841fa

Commits on Apr 17, 2023

  1. gemm: request only 16-byte alignment on macos

    As a second fix for the TLS on macos alignment problem, request only 16-byte
    alignment on macos, because we don't get more.
    
    There's a new debug assertion in std which trips on accessing otherwise,
    even if it does not really affect us.
    bluss committed Apr 17, 2023
    1433d63

Commits on Apr 20, 2023

  1. 0.3.3

    bluss committed Apr 20, 2023
    1f8d3c7
  2. loopmacros: Use while loop

    The Range::next() method showed up uninlined in the benchmark, and this
    was a measurable (~10%) improvement.
    bluss committed Apr 20, 2023
    15da77c
  3. 5bf5c7c

Commits on Apr 21, 2023

  1. 34e740e
  2. fac92b6
  3. ci: Test aarch64 at its MSRV

    bluss committed Apr 21, 2023
    d6f7a34
  4. threading: Remove bias for aarch64

    Remove the bias factor for matrix size for aarch64 in computing max number
    of threads for a given matrix.
    bluss committed Apr 21, 2023
    fe2b237

Commits on Apr 25, 2023

  1. uninline c_to_beta_c

    This function is a special case and should never be inlined, so put it out
    to the side.
    bluss committed Apr 25, 2023
    c5d1930
  2. gemm: Use slice for packing buffer

    Using a reference type (such as a slice) for either pack or a in the packing
    function makes rustc emit a noalias annotation for that pointer, and that helps
    the optimizer in some cases.
    
    What we want is that the compiler sees that the pointers pack and a and
    pointers derived from them, can never alias, then it has more freedom to
    rewrite the operations in the packing loops.  The pack buffer is contiguous so
    it's the only choice for passing one of the two arguments as a slice.
    
    Shown to slightly speed up the layout_f32 benchmark for sgemm, not dgemm, on
    M1.
    
    A way to get the same effect without a slice would be good for this crate,
    like a 'restrict' keyword.
    bluss committed Apr 25, 2023
    0fea705
  3. Use build script to preserve MSRV on aarch64

    Check rust version in build script and conditionally use new kernels.
    
    One could choose to just raise msrv specifically for aarch64 to 1.61 here,
    but being unobtrusive should be the right thing to do. There are likely more
    version sensitive features coming because of the subject (simd, asm).
    
    Autocfg seems like one of the best choices for version check. It's already
    used by num crates.
    bluss committed Apr 25, 2023
    058d3ef

Commits on Apr 26, 2023

  1. Merge pull request #73 from bluss/arm64

    Arm64/AArch64 Neon kernels
    bluss authored Apr 26, 2023
    35c258d

Commits on Apr 28, 2023

  1. 0.3.4

    bluss committed Apr 28, 2023
    5e0aea7

Commits on Apr 30, 2023

  1. gemm: Allow custom packing functions

    For complex, we'll want to use a different packing function.
    Add packing into the GemmKernel interface so that kernels can request a
    different packing function. The standard packing function is unchanged but
    gets its own module in the code.
    bluss committed Apr 30, 2023
    b85cfa1
  2. complex: pack real and imag separately

    Use a different pack function for complex microkernels which puts real and
    imag parts in separate rows. This enables much better autovectorization for
    the fallback kernels.
    bluss committed Apr 30, 2023
    9896879
  3. cgemm: Setup Avx2 and Fma autovectorized kernels

    Custom sizes for Fma and Avx2 are a win for performance, and Avx2 does
    better than fma here, so both can be worthwhile.
    bluss committed Apr 30, 2023
    6f86fd9
  4. x86-64: Specialize pack function for avx2

    Because we detect target features to select kernel, and the kernel
    can select its own packing functions, we can now specialize the packing
    functions per target.
    
    As matrices get larger, the packing performance matters much less, but
    for small matrix products it contributes more to the runtime.
    
    The default packing also already has a special case for contiguous
    matrices, which happens when in C = A B, A is column major and B is row
    major. The specialization in this commit helps the most outside this
    special case.
    bluss committed Apr 30, 2023
    2c536f2
  5. cgemm: use fma in avx2 kernel

    avx2, fma and f32::mul_add is a success in autovectorization, while just
    fma with f32::mul_add is not (!).
    
    For this reason, only call f32::mul_add when we opt in to this.
    bluss committed Apr 30, 2023
    18bd827
  6. cgemm: Add known-answer test

    bluss committed Apr 30, 2023
    e6d04e1
  7. cgemm: enable fma for neon

    bluss committed Apr 30, 2023
    e84562d
  8. ci: Update miri flags

    Remove flags that are now used by default by miri.
    bluss committed Apr 30, 2023
    84c0baa
  9. 0.3.5

    bluss committed Apr 30, 2023
    258a69f
  10. Fix nostd build

    cgemm was not tested as nostd in ci
    bluss committed Apr 30, 2023
    145f9e8
  11. 0.3.6

    bluss committed Apr 30, 2023
    d88b19e
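
The "pack real and imag separately" commit in this group can be pictured with the sketch below. The function name and the exact layout (all real parts, then all imaginary parts) are assumptions for illustration; the commit message only says that real and imaginary parts go into separate rows.

```rust
/// Complex number stored as [re, im].
type C32 = [f32; 2];

/// Pack interleaved complex values into two rows inside `dst`:
/// the first `src.len()` entries hold the real parts, the rest the
/// imaginary parts. Each row is then a plain run of f32 values.
fn pack_split(src: &[C32], dst: &mut [f32]) {
    let n = src.len();
    assert_eq!(dst.len(), 2 * n);
    let (re, im) = dst.split_at_mut(n);
    for (i, c) in src.iter().enumerate() {
        re[i] = c[0];
        im[i] = c[1];
    }
}
```

After packing, a kernel loop can operate on contiguous runs of real values and contiguous runs of imaginary values, which is the kind of access pattern autovectorizers tend to handle well.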

Commits on May 2, 2023

  1. Remove space from file names

    For Bazel compatibility. Fixes #78
    xander-zitara authored May 2, 2023
    496f08a
  2. 0.3.7

    bluss committed May 2, 2023
    836e5ae

Commits on May 6, 2023

  1. bench: Add non-contiguous layouts

    in layout benchmarks, which are used to check packing and kernel sensitivity
    to memory layout, test some non-contiguous layouts.
    bluss committed May 6, 2023
    d6aef69

Commits on Sep 20, 2023

  1. gemm: request 8-byte buffer alignment on macos

    A user showed that in certain configurations on macos, the TLS allocation can
    even be 8-byte aligned.
    bluss committed Sep 20, 2023
    c6f86de
  2. ci: Drop 1.41 in cross test

    Cargo cross does not support this old rust version anymore, increase
    cross versions.
    bluss committed Sep 20, 2023
    86f4432
  3. gemm: Ensure alignment without repr(align()) on macos

    Completely distrust repr(align()) on macos and always manually ensure basic
    alignment.
    bluss committed Sep 20, 2023
    7753f81
  4. 0.3.8

    bluss committed Sep 20, 2023
    e8caf74
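
Manually ensuring alignment, as opposed to trusting `repr(align())` on the storage, generally means oversizing a buffer and skipping to the first aligned address inside it. The sketch below shows that general technique; `MaskBuffer`, `ALIGN` and `LEN` are hypothetical names and sizes, not the crate's definitions.

```rust
/// Hypothetical alignment and usable length, in bytes.
const ALIGN: usize = 32;
const LEN: usize = 64;

/// Byte storage with ALIGN - 1 bytes of slack, so an aligned window of
/// LEN bytes always fits wherever the storage happens to be placed.
struct MaskBuffer {
    storage: [u8; LEN + ALIGN - 1],
}

impl MaskBuffer {
    fn new() -> Self {
        MaskBuffer { storage: [0; LEN + ALIGN - 1] }
    }

    /// Return the ALIGN-aligned window of LEN bytes inside the storage.
    fn aligned(&mut self) -> &mut [u8] {
        let addr = self.storage.as_ptr() as usize;
        // Bytes to skip to reach the next multiple of ALIGN (0 if aligned).
        let offset = (ALIGN - addr % ALIGN) % ALIGN;
        &mut self.storage[offset..offset + LEN]
    }
}
```

The cost is a few wasted bytes and one address computation per access, in exchange for not depending on the platform's thread-local storage honoring the requested alignment.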

Commits on Mar 9, 2024

  1. Remove obsolete lint directive

    bluss committed Mar 9, 2024
    a0bf1bb
  2. 29f3d1c
  3. c7ab1ac

Commits on Jul 27, 2024

  1. Fix alignment in s390x and cross test

    Requested 32-byte alignment for s390x but thread local storage does not
    supply it. Lower the requested alignment to 16 in general to avoid
    having this problem pop up on other platforms too.
    bluss committed Jul 27, 2024
    77ed4e0
  2. 0.3.9

    bluss committed Jul 27, 2024
    bb3dd0b