[pull] master from bluss:master #3

Open · wants to merge 117 commits into master (base) from bluss:master

Conversation

@pull pull[bot] commented Dec 7, 2020

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

@pull pull[bot] added the ⤵️ pull and merge-conflict (Resolve conflicts manually) labels Dec 7, 2020
Add fast test switch so we can skip big matrices in cross tests
Add GitHub Actions to replace Travis
Add benchmark runner as an "example" binary

This binary makes it easy to run custom benchmarks of bigger matrices
with custom size, layout and threading.
For std and threading we can use the thread_local!() macro, but for
no-std we'll need to use a stack array instead.
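A minimal sketch of that split, assuming a "std" cargo feature and an illustrative buffer size (with_pack_buffer, PACK_BUFFER and BUFFER_LEN are names made up for this example):

```rust
use std::cell::RefCell;

const BUFFER_LEN: usize = 4096; // illustrative size

#[cfg(feature = "std")]
thread_local! {
    // With std, each thread lazily gets its own heap-allocated buffer.
    static PACK_BUFFER: RefCell<Vec<u8>> = RefCell::new(vec![0; BUFFER_LEN]);
}

#[cfg(feature = "std")]
fn with_pack_buffer<R>(f: impl FnOnce(&mut [u8]) -> R) -> R {
    PACK_BUFFER.with(|buf| f(&mut buf.borrow_mut()))
}

#[cfg(not(feature = "std"))]
fn with_pack_buffer<R>(f: impl FnOnce(&mut [u8]) -> R) -> R {
    // No thread_local! without std: use a stack array instead.
    let mut buf = [0u8; BUFFER_LEN];
    f(&mut buf)
}
```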
When we should use threads depends a lot on hardware, so even having
a heuristic is risky, but we'll add one and can improve it later.
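For a feel of what such a heuristic could look like (a deliberately rough sketch, not the crate's actual rule; the constant is an assumed tuning value and max_threads is assumed to be at least 1):

```rust
// Only split across threads when the total work is large enough to
// pay for the thread dispatch overhead.
fn threads_to_use(m: usize, k: usize, n: usize, max_threads: usize) -> usize {
    const MIN_WORK_PER_THREAD: usize = 1 << 16; // assumed tuning constant
    let work = m * k * n; // rough proxy for multiply-add count
    (work / MIN_WORK_PER_THREAD).clamp(1, max_threads)
}
```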
On non-x86, this macro can be unused.
These are used all the time for profiling (and they only affect
development, so they might as well be enabled).
This is a performance fix: using one Lazy/OnceCell instead of two
separate ones saves a little time (just a few ns), which was visible in
the benchmark for (too) small matrices.
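The shape of the fix, sketched with std's OnceLock for self-containedness (the commit itself refers to Lazy/OnceCell; the struct and field names here are hypothetical):

```rust
use std::sync::OnceLock;

// Two values that previously lived in two separate cells, each paying
// its own "already initialized?" check on the hot path.
struct Inited {
    kernel_id: u8,
    mask_align: usize,
}

// One cell: a single initialization check per access instead of two.
static INITED: OnceLock<Inited> = OnceLock::new();

fn inited() -> &'static Inited {
    INITED.get_or_init(|| Inited { kernel_id: 0, mask_align: 32 })
}
```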
Use repr(align(x)) so we don't have to oversize and manually align the
mask buffer. Also use an UnsafeCell to remove the (very small, a few ns)
overhead of borrowing the RefCell. (Its borrowing was pointless anyway,
since we held the raw pointer much longer than the RefCell borrow.)
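Roughly, the pattern looks like this (type name, buffer size and alignment value are illustrative, not the crate's actual definitions):

```rust
use std::cell::UnsafeCell;

// repr(align(32)) makes the whole buffer 32-byte aligned, so there is
// no need to oversize it and round the start pointer up by hand.
#[repr(align(32))]
struct MaskBuffer {
    data: UnsafeCell<[u8; 256]>,
}

impl MaskBuffer {
    // UnsafeCell hands out a raw pointer with no runtime borrow
    // bookkeeping; the caller is responsible for exclusive access,
    // which is exactly the situation the RefCell could not express.
    fn as_mut_ptr(&self) -> *mut u8 {
        self.data.get().cast::<u8>()
    }
}
```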
bluss and others added 30 commits April 21, 2023 22:19
Remove the bias factor for matrix size for aarch64 in computing max number
of threads for a given matrix.
This function is a special case and should never be inlined, so put it out
to the side.
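In Rust terms that is the #[inline(never)] attribute; a minimal sketch with a made-up slow path:

```rust
// Keep the rarely taken special case out of line so the common path
// stays small and inlinable.
#[inline(never)]
fn handle_partial_tile(dst: &mut [f32], src: &[f32]) {
    let n = src.len().min(dst.len());
    dst[..n].copy_from_slice(&src[..n]);
}
```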
Using a reference type (such as a slice) for either pack or a in the
packing function makes rustc emit a noalias annotation for that pointer,
and that helps the optimizer in some cases.

What we want is for the compiler to see that the pointers pack and a,
and pointers derived from them, can never alias; then it has more
freedom to rewrite the operations in the packing loops. The pack buffer
is contiguous, so it's the only choice for passing one of the two
arguments as a slice.

Shown to slightly speed up the layout_f32 benchmark for sgemm, not dgemm, on
M1.

A way to get the same effect without a slice, such as a 'restrict'
keyword, would be good for this crate.
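A sketch of the signature this describes (not the crate's actual code; strides and loop order are simplified):

```rust
// `pack: &mut [T]` is the reference type that earns the noalias
// annotation; `a` stays a raw pointer because it is strided.
unsafe fn pack_into<T: Copy>(
    pack: &mut [T],
    a: *const T,
    rsa: isize, // row stride of a
    csa: isize, // column stride of a
    rows: usize,
    cols: usize,
) {
    let mut idx = 0;
    for j in 0..cols {
        for i in 0..rows {
            // The compiler now knows writes through `pack` cannot
            // clobber reads through `a`, so it can vectorize freely.
            pack[idx] = *a.offset(i as isize * rsa + j as isize * csa);
            idx += 1;
        }
    }
}
```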
Check rust version in build script and conditionally use new kernels.

One could choose to just raise the MSRV specifically for aarch64 to 1.61
here, but being unobtrusive should be the right thing to do. There are
likely more version-sensitive features coming because of the subject
(SIMD, asm).

Autocfg seems like one of the best choices for the version check; it's
already used by the num crates.
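The build.rs probe could look like this (a sketch along the commit's lines; the cfg name has_aarch64_kernels is made up):

```rust
// build.rs
fn main() {
    // autocfg probes the actual rustc that will compile the crate.
    let ac = autocfg::new();
    if ac.probe_rustc_version(1, 61) {
        // Emits cargo:rustc-cfg=has_aarch64_kernels
        autocfg::emit("has_aarch64_kernels");
    }
    autocfg::rerun_path("build.rs");
}
```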
Arm64/AArch64 Neon kernels
For complex, we'll want to use a different packing function.
Add packing into the GemmKernel interface so that kernels can request a
different packing function. The standard packing function is unchanged but
gets its own module in the code.
Use a different pack function for complex microkernels which puts real
and imag parts in separate rows. This enables much better
autovectorization for the fallback kernels.
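The idea, in an illustrative form (the function name and the (f32, f32) complex representation are assumptions for this sketch):

```rust
// Instead of interleaving re,im,re,im in the pack buffer, put all real
// parts in one row and all imaginary parts in the next. The fallback
// kernels then operate on two plain f32 rows, which autovectorizes
// much better than the interleaved layout.
fn pack_complex_column(pack: &mut [f32], column: &[(f32, f32)]) {
    let n = column.len();
    for (i, &(re, im)) in column.iter().enumerate() {
        pack[i] = re;     // row of real parts
        pack[n + i] = im; // row of imaginary parts
    }
}
```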
Custom sizes for Fma and Avx2 are a win for performance, and Avx2 does
better than Fma here, so both can be worthwhile.
Because we detect target features to select the kernel, and the kernel
can select its own packing functions, we can now specialize the packing
functions per target.

As matrices get larger, the packing performance matters much less, but
for small matrix products it contributes more to the runtime.

The default packing also already has a special case for contiguous
matrices, which happens when, in C = A B, A is column major and B is row
major. The specialization in this commit helps the most outside this
special case.
avx2, fma and f32::mul_add together are a success for autovectorization,
while just fma with f32::mul_add is not (!).

For this reason, only call f32::mul_add when we opt in to it.
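The opt-in can be expressed with a cfg on the target features, along these lines (a sketch, not the crate's code; the function name is made up):

```rust
// Use f32::mul_add only when both avx2 and fma are enabled at compile
// time; fma alone pessimized autovectorization.
#[inline(always)]
fn madd(a: f32, b: f32, c: f32) -> f32 {
    #[cfg(all(target_feature = "avx2", target_feature = "fma"))]
    {
        a.mul_add(b, c)
    }
    #[cfg(not(all(target_feature = "avx2", target_feature = "fma")))]
    {
        a * b + c
    }
}
```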
Remove flags that are now used by default by miri.
cgemm was not tested as no-std in CI
For Bazel compatibility. Fixes #78
In the layout benchmarks, which are used to check packing and kernel
sensitivity to memory layout, test some non-contiguous layouts.
A user showed that in certain configurations on macOS, the TLS
allocation can be as little as 8-byte aligned.
Cargo cross does not support this old Rust version anymore; increase the
cross versions.
Completely distrust repr(align()) on macOS and always manually ensure
basic alignment.
We requested 32-byte alignment for s390x, but thread-local storage does
not supply it. Lower the requested alignment to 16 in general to avoid
having this problem pop up on other platforms too.
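The manual fallback amounts to oversizing the buffer and rounding the start pointer up; a sketch under those assumptions (the caller is assumed to have oversized the buffer by at least ALIGN - 1 bytes):

```rust
const ALIGN: usize = 16; // the lowered, generally safe request

// Round the start of `buf` up to the next ALIGN boundary instead of
// trusting repr(align()) in thread-local storage.
fn aligned_start(buf: &mut [u8]) -> *mut u8 {
    let addr = buf.as_mut_ptr() as usize;
    let offset = addr.wrapping_neg() % ALIGN; // bytes to the boundary
    debug_assert!(buf.len() >= offset);
    unsafe { buf.as_mut_ptr().add(offset) }
}
```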