
Speed up VectorUtil float scalar methods, unroll properly, use FMA where possible #12737

Merged: 34 commits into apache:main, Nov 4, 2023

Conversation

@rmuir (Member) commented Oct 31, 2023

Apply the same logic to the float scalar functions as we have for the vector functions. They are really doing the same instructions after all, just at a different width, so they shouldn't be so crazy different...

The scalar functions currently have various ad-hoc unrolling that doesn't keep the CPU's FPUs properly busy, and data dependencies that clog everything up and prevent parallelization: this fixes that.

This gives e.g. a 2x speedup to cosine on both x86 and ARM, but all the functions get some sort of speedup on both architectures. FMA is used on x86 where available, to really keep the CPU busy, same as in the vector case. It is not used on ARM.
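
As a rough sketch of the idea (an illustration, not the exact merged code): give each lane of the unroll its own accumulator so no iteration waits on the previous sum, and let Math.fma map to a hardware FMA instruction where available.

static float dotProduct(float[] a, float[] b) {
  float acc1 = 0, acc2 = 0, acc3 = 0, acc4 = 0;
  int i = 0;
  int upperBound = a.length & ~(4 - 1); // largest multiple of 4 <= length
  for (; i < upperBound; i += 4) {
    // four independent dependency chains keep the FPU pipelines full
    acc1 = Math.fma(a[i], b[i], acc1);
    acc2 = Math.fma(a[i + 1], b[i + 1], acc2);
    acc3 = Math.fma(a[i + 2], b[i + 2], acc3);
    acc4 = Math.fma(a[i + 3], b[i + 3], acc4);
  }
  for (; i < a.length; i++) { // leftover tail
    acc1 = Math.fma(a[i], b[i], acc1);
  }
  return acc1 + acc2 + acc3 + acc4;
}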

main (skylake):

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.634 ± 0.004  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   1.979 ± 0.034  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.433 ± 0.017  ops/us

patch (skylake): Good speedup across the board, e.g. 2x for cosine

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   1.328 ± 0.020  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   2.765 ± 0.032  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   2.457 ± 0.038  ops/us

patch (skylake, -XX:-UseFMA): Still decent speedup even without using FMA, just not as much

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.861 ± 0.007  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   2.223 ± 0.034  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.873 ± 0.020  ops/us

Main (arm):

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   1.074 ± 0.009  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   3.598 ± 0.009  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   3.191 ± 0.006  ops/us

Patch (arm): Good speedups (2x for cosine). FMA is not used.

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   2.061 ± 0.003  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   4.272 ± 0.003  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   3.766 ± 0.013  ops/us

@rmuir (Member, Author) commented Oct 31, 2023

With all the data dependencies removed, I also gave at least one stab at trying to trick the compiler into using packed instructions instead of single floats... it would be awesome if we could just delete our vector code for float! I didn't succeed, but I think it deserves a few more tries.

@rmuir (Member, Author) commented Oct 31, 2023

E.g. for the dot product case with this patch, despite there being no data dependencies, the compiler literally does 4 VFMADD*SS instructions in the loop with different xmm registers, instead of just doing 1 VFMADD*PS with one xmm register containing the 4 floats. Seems dumb to me.

float acc2 = 0;
float acc3 = 0;
float acc4 = 0;
int upperBound = a.length & ~(4 - 1); // round the length down to a multiple of 4
@rmuir (Member, Author) commented on this hunk:

This logic is from the javadocs of the Vector API's VectorSpecies.loopBound, where the width is a power of 2. I used it in these functions for consistency, and because I assume it means the compiler will do a good job. We could maybe add a static method for other places that need this (e.g. StringHelper's murmurhash) as a followup? I'm guessing any other places do it ad hoc like what was here before. I wanted to keep this PR minimally invasive though.
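
For illustration, the masking trick as a standalone helper (a sketch; loopBound is a hypothetical name echoing the Vector API method):

static int loopBound(int length, int width) {
  assert Integer.bitCount(width) == 1 : "width must be a power of two";
  // clearing the low bits rounds length down to a multiple of width
  return length & ~(width - 1);
}
// e.g. loopBound(1027, 4) == 1024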

@rmuir (Member, Author) commented Oct 31, 2023

I tried naively writing the logic like this with a couple of values of N (8, 16, 32, etc.), with FMA both off and on, to see if I could baby this compiler into vectorizing: nope, nothing. I don't think autovectorization works except for BitSet :)

// loop one, where's my vectorization? no floating point excuses here. tried with fma too.
// (i, res, a, b are declared earlier in the surrounding method)
float[] acc = new float[32];
int upperBound = a.length & ~(32 - 1);
for (; i < upperBound; i += 32) {
  for (int j = 0; j < acc.length; j++) {
    acc[j] = a[i+j] * b[i+j] + acc[j];
  }
}
// second reduction loop
for (int j = 0; j < acc.length; j++) {
  res += acc[j];
}

@rmuir (Member, Author) commented Oct 31, 2023

Also, it was previously just confusing to see things like vector benchmark results with 128-bit ARM vectors going 8x faster than scalar 32-bit floats, which makes no logical sense: a 128-bit vector only holds four 32-bit floats, so ~4x is the theoretical ceiling. E.g. with this change, NEON-128 dot product is 3.95x faster than the 32-bit float dot product.

@uschindler (Contributor) left a comment:

Looks fine. Thanks for moving the detection method to the Constants class.

I was thinking about this too, but I was afraid that a non-local (and public) constant is not treated the same by HotSpot.

@uschindler (Contributor) commented Oct 31, 2023

One thing: we should benchmark this in Lucene 9.x with Java 11, too. I want to make sure we get the same or a similar speedup there, and that there is no regression.

@uschindler (Contributor) commented Oct 31, 2023

Did you also benchmark this PR with Java 17? Your benchmark output does not say which version it uses.

@dweiss (Contributor) commented Oct 31, 2023

Nice. I checked with an AMD Ryzen Threadripper 3970X. Note it's actually slightly faster when not using FMA...

JDK 17, patch:

java -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024

VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  1.611 ± 0.009  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.324 ± 0.005  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  3.305 ± 0.053  ops/us

java -XX:-UseFMA -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024

VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  2.028 ± 0.012  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  4.045 ± 0.011  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  3.254 ± 0.043  ops/us

JDK 17, main (85f5d3bb0bf):

java -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024

VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  1.192 ± 0.006  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.837 ± 0.005  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  2.523 ± 0.014  ops/us

java -XX:-UseFMA -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024

VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  1.196 ± 0.006  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.841 ± 0.015  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  2.525 ± 0.010  ops/us

@uschindler (Contributor):

@dweiss Are you sure -XX:-UseFMA was applied to the forked child JVMs of the benchmark? As far as I remember, you have to pass JVM options as a separate parameter for them to be applied to the children.

@dweiss (Contributor) commented Oct 31, 2023

JMH states that:

  -jvmArgs <string>           Use given JVM arguments. Most options are inherited
                              from the host VM options, [...]

and I've always relied on that. Here's an explicit run though (on main, JDK 17); it is consistent with the previous output:

VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  1.222 ±  0.001  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.837 ±  0.007  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  2.519 ±  0.009  ops/us

@dweiss (Contributor) commented Oct 31, 2023

Out of curiosity I checked (by sysouting the status of the constant): -XX:+/-UseFMA is passed to the forked JVMs, so these JMH incantations are equivalent:

java -XX:-UseFMA -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024
java -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024 -jvmArgs "-XX:-UseFMA"

@dweiss (Contributor) commented Oct 31, 2023

Here's how JMH copies the runner VM's input arguments:

        jvmArgs.addAll(options.getJvmArgs()
                .orElseGet(() -> benchmark.getJvmArgs()
                .orElseGet(() -> ManagementFactory.getRuntimeMXBean().getInputArguments())
        ));

You're right that it's kind of vague what that bean actually returns. In this case though, it works.

@uschindler (Contributor):

Thanks, great. By the way, if you run the incubator impl it prints the FMA status as a log message; just the scalar one doesn't.

@rmuir (Member, Author) commented Oct 31, 2023

Nice. I checked with AMD Ryzen Threadripper 3970X. Note it's actually slightly faster when not using FMA...

@dweiss thanks for testing; I expect this on AMD. But "slightly" is very, very slight; I think it is best to still use FMA here. https://www.reddit.com/r/cpp/comments/xtoj93/comment/iqrmmuo/

@uschindler (Contributor):

Yeah. Please keep it, as it gives more correct results.

@rmuir (Member, Author) commented Oct 31, 2023

Hmm, I think I read @dweiss's results in the wrong order... it seems like a fairly big difference? We actually regress scalar dot product for his CPU here. And I assume the vector case behaves the same way?

If we want to avoid it on Zen 2 I am fine with that, if you tell me how to detect it. But I assume on Zen 3 things are fine; FMA has lower latency there.

@uschindler (Contributor):

I will check on my Zen on Policeman Jenkins later.

You can't easily get the Zen version or detect Ryzen CPUs. Please do not add cpuinfo parsing, as this won't work on other platforms like Windows.

I am fine with having a slowdown for more correctness.

@rmuir (Member, Author) commented Oct 31, 2023

You can't get zen version or detect Ryzen CPUs easily. Please do not add cpuinfo parsing as this won't work on alternate platforms like Windows.

We can just retrieve another flag to see it. OpenJDK already detects stuff like this and then sets flags accordingly, so we can infer it.

https://github.com/openjdk/jdk/blob/e05cafda78a37dbeb2df2edd791be19d22edaece/src/hotspot/cpu/x86/vm_version_x86.cpp#L1463

I will make OpenJDK "give up the goods" despite it trying to hide them :)
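
A minimal sketch of that inference, assuming the standard jdk.management API ("UseFMA" below is just an illustrative flag that HotSpot only enables when the CPU supports it; the exact flags consulted in the PR are not shown here):

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

static boolean isVmFlagEnabled(String flag) {
  HotSpotDiagnosticMXBean bean =
      ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
  // HotSpot sets these flags from its own CPU detection at startup, so
  // reading them back infers CPU capabilities without parsing cpuinfo.
  return Boolean.parseBoolean(bean.getVMOption(flag).getValue());
}
// e.g. isVmFlagEnabled("UseFMA")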

@rmuir (Member, Author) commented Oct 31, 2023

We can also detect newer Zen (e.g. Zen 4) by it having 512-bit vectors, so we just need to detect AMD vs Intel.

Anyway, I'm actually fine with removing FMA from this PR completely and revisiting its usage for vectors on AMD CPUs. I'm not trying to make the code go slower.

@uschindler (Contributor):

Please keep FMA for now and allow us to benchmark.

@rmuir (Member, Author) commented Oct 31, 2023

This should solve @dweiss's problem: f2be84f

It should also improve the speed of the vectorized case on AMD in the same way.

@uschindler (Contributor):

This should solve @dweiss problem: f2be84f

It should also improve speed of vectorized case on AMD in the same way.

So basically this detects SSE4a, which is AMD-only.

@mikemccand (Member) commented Oct 31, 2023

Thanks @rmuir!

Results from the nightly benchy box (Ryzen Threadripper 3990X, 64 cores / 128 threads) -- whoops, this is JDK 17. I'm running again with JDK 20:

main:

Benchmark                                  (size)   Mode  Cnt  Score    Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  1.206 ±  0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt    5  1.204 ±  0.001  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.851 ±  0.002  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  3.851 ±  0.003  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  2.516 ±  0.092  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  2.541 ±  0.001  ops/us

PR:

Benchmark                                  (size)   Mode  Cnt  Score    Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  2.038 ±  0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt    5  2.030 ±  0.005  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.999 ±  0.021  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  4.054 ±  0.003  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  3.181 ±  0.740  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  3.241 ±  0.031  ops/us

Very nice speedups!

@rmuir (Member, Author) commented Nov 3, 2023

AMD EPYC 9R14:
INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled

main:

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.841 ± 0.002  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.540 ± 0.001  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.539 ± 0.010  ops/us

patch:

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.764 ± 0.002  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.584 ± 0.002  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.562 ± 0.004  ops/us

@uschindler (Contributor):

What are the results for the Vector API on this CPU?

@rmuir (Member, Author) commented Nov 3, 2023

Vector results for this AMD CPU are unchanged by this PR.

Float-relevant performance info from avx-turbo: this CPU doesn't downclock, but a 512-bit FMA has half the instruction throughput of a 256-bit FMA, so I did some experiments...

Cores | ID                  | Description                       | OVRLP3 |  Mops | A/M-ratio | A/M-MHz | M/tsc-ratio
1     | avx128_fma_t        | 128-bit parallel DP FMAs          |  1.000 |  7402 |      1.42 |    3700 |        1.00
1     | avx256_fma_t        | 256-bit parallel DP FMAs          |  1.000 |  7402 |      1.42 |    3700 |        1.00
1     | avx512_fma_t        | 512-bit parallel DP FMAs          |  1.000 |  3700 |      1.42 |    3700 |        1.00

Float:
INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  13.397 ± 0.205  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.226 ± 0.434  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.147 ± 0.394  ops/us

Float (avoiding AVX-512 entirely by passing -XX:MaxVectorSize=32)
INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  11.234 ± 0.041  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  17.045 ± 0.436  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.876 ± 0.351  ops/us

Binary-relevant performance info from avxturbo:

Cores | ID                  | Description                       | OVRLP3 |  Mops | A/M-ratio | A/M-MHz | M/tsc-ratio
1     | avx128_imul         | 128-bit integer muls (vpmuldq)    |  1.000 |  1233 |      1.42 |    3700 |        1.00
1     | avx256_imul         | 256-bit integer muls (vpmuldq)    |  1.000 |  1233 |      1.42 |    3700 |        1.00
1     | avx512_imul         | 512-bit integer muls (vpmuldq)    |  1.000 |  1233 |      1.42 |    3700 |        1.00

Binary:
INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineVector        1024  thrpt   15   8.769 ± 0.083  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15  22.362 ± 0.054  ops/us
VectorUtilBenchmark.binarySquareVector        1024  thrpt   15  18.080 ± 0.171  ops/us

Binary (512-bit vectors but disabling Intel-specific downclock-protection / doing 32-bit vpmul)
INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineVector        1024  thrpt   15  10.669 ± 0.242  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15  21.148 ± 0.087  ops/us
VectorUtilBenchmark.binarySquareVector        1024  thrpt   15  18.048 ± 0.142  ops/us

Binary (avoiding AVX-512 entirely by passing -XX:MaxVectorSize=32)
INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineVector        1024  thrpt   15   8.773 ± 0.006  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15  17.484 ± 0.022  ops/us
VectorUtilBenchmark.binarySquareVector        1024  thrpt   15  14.930 ± 0.018  ops/us

@rmuir (Member, Author) commented Nov 3, 2023

So you can see the difference in approach. Personally I prefer how this AMD AVX-512 works: for some operations the 512-bit variant just isn't any faster than the 256-bit variant, versus Intel's approach of slowing down other things on the computer :)

@rmuir (Member, Author) commented Nov 4, 2023

I tweaked the FMA logic for AMD CPUs to only avoid the high-latency scalar FMA where necessary. Should appease the Germans wanting that extra ULP or whatever.

The sysprops default to "auto", so you can override however you want, without fear of involving BigDecimal :)

I can test the Intel and ARM families in the same way and try to tighten it up tomorrow.
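
A sketch of how such a three-state "auto"/"true"/"false" property could be read (the property name below is made up for illustration, not necessarily the one in the PR):

static boolean useScalarFma(boolean heuristicDefault) {
  // hypothetical property name; "auto" defers to the CPU-based heuristic
  String value = System.getProperty("lucene.useScalarFMA", "auto");
  return switch (value) {
    case "true" -> true;   // force FMA on
    case "false" -> false; // force FMA off
    default -> heuristicDefault;
  };
}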

AMD Zen4: EPYC 9R14 (family 0x19)

Main:
Benchmark                                  (size)   Mode  Cnt   Score    Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.842 ±  0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  13.497 ±  0.171  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.540 ±  0.002  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.441 ±  0.424  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.540 ±  0.008  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.655 ±  0.575  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.763 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  13.477 ± 0.168  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.583 ± 0.003  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.438 ± 0.493  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.560 ± 0.009  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  15.778 ± 0.114  ops/us

AMD Zen3: EPYC 7R13 (family 0x19)

Main:
Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar         1024  thrpt   15   0.982 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector         1024  thrpt   75  10.476 ± 0.026  ops/us
VectorUtilBenchmark.floatDotProductScalar     1024  thrpt   15   3.246 ± 0.015  ops/us
VectorUtilBenchmark.floatDotProductVector     1024  thrpt   75  16.959 ± 0.480  ops/us
VectorUtilBenchmark.floatSquareScalar         1024  thrpt   15   2.298 ± 0.010  ops/us
VectorUtilBenchmark.floatSquareVector         1024  thrpt   75  16.342 ± 0.508  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.344 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  10.445 ± 0.048  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.405 ± 0.006  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.486 ± 0.374  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.995 ± 0.002  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.374 ± 0.462  ops/us

AMD Zen2: EPYC 7R32 (family 0x17)

Main:
Benchmark                                   (size)   Mode  Cnt   Score    Error   Units
VectorUtilBenchmark.floatCosineScalar         1024  thrpt   15   0.922 ±  0.005  ops/us
VectorUtilBenchmark.floatCosineVector         1024  thrpt   75   8.519 ±  0.020  ops/us
VectorUtilBenchmark.floatDotProductScalar     1024  thrpt   15   2.968 ±  0.020  ops/us
VectorUtilBenchmark.floatDotProductVector     1024  thrpt   75  15.950 ±  0.486  ops/us
VectorUtilBenchmark.floatSquareScalar         1024  thrpt   15   2.015 ±  0.012  ops/us
VectorUtilBenchmark.floatSquareVector         1024  thrpt   75  15.894 ±  0.331  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.200 ± 0.005  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.520 ± 0.018  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.114 ± 0.021  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  15.671 ± 0.439  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.490 ± 0.030  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  15.189 ± 0.170  ops/us

@mikemccand (Member):

Thank you @rmuir for doing all the crazy hard work to decode the actual capabilities of the bare metal hiding underneath the layers of abstraction of the Panama Vector API! I love the Constants approach.

versus intel's approach of slowing down other things on the computer :)

!!

@uschindler (Contributor) left a comment:

Hi @rmuir,
I polished the documentation of VectorUtil a bit, so people know the knobs for enabling the incubator module and tuning FMA. This looks fine to me now.

The good thing about the three-state sysprop is that you can easily run benchmarks on newer CPUs without modifying the code.

@uschindler (Contributor) commented Nov 4, 2023

@rmuir: It would be nice if you could follow the community standard and merge this long PR with the GitHub UI and squash it - thanks. I can do it for you if you like.

@mikemccand (Member):

I tested on my now-ancient Zen 2 beast3 (nightly benchmark) box (AMD Ryzen Threadripper 3990X 64-Core Processor), using JDK 21 (openjdk full version "21+35"), with the command line: ./gradlew clean; ./gradlew -p lucene/benchmark-jmh assemble; java -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar float -p size=1024.

[An aside: strangely, to test a PR I normally download and apply the .diff or .patch using patch -p1 < X.diff/patch, but for this PR, patch reported non-trivial (to me!) conflicts. So instead I ran GitHub's suggested command-line steps for merging, and got a cleanly applied version of this PR to run the benchy.]

main:

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.176 ± 0.011  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  11.015 ± 0.029  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.870 ± 0.011  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  22.879 ± 0.407  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.604 ± 0.023  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  21.293 ± 0.289  ops/us

PR:

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.553 ± 0.009  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  10.995 ± 0.025  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   4.051 ± 0.029  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  22.887 ± 0.396  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.254 ± 0.008  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  21.238 ± 0.420  ops/us

@uschindler (Contributor) commented Nov 4, 2023

[An aside: strangely, to test the PR, I normally download and apply the .diff or .patch using patch -p1 < X.diff/patch, but for this PR there are non-trivial (to me!) conflicts reported by patch. So instead I ran the suggested github command-line steps for merging, and got a clean applied version of this PR to run the benchy.]

GitHub has had some bugs since yesterday (PRs also 404 for some time). Actually the .patch is completely unusable, as it partly contains already-merged information; the .diff looks fine to me.

My recommendation: merge the PR into your branch using the command line provided by GitHub or, much easier, add Robert's repository as an "rmuir" upstream remote. I already have the common repos from Robert and Chris in my git config, so it's simple to check them out and work on them directly.

@rmuir (Member, Author) commented Nov 4, 2023

@rmuir: It would be nice if you could follow the community standard and merge this long PR with Github UI and squash it - thanks. I can do it for you if you like.

I am not done here yet; I want to benchmark and try to tighten the Intel and ARM models first too. At least do the best I can to get the best performance out of all of them.

Whether to squash or not is my decision. Just like maybe the community standard is IntelliJ, I use vim.

@rmuir (Member, Author) commented Nov 4, 2023

Benchmarks for the Intel CPUs. There is one place I'd fix if we could detect Sapphire Rapids and avoid scalar FMA there, but I have no way to detect it based on what new features it has / what OpenJDK exposes at the moment. Otherwise performance is good.

Sapphire Rapids:

Main:
Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar         1024  thrpt   15   0.871 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector         1024  thrpt   75  13.907 ± 0.266  ops/us
VectorUtilBenchmark.floatDotProductScalar     1024  thrpt   15   4.275 ± 0.023  ops/us
VectorUtilBenchmark.floatDotProductVector     1024  thrpt   75  22.218 ± 0.759  ops/us
VectorUtilBenchmark.floatSquareScalar         1024  thrpt   15   2.819 ± 0.004  ops/us
VectorUtilBenchmark.floatSquareVector         1024  thrpt   75  20.243 ± 0.352  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.650 ± 0.002  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  13.799 ± 0.233  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.612 ± 0.012  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  23.300 ± 1.079  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.884 ± 0.004  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  20.449 ± 0.446  ops/us

Ice Lake:

Main:
Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar         1024  thrpt   15   0.547 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector         1024  thrpt   75   9.842 ± 0.334  ops/us
VectorUtilBenchmark.floatDotProductScalar     1024  thrpt   15   2.471 ± 0.002  ops/us
VectorUtilBenchmark.floatDotProductVector     1024  thrpt   75  13.452 ± 0.455  ops/us
VectorUtilBenchmark.floatSquareScalar         1024  thrpt   15   1.749 ± 0.004  ops/us
VectorUtilBenchmark.floatSquareVector         1024  thrpt   75  11.813 ± 0.456  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.528 ± 0.003  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   9.919 ± 0.345  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.314 ± 0.003  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  13.137 ± 0.155  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.248 ± 0.025  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  11.920 ± 0.469  ops/us

Cascade Lake:

Main:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.578 ± 0.005  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.907 ± 0.095  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   1.742 ± 0.003  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  13.935 ± 0.129  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   1.347 ± 0.005  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  12.526 ± 0.132  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.641 ± 0.002  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.823 ± 0.114  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.401 ± 0.014  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  13.874 ± 0.116  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.629 ± 0.016  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  12.462 ± 0.123  ops/us

Haswell:

Main:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.728 ± 0.005  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   6.781 ± 0.071  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   1.730 ± 0.034  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  10.603 ± 0.351  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   1.398 ± 0.060  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75   9.470 ± 0.286  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.199 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   6.775 ± 0.083  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   2.465 ± 0.017  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  10.410 ± 0.300  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.299 ± 0.005  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75   9.117 ± 0.118  ops/us

Commit: use a heuristic that may upset fanboys but is really practical and simple
@rmuir (Member, Author) commented Nov 4, 2023

Here are the ARMs. I had to tweak ARM to use FMA more aggressively to fully utilize the Gravitons. The problem there is just Apple Silicon; it is good we did not move forward with benchmarks based solely on some Macs. You may not like my detector, but I think it is quite practical and prevents slow execution.
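
For context, a hedged sketch of the kind of detector being described (a guess at the shape, not necessarily the merged code): on aarch64, assume scalar FMA is fast except on Apple Silicon, approximated by checking for macOS. As the next comment notes, this misses Apple chips running Linux, which is where the sysprop override comes in.

static boolean hasFastScalarFma() {
  boolean aarch64 = "aarch64".equals(System.getProperty("os.arch"));
  boolean macos = System.getProperty("os.name").startsWith("Mac");
  // Gravitons want FMA used aggressively; Apple Silicon is the slow outlier
  return aarch64 && !macos;
}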

Graviton 3

Main:
Benchmark                                  (size)   Mode  Cnt   Score    Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.682 ±  0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   5.500 ±  0.004  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   2.411 ±  0.037  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  11.522 ±  0.234  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.169 ±  0.005  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75   8.632 ±  0.084  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.422 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   6.911 ± 0.039  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.751 ± 0.007  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  11.498 ± 0.418  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.202 ± 0.007  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  10.795 ± 0.154  ops/us

Graviton 2

Main:
Benchmark                                  (size)   Mode  Cnt  Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15  0.647 ± 0.002  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  2.599 ± 0.002  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15  1.430 ± 0.007  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  6.192 ± 0.098  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15  1.194 ± 0.003  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  4.797 ± 0.088  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt  Score    Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15  1.571 ±  0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  5.408 ±  0.013  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15  2.055 ±  0.066  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  6.673 ±  0.260  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15  1.753 ±  0.001  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  6.179 ±  0.070  ops/us

Mac M1

Main:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.077 ± 0.002  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   7.651 ± 0.032  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.606 ± 0.032  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.296 ± 0.268  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.197 ± 0.001  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  14.185 ± 0.099  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   2.062 ± 0.006  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   7.644 ± 0.030  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   4.273 ± 0.003  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.110 ± 0.283  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.770 ± 0.007  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  14.184 ± 0.100  ops/us

@uschindler (Contributor):

You may not like my detector, but I think it is quite practical and prevents slow execution.

The detector is funny, but it won't detect slow Apple Silicon if you run Linux on a Mac. But I agree it is OK.

It is good that we have the sysprops to force FMA on or off, overriding the default detection if needed. So on Apple chips with Linux you can disable it. 👻

@rmuir (Member, Author) commented Nov 4, 2023

It is good that we have the sysprops to enforce FMA or disable it, overriding default detection if needed. So on apple chips with Linux you can disable it. 👻

Exactly. We can't detect all cases perfectly or predict the future, but I don't want this to be a hassle: I want things to be fast by default everywhere if at all possible (without complex logic). Hence the simple heuristic. If there is a problem with it, there's a workaround (the sysprop).

@rmuir (Member, Author) commented Nov 4, 2023

For transparency, this was my testing procedure. I did lots of other things such as poking around and running experiments too, but the basics of "running the benchmark across different instance types" can all be easily automated with tools like Ansible and run in parallel. The question is: how to visualize the data?

# login
ssh -i robkeypair.pem ec2-user@<ip>
# disable system slowdowns
sudo grubby --remove-args="selinux=1 security=selinux quiet" --args="mitigations=0 random.trust_cpu=1 loglevel=7 selinux=0" --update-kernel=ALL && sudo reboot
# login again
ssh -i robkeypair.pem ec2-user@<ip>
-if x86
  # install packages for testing
  sudo yum install -y git g++ make
  # clone avx-turbo
  git clone git@github.com:travisdowns/avx-turbo.git
  # build avx-turbo
  cd avx-turbo; make
  # load msr module
  sudo modprobe msr
  # run avx-turbo
  sudo ./avx-turbo
  # examine results, look for any oddities
    # look at avx*_imul, avx*_fma, and avx*_fma_t.
    # check ratio of avx512_imul to avx256_imul and look at clock difference
    # check ratio of avx512_fma_t to avx256_fma_t and look at clock difference
    # check ratio of avx*_fma_t to avx*_fma (divided by 2 for HT)
  cd ..
  curl -f https://download.java.net/java/GA/jdk21.0.1/415e3f918a1f4062a0074a2794853d0d/12/GPL/openjdk-21.0.1_linux-x64_bin.tar.gz | tar -zxvf -
-else aarch64
  sudo yum install -y git
  curl -f https://download.java.net/java/GA/jdk21.0.1/415e3f918a1f4062a0074a2794853d0d/12/GPL/openjdk-21.0.1_linux-aarch64_bin.tar.gz | tar -zxvf -
-endif
# download java
# configure java (also in case i get disconnected)
echo 'export JAVA_HOME=/home/ec2-user/jdk-21.0.1' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
# prevent benchmark interference from daemon
mkdir ~/.gradle
echo 'org.gradle.daemon=false' > ~/.gradle/gradle.properties
# clone lucene
git clone git@github.com:rmuir/lucene.git; cd lucene
# run benchmark (main)
./gradlew -p lucene/benchmark-jmh assemble
java -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar float -p size=1024
# run benchmark (patch)
git checkout float_scalar_fma_unroll
./gradlew -p lucene/benchmark-jmh assemble
java -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar float -p size=1024

@rmuir (Member, Author) commented Nov 4, 2023

And yeah, I know avx-turbo measures double precision when it "benches" FMA while we do float precision. But it's code that's already written and a nice non-Java way to get the wanted info, and it seems pretty in line with the results.

@uschindler uschindler closed this Nov 4, 2023
@uschindler uschindler reopened this Nov 4, 2023
@uschindler (Contributor):

Sorry, pressed wrong button. Reopened.

@rmuir rmuir merged commit 40e55b0 into apache:main Nov 4, 2023
8 checks passed
asfgit pushed a commit that referenced this pull request Nov 4, 2023
@uschindler (Contributor):

Thanks for the hard benchmarking work! 🍻
