
Speed up VectorUtil float scalar methods, unroll properly, use FMA where possible #12737

Merged: 34 commits into apache:main, Nov 4, 2023

Conversation

@rmuir (Member) commented Oct 31, 2023

Apply the same logic to the float scalar functions as we have for the vector functions. They are really doing the same instructions after all, just at a different width, so they shouldn't be so crazy different...

The scalar functions currently have various ad-hoc unrolling that doesn't keep the CPU's FPUs properly busy, and data dependencies that clog everything up and prevent parallelization: this fixes that.

This gives e.g. a 2x speedup to cosine on both x86 and ARM, but all the functions get some sort of speedup on both architectures. FMA is used on x86 where available, to really keep the CPU busy, same as in the vector case. It is not used on ARM.
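
As a rough sketch of the idea (an illustration, not the exact merged code): give each lane of the unroll its own accumulator so no iteration waits on the previous sum, and let Math.fma map to a hardware FMA instruction where available.

static float dotProduct(float[] a, float[] b) {
  float acc1 = 0, acc2 = 0, acc3 = 0, acc4 = 0;
  int i = 0;
  int upperBound = a.length & ~(4 - 1); // largest multiple of 4 <= length
  for (; i < upperBound; i += 4) {
    // four independent dependency chains keep the FPU pipelines full
    acc1 = Math.fma(a[i], b[i], acc1);
    acc2 = Math.fma(a[i + 1], b[i + 1], acc2);
    acc3 = Math.fma(a[i + 2], b[i + 2], acc3);
    acc4 = Math.fma(a[i + 3], b[i + 3], acc4);
  }
  for (; i < a.length; i++) { // leftover tail
    acc1 = Math.fma(a[i], b[i], acc1);
  }
  return acc1 + acc2 + acc3 + acc4;
}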

main (skylake):

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.634 ± 0.004  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   1.979 ± 0.034  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.433 ± 0.017  ops/us

patch (skylake): Good speedup across the board, e.g. 2x for cosine

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   1.328 ± 0.020  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   2.765 ± 0.032  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   2.457 ± 0.038  ops/us

patch (skylake, -XX:-UseFMA): Still decent speedup even without using FMA, just not as much

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.861 ± 0.007  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   2.223 ± 0.034  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.873 ± 0.020  ops/us

Main (arm):

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   1.074 ± 0.009  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   3.598 ± 0.009  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   3.191 ± 0.006  ops/us

Patch (arm): Good speedups (2x for cosine). FMA is not used.

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   2.061 ± 0.003  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   4.272 ± 0.003  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   3.766 ± 0.013  ops/us

@rmuir (Member, Author) commented Oct 31, 2023

With all the data dependencies removed, I also gave at least one stab at trying to trick the compiler into using packed instructions instead of single floats... it would be awesome if we could just delete our vector code for float! I didn't succeed, but I think it deserves a few more tries.

@rmuir (Member, Author) commented Oct 31, 2023

E.g. for the dot product case with this patch, despite there being no data dependencies, the compiler literally does 4 VFMADD*SS instructions in the loop with different xmm registers, instead of just doing 1 VFMADD*PS with one xmm register containing the 4 floats. Seems dumb to me.

float acc2 = 0;
float acc3 = 0;
float acc4 = 0;
int upperBound = a.length & ~(4 - 1); // round the length down to a multiple of 4
@rmuir (Member, Author) commented on this hunk:

This logic is from the javadocs of the Vector API's VectorSpecies.loopBound, where the width is a power of 2. I used it in these functions for consistency, and because I assume it means the compiler will do a good job. We could maybe add a static method for other places that need this (e.g. StringHelper's murmurhash) as a followup? I'm guessing any other places do it ad hoc like what was here before. I wanted to keep this PR minimally invasive though.
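
For illustration, the masking trick as a standalone helper (a sketch; loopBound is a hypothetical name echoing the Vector API method):

static int loopBound(int length, int width) {
  assert Integer.bitCount(width) == 1 : "width must be a power of two";
  // clearing the low bits rounds length down to a multiple of width
  return length & ~(width - 1);
}
// e.g. loopBound(1027, 4) == 1024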

@rmuir (Member, Author) commented Oct 31, 2023

I tried naively writing the logic like this with a couple of values of N (8, 16, 32, etc.), with FMA both off and on, to see if I could baby this compiler into vectorizing: nope, nothing. I don't think autovectorization works except for BitSet :)

// loop one, where's my vectorization? no floating point excuses here. tried with fma too.
// (i, res, a, b are declared earlier in the surrounding method)
float[] acc = new float[32];
int upperBound = a.length & ~(32 - 1);
for (; i < upperBound; i += 32) {
  for (int j = 0; j < acc.length; j++) {
    acc[j] = a[i+j] * b[i+j] + acc[j];
  }
}
// second reduction loop
for (int j = 0; j < acc.length; j++) {
  res += acc[j];
}

@rmuir (Member, Author) commented Oct 31, 2023

Also, it was previously just confusing to see things like vector benchmark results with 128-bit ARM vectors going 8x faster than scalar 32-bit floats, which makes no logical sense: a 128-bit vector only holds four 32-bit floats, so ~4x is the theoretical ceiling. E.g. with this change, NEON-128 dot product is 3.95x faster than the 32-bit float dot product.

@uschindler (Contributor) left a comment:

Looks fine. Thanks for moving the detection method to the Constants class.

I was thinking about this too, but I was afraid that a non-local (and public) constant is not treated the same by HotSpot.

@uschindler (Contributor) commented Oct 31, 2023

One thing: we should benchmark this in Lucene 9.x with Java 11, too. I want to make sure we get the same or a similar speedup there, and that there is no regression.

@uschindler (Contributor) commented Oct 31, 2023

Did you also benchmark this PR with Java 17? Your benchmark output does not say which version it uses.

@dweiss (Contributor) commented Oct 31, 2023

Nice. I checked with an AMD Ryzen Threadripper 3970X. Note it's actually slightly faster when not using FMA...

JDK 17, patch:

java -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024

VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  1.611 ± 0.009  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.324 ± 0.005  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  3.305 ± 0.053  ops/us

java -XX:-UseFMA -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024

VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  2.028 ± 0.012  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  4.045 ± 0.011  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  3.254 ± 0.043  ops/us

JDK 17, main (85f5d3bb0bf):

java -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024

VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  1.192 ± 0.006  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.837 ± 0.005  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  2.523 ± 0.014  ops/us

java -XX:-UseFMA -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024

VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  1.196 ± 0.006  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.841 ± 0.015  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  2.525 ± 0.010  ops/us

@uschindler (Contributor):

@dweiss Are you sure -XX:-UseFMA was applied to the forked child JVMs of the benchmark? As far as I remember, you have to pass JVM options as a separate parameter for them to be applied to the children.

@dweiss (Contributor) commented Oct 31, 2023

JMH states that:

  -jvmArgs <string>           Use given JVM arguments. Most options are inherited
                              from the host VM options, [...]

and I've always relied on that. Here's an explicit run though (on main, JDK 17); it is consistent with the previous output:

VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  1.222 ±  0.001  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.837 ±  0.007  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  2.519 ±  0.009  ops/us

@dweiss (Contributor) commented Oct 31, 2023

Out of curiosity I checked (by sysouting the status of the constant): -XX:+/-UseFMA is passed to the forked JVMs, so these JMH incantations are equivalent:

java -XX:-UseFMA -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024
java -jar lucene\benchmark-jmh\build\benchmarks\lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar "VectorUtilBenchmark\.float.*Scalar" -p size=1024 -jvmArgs "-XX:-UseFMA"

@dweiss (Contributor) commented Oct 31, 2023

Here's how JMH copies the runner VM's input arguments:

        jvmArgs.addAll(options.getJvmArgs()
                .orElseGet(() -> benchmark.getJvmArgs()
                .orElseGet(() -> ManagementFactory.getRuntimeMXBean().getInputArguments())
        ));

You're right that it's kind of vague what that bean actually returns. In this case though, it works.

@uschindler (Contributor):

Thanks, great. By the way, if you run the incubator impl it prints the FMA status as a log message; just the scalar one doesn't.

@rmuir (Member, Author) commented Oct 31, 2023

Nice. I checked with AMD Ryzen Threadripper 3970X. Note it's actually slightly faster when not using FMA...

@dweiss thanks for testing; I expect this on AMD. But "slightly" is very, very slight; I think it is best to still use FMA here. https://www.reddit.com/r/cpp/comments/xtoj93/comment/iqrmmuo/

@uschindler (Contributor):

Yeah. Please keep it, as it gives more correct results.

@rmuir (Member, Author) commented Oct 31, 2023

Hmm, I think I read @dweiss's results in the wrong order... it seems like a fairly big difference? We actually regress scalar dot product for his CPU here. And I assume the vector case behaves the same way?

If we want to avoid it on Zen 2 I am fine with that, if you tell me how to detect it. But I assume on Zen 3 things are fine; FMA has lower latency there.

@uschindler (Contributor):

I will check on my Zen on Policeman Jenkins later.

You can't easily get the Zen version or detect Ryzen CPUs. Please do not add cpuinfo parsing, as this won't work on other platforms like Windows.

I am fine with having a slowdown for more correctness.

@rmuir (Member, Author) commented Oct 31, 2023

You can't get zen version or detect Ryzen CPUs easily. Please do not add cpuinfo parsing as this won't work on alternate platforms like Windows.

We can just retrieve another flag to see it. OpenJDK already detects stuff like this and then sets flags accordingly, so we can infer it.

https://github.com/openjdk/jdk/blob/e05cafda78a37dbeb2df2edd791be19d22edaece/src/hotspot/cpu/x86/vm_version_x86.cpp#L1463

I will make OpenJDK "give up the goods" despite it trying to hide them :)
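
A minimal sketch of that inference, assuming the standard jdk.management API ("UseFMA" below is just an illustrative flag that HotSpot only enables when the CPU supports it; the exact flags consulted in the PR are not shown here):

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

static boolean isVmFlagEnabled(String flag) {
  HotSpotDiagnosticMXBean bean =
      ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
  // HotSpot sets these flags from its own CPU detection at startup, so
  // reading them back infers CPU capabilities without parsing cpuinfo.
  return Boolean.parseBoolean(bean.getVMOption(flag).getValue());
}
// e.g. isVmFlagEnabled("UseFMA")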

@rmuir (Member, Author) commented Oct 31, 2023

We can also detect newer Zen (e.g. Zen 4) by it having 512-bit vectors, so we just need to detect AMD vs Intel.

Anyway, I'm actually fine with removing FMA from this PR completely and revisiting its usage for vectors on AMD CPUs. I'm not trying to make the code go slower.

@uschindler (Contributor):

Please keep FMA for now and allow us to benchmark.

@rmuir (Member, Author) commented Oct 31, 2023

This should solve @dweiss's problem: f2be84f

It should also improve the speed of the vectorized case on AMD in the same way.

@uschindler (Contributor):

This should solve @dweiss problem: f2be84f

It should also improve speed of vectorized case on AMD in the same way.

So basically this detects SSE4a, which is AMD-only.

@mikemccand (Member) commented Oct 31, 2023

Thanks @rmuir!

Results from the nightly benchy box (Ryzen Threadripper 3990X, 64 cores / 128 threads) -- whoops, this is JDK 17. I'm running again with JDK 20:

main:

Benchmark                                  (size)   Mode  Cnt  Score    Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  1.206 ±  0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt    5  1.204 ±  0.001  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.851 ±  0.002  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  3.851 ±  0.003  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  2.516 ±  0.092  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  2.541 ±  0.001  ops/us

PR:

Benchmark                                  (size)   Mode  Cnt  Score    Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5  2.038 ±  0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt    5  2.030 ±  0.005  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5  3.999 ±  0.021  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  4.054 ±  0.003  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5  3.181 ±  0.740  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  3.241 ±  0.031  ops/us

Very nice speedups!

@rmuir (Member, Author) commented Nov 3, 2023

AMD EPYC 9R14:
INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled

main:

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.841 ± 0.002  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.540 ± 0.001  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.539 ± 0.010  ops/us

patch:

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.764 ± 0.002  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.584 ± 0.002  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.562 ± 0.004  ops/us

@uschindler (Contributor):

What are the results for the Vector API on this CPU?

@rmuir (Member, Author) commented Nov 3, 2023

Vector results for this AMD CPU are unchanged by this PR.

Float-relevant performance info from avx-turbo: this CPU doesn't downclock, but a 512-bit FMA has half the instruction throughput of a 256-bit FMA, so I did some experiments...

Cores | ID                  | Description                       | OVRLP3 |  Mops | A/M-ratio | A/M-MHz | M/tsc-ratio
1     | avx128_fma_t        | 128-bit parallel DP FMAs          |  1.000 |  7402 |      1.42 |    3700 |        1.00
1     | avx256_fma_t        | 256-bit parallel DP FMAs          |  1.000 |  7402 |      1.42 |    3700 |        1.00
1     | avx512_fma_t        | 512-bit parallel DP FMAs          |  1.000 |  3700 |      1.42 |    3700 |        1.00

Float:
INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  13.397 ± 0.205  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.226 ± 0.434  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.147 ± 0.394  ops/us

Float (avoiding AVX-512 entirely by passing -XX:MaxVectorSize=32)
INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  11.234 ± 0.041  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  17.045 ± 0.436  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.876 ± 0.351  ops/us

Binary-relevant performance info from avxturbo:

Cores | ID                  | Description                       | OVRLP3 |  Mops | A/M-ratio | A/M-MHz | M/tsc-ratio
1     | avx128_imul         | 128-bit integer muls (vpmuldq)    |  1.000 |  1233 |      1.42 |    3700 |        1.00
1     | avx256_imul         | 256-bit integer muls (vpmuldq)    |  1.000 |  1233 |      1.42 |    3700 |        1.00
1     | avx512_imul         | 512-bit integer muls (vpmuldq)    |  1.000 |  1233 |      1.42 |    3700 |        1.00

Binary:
INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineVector        1024  thrpt   15   8.769 ± 0.083  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15  22.362 ± 0.054  ops/us
VectorUtilBenchmark.binarySquareVector        1024  thrpt   15  18.080 ± 0.171  ops/us

Binary (512-bit vectors but disabling Intel-specific downclock-protection / doing 32-bit vpmul)
INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineVector        1024  thrpt   15  10.669 ± 0.242  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15  21.148 ± 0.087  ops/us
VectorUtilBenchmark.binarySquareVector        1024  thrpt   15  18.048 ± 0.142  ops/us

Binary (avoiding AVX-512 entirely by passing -XX:MaxVectorSize=32)
INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineVector        1024  thrpt   15   8.773 ± 0.006  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15  17.484 ± 0.022  ops/us
VectorUtilBenchmark.binarySquareVector        1024  thrpt   15  14.930 ± 0.018  ops/us

@rmuir (Member, Author) commented Nov 3, 2023

So you can see the difference in approach. Personally I prefer how this AMD AVX-512 works: for some operations the 512-bit variant just isn't any faster than the 256-bit variant, versus Intel's approach of slowing down other things on the computer :)

@rmuir (Member, Author) commented Nov 4, 2023

I tweaked the FMA logic for AMD CPUs to only avoid the high-latency scalar FMA where necessary. Should appease the Germans wanting that extra ULP or whatever.

The sysprops default to "auto", so you can override however you want, without fear of involving BigDecimal :)

I can test the Intel and ARM families in the same way and try to tighten it up tomorrow.
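
A sketch of how such a three-state "auto"/"true"/"false" property could be read (the property name below is made up for illustration, not necessarily the one in the PR):

static boolean useScalarFma(boolean heuristicDefault) {
  // hypothetical property name; "auto" defers to the CPU-based heuristic
  String value = System.getProperty("lucene.useScalarFMA", "auto");
  return switch (value) {
    case "true" -> true;   // force FMA on
    case "false" -> false; // force FMA off
    default -> heuristicDefault;
  };
}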

AMD Zen4: EPYC 9R14 (family 0x19)

Main:
Benchmark                                  (size)   Mode  Cnt   Score    Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.842 ±  0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  13.497 ±  0.171  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.540 ±  0.002  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.441 ±  0.424  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.540 ±  0.008  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.655 ±  0.575  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.763 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  13.477 ± 0.168  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.583 ± 0.003  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.438 ± 0.493  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.560 ± 0.009  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  15.778 ± 0.114  ops/us

AMD Zen3: EPYC 7R13 (family 0x19)

Main:
Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar         1024  thrpt   15   0.982 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector         1024  thrpt   75  10.476 ± 0.026  ops/us
VectorUtilBenchmark.floatDotProductScalar     1024  thrpt   15   3.246 ± 0.015  ops/us
VectorUtilBenchmark.floatDotProductVector     1024  thrpt   75  16.959 ± 0.480  ops/us
VectorUtilBenchmark.floatSquareScalar         1024  thrpt   15   2.298 ± 0.010  ops/us
VectorUtilBenchmark.floatSquareVector         1024  thrpt   75  16.342 ± 0.508  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.344 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  10.445 ± 0.048  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.405 ± 0.006  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.486 ± 0.374  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.995 ± 0.002  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.374 ± 0.462  ops/us

AMD Zen2: EPYC 7R32 (family 0x17)

Main:
Benchmark                                   (size)   Mode  Cnt   Score    Error   Units
VectorUtilBenchmark.floatCosineScalar         1024  thrpt   15   0.922 ±  0.005  ops/us
VectorUtilBenchmark.floatCosineVector         1024  thrpt   75   8.519 ±  0.020  ops/us
VectorUtilBenchmark.floatDotProductScalar     1024  thrpt   15   2.968 ±  0.020  ops/us
VectorUtilBenchmark.floatDotProductVector     1024  thrpt   75  15.950 ±  0.486  ops/us
VectorUtilBenchmark.floatSquareScalar         1024  thrpt   15   2.015 ±  0.012  ops/us
VectorUtilBenchmark.floatSquareVector         1024  thrpt   75  15.894 ±  0.331  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.200 ± 0.005  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.520 ± 0.018  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.114 ± 0.021  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  15.671 ± 0.439  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.490 ± 0.030  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  15.189 ± 0.170  ops/us

@mikemccand (Member):

Thank you @rmuir for doing all the crazy hard work to decode the actual capabilities of the bare metal hiding underneath the layers of abstraction of the Panama Vector API! I love the Constants approach.

versus intel's approach of slowing down other things on the computer :)

!!

@uschindler (Contributor) left a comment:

Hi @rmuir,
I polished the documentation of VectorUtil a bit, so people know the knobs for enabling the incubator module and tuning FMA. This looks fine to me now.

The good thing about the three-state sysprop is that you can easily run benchmarks on newer CPUs without modifying the code.

@uschindler (Contributor) commented Nov 4, 2023

@rmuir: It would be nice if you could follow the community standard and merge this long PR with the GitHub UI and squash it - thanks. I can do it for you if you like.

@mikemccand (Member):

I tested on my now-ancient Zen 2 beast3 (nightly benchmark) box (AMD Ryzen Threadripper 3990X 64-Core Processor), using JDK 21 (openjdk full version "21+35"), with the command line: ./gradlew clean; ./gradlew -p lucene/benchmark-jmh assemble; java -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar float -p size=1024.

[An aside: strangely, to test a PR I normally download and apply the .diff or .patch using patch -p1 < X.diff/patch, but for this PR, patch reported non-trivial (to me!) conflicts. So instead I ran GitHub's suggested command-line steps for merging, and got a cleanly applied version of this PR to run the benchy.]

main:

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.176 ± 0.011  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  11.015 ± 0.029  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.870 ± 0.011  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  22.879 ± 0.407  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.604 ± 0.023  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  21.293 ± 0.289  ops/us

PR:

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.553 ± 0.009  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  10.995 ± 0.025  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   4.051 ± 0.029  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  22.887 ± 0.396  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.254 ± 0.008  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  21.238 ± 0.420  ops/us

@uschindler (Contributor) commented Nov 4, 2023

[An aside: strangely, to test the PR, I normally download and apply the .diff or .patch using patch -p1 < X.diff/patch, but for this PR there are non-trivial (to me!) conflicts reported by patch. So instead I ran the suggested github command-line steps for merging, and got a clean applied version of this PR to run the benchy.]

GitHub has had some bugs since yesterday (PRs also 404 for some time). Actually the .patch is completely unusable, as it partly contains already-merged information; the .diff looks fine to me.

My recommendation: merge the PR into your branch using the command line provided by GitHub or, much easier, add Robert's repository as an "rmuir" upstream remote. I already have the common repos from Robert and Chris in my git config, so it's simple to check them out and work on them directly.

@rmuir (Member, Author) commented Nov 4, 2023

@rmuir: It would be nice if you could follow the community standard and merge this long PR with Github UI and squash it - thanks. I can do it for you if you like.

I am not done here yet; I want to benchmark and try to tighten the Intel and ARM models first too. At least do the best I can to get the best performance out of all of them.

Whether to squash or not is my decision. Just like maybe the community standard is IntelliJ, I use vim.

@rmuir (Member, Author) commented Nov 4, 2023

Benchmarks for the Intel CPUs. There is one place I'd fix if we could detect Sapphire Rapids and avoid scalar FMA there, but I have no way to detect it based on what new features it has / what OpenJDK exposes at the moment. Otherwise performance is good.

Sapphire Rapids:

Main:
Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar         1024  thrpt   15   0.871 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector         1024  thrpt   75  13.907 ± 0.266  ops/us
VectorUtilBenchmark.floatDotProductScalar     1024  thrpt   15   4.275 ± 0.023  ops/us
VectorUtilBenchmark.floatDotProductVector     1024  thrpt   75  22.218 ± 0.759  ops/us
VectorUtilBenchmark.floatSquareScalar         1024  thrpt   15   2.819 ± 0.004  ops/us
VectorUtilBenchmark.floatSquareVector         1024  thrpt   75  20.243 ± 0.352  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.650 ± 0.002  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  13.799 ± 0.233  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.612 ± 0.012  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  23.300 ± 1.079  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.884 ± 0.004  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  20.449 ± 0.446  ops/us

Ice Lake:

Main:
Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar         1024  thrpt   15   0.547 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector         1024  thrpt   75   9.842 ± 0.334  ops/us
VectorUtilBenchmark.floatDotProductScalar     1024  thrpt   15   2.471 ± 0.002  ops/us
VectorUtilBenchmark.floatDotProductVector     1024  thrpt   75  13.452 ± 0.455  ops/us
VectorUtilBenchmark.floatSquareScalar         1024  thrpt   15   1.749 ± 0.004  ops/us
VectorUtilBenchmark.floatSquareVector         1024  thrpt   75  11.813 ± 0.456  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.528 ± 0.003  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   9.919 ± 0.345  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.314 ± 0.003  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  13.137 ± 0.155  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.248 ± 0.025  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  11.920 ± 0.469  ops/us

Cascade Lake:

Main:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.578 ± 0.005  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.907 ± 0.095  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   1.742 ± 0.003  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  13.935 ± 0.129  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   1.347 ± 0.005  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  12.526 ± 0.132  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.641 ± 0.002  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.823 ± 0.114  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.401 ± 0.014  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  13.874 ± 0.116  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.629 ± 0.016  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  12.462 ± 0.123  ops/us

Haswell:

Main:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.728 ± 0.005  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   6.781 ± 0.071  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   1.730 ± 0.034  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  10.603 ± 0.351  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   1.398 ± 0.060  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75   9.470 ± 0.286  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.199 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   6.775 ± 0.083  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   2.465 ± 0.017  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  10.410 ± 0.300  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.299 ± 0.005  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75   9.117 ± 0.118  ops/us

Commit: use a heuristic that may upset fanboys but is really practical and simple
@rmuir (Member, Author) commented Nov 4, 2023

Here are the ARMs. I had to tweak ARM to use FMA more aggressively to fully utilize the Gravitons. The problem there is just Apple Silicon; it is good we did not move forward with benchmarks based solely on some Macs. You may not like my detector, but I think it is quite practical and prevents slow execution.
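
For context, a hedged sketch of the kind of detector being described (a guess at the shape, not necessarily the merged code): on aarch64, assume scalar FMA is fast except on Apple Silicon, approximated by checking for macOS. As the next comment notes, this misses Apple chips running Linux, which is where the sysprop override comes in.

static boolean hasFastScalarFma() {
  boolean aarch64 = "aarch64".equals(System.getProperty("os.arch"));
  boolean macos = System.getProperty("os.name").startsWith("Mac");
  // Gravitons want FMA used aggressively; Apple Silicon is the slow outlier
  return aarch64 && !macos;
}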

Graviton 3

Main:
Benchmark                                  (size)   Mode  Cnt   Score    Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.682 ±  0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   5.500 ±  0.004  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   2.411 ±  0.037  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  11.522 ±  0.234  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.169 ±  0.005  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75   8.632 ±  0.084  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.422 ± 0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   6.911 ± 0.039  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.751 ± 0.007  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  11.498 ± 0.418  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.202 ± 0.007  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  10.795 ± 0.154  ops/us

Graviton 2

Main:
Benchmark                                  (size)   Mode  Cnt  Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15  0.647 ± 0.002  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  2.599 ± 0.002  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15  1.430 ± 0.007  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  6.192 ± 0.098  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15  1.194 ± 0.003  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  4.797 ± 0.088  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt  Score    Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15  1.571 ±  0.001  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  5.408 ±  0.013  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15  2.055 ±  0.066  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  6.673 ±  0.260  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15  1.753 ±  0.001  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  6.179 ±  0.070  ops/us

Mac M1

Main:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.077 ± 0.002  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   7.651 ± 0.032  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.606 ± 0.032  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.296 ± 0.268  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.197 ± 0.001  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  14.185 ± 0.099  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   2.062 ± 0.006  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   7.644 ± 0.030  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   4.273 ± 0.003  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.110 ± 0.283  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.770 ± 0.007  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  14.184 ± 0.100  ops/us

@uschindler (Contributor):

You may not like my detector, but I think it is quite practical and prevents slow execution.

The detector is funny, but it won't detect slow Apple Silicon if you run Linux on a Mac. But I agree it is OK.

It is good that we have the sysprops to force FMA on or off, overriding the default detection if needed. So on Apple chips with Linux you can disable it. 👻

@rmuir (Member, Author) commented Nov 4, 2023

It is good that we have the sysprops to enforce FMA or disable it, overriding default detection if needed. So on apple chips with Linux you can disable it. 👻

Exactly. We can't detect all cases perfectly or predict the future, but I don't want this to be a hassle: I want things to be fast by default everywhere if at all possible (without complex logic). Hence the simple heuristic. If there is a problem with it, there's a workaround (the sysprop).

@rmuir (Member, Author) commented Nov 4, 2023

For transparency, this was my testing procedure. I did lots of other things such as poking around and running experiments too, but the basics of "running the benchmark across different instance types" can all be easily automated with tools like Ansible and run in parallel. The question is: how to visualize the data?

# login
ssh -i robkeypair.pem ec2-user@<ip>
# disable system slowdowns
sudo grubby --remove-args="selinux=1 security=selinux quiet" --args="mitigations=0 random.trust_cpu=1 loglevel=7 selinux=0" --update-kernel=ALL && sudo reboot
# login again
ssh -i robkeypair.pem ec2-user@<ip>
-if x86
  # install packages for testing
  sudo yum install -y git g++ make
  # clone avx-turbo
  git clone git@github.com:travisdowns/avx-turbo.git
  # build avx-turbo
  cd avx-turbo; make
  # load msr module
  sudo modprobe msr
  # run avx-turbo
  sudo ./avx-turbo
  # examine results, look for any oddities
    # look at avx*_imul, avx*_fma, and avx*_fma_t.
    # check ratio of avx512_imul to avx256_imul and look at clock difference
    # check ratio of avx512_fma_t to avx256_fma_t and look at clock difference
    # check ratio of avx*_fma_t to avx*_fma (divided by 2 for HT)
  cd ..
  curl -f https://download.java.net/java/GA/jdk21.0.1/415e3f918a1f4062a0074a2794853d0d/12/GPL/openjdk-21.0.1_linux-x64_bin.tar.gz | tar -zxvf -
-else aarch64
  sudo yum install -y git
  curl -f https://download.java.net/java/GA/jdk21.0.1/415e3f918a1f4062a0074a2794853d0d/12/GPL/openjdk-21.0.1_linux-aarch64_bin.tar.gz | tar -zxvf -
-endif
# download java
# configure java (also in case i get disconnected)
echo 'export JAVA_HOME=/home/ec2-user/jdk-21.0.1' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
# prevent benchmark interference from daemon
mkdir ~/.gradle
echo 'org.gradle.daemon=false' > ~/.gradle/gradle.properties
# clone lucene
git clone git@github.com:rmuir/lucene.git; cd lucene
# run benchmark (main)
./gradlew -p lucene/benchmark-jmh assemble
java -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar float -p size=1024
# run benchmark (patch)
git checkout float_scalar_fma_unroll
./gradlew -p lucene/benchmark-jmh assemble
java -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar float -p size=1024

@rmuir (Member, Author) commented Nov 4, 2023

And yeah, I know avx-turbo measures double precision when it "benches" FMA while we do float precision. But it's code that's already written and a nice non-Java way to get the wanted info, and it seems pretty in line with the results.

@uschindler uschindler closed this Nov 4, 2023
@uschindler uschindler reopened this Nov 4, 2023
@uschindler (Contributor):

Sorry, pressed wrong button. Reopened.

@rmuir rmuir merged commit 40e55b0 into apache:main Nov 4, 2023
8 checks passed
asfgit pushed a commit that referenced this pull request Nov 4, 2023
@uschindler (Contributor):

Thanks for the hard benchmarking work! 🍻
