Speed up vectorutil float scalar methods, unroll properly, use fma where possible #12737
Conversation
With all the data dependencies removed, I also gave at least one stab at trying to trick the compiler into using packed instructions instead of single floats... it would be awesome if we could just delete our vector code for float! I didn't succeed, but I think it deserves a few more tries.
E.g. for the dot product case, with this patch, despite there being no data dependencies, the compiler literally emits 4 separate scalar operations.
```java
float acc2 = 0;
float acc3 = 0;
float acc4 = 0;
int upperBound = a.length & ~(4 - 1);
```
This logic is from the javadocs of Species.loopBound in the vector API, where the width is a power of 2. I used it in these functions for consistency, and because I assume it means the compiler will do a good job. As a followup, we could maybe add a static method for other places doing this kind of thing (e.g. StringHelper's murmurhash)? I'm guessing any other places do it ad-hoc, like what was here before; I wanted to keep this PR minimally invasive though.
I tried naively writing the logic like this with a couple values of N (8, 16, 32, etc.), with FMA both off and on, to see if I could baby this compiler into vectorizing: nope, nothing. I don't think autovectorization works except for BitSet :)
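For illustration, here is a minimal, self-contained version of the pattern the snippet above comes from. This is a sketch, not the exact VectorUtil code, and it assumes `a` and `b` have equal length:

```java
// Sketch only: unrolled scalar dot product with independent accumulators.
static float dotProduct(float[] a, float[] b) {
  float acc1 = 0, acc2 = 0, acc3 = 0, acc4 = 0;
  // Equivalent of Species.loopBound for a power-of-two width of 4:
  // round a.length down to a multiple of 4 by masking off the low bits.
  int upperBound = a.length & ~(4 - 1);
  int i = 0;
  for (; i < upperBound; i += 4) {
    // Four independent chains keep the FPU pipelines busy: no accumulator
    // depends on the result of another within the same iteration.
    acc1 += a[i] * b[i];
    acc2 += a[i + 1] * b[i + 1];
    acc3 += a[i + 2] * b[i + 2];
    acc4 += a[i + 3] * b[i + 3];
  }
  for (; i < a.length; i++) { // scalar tail for the remaining 0-3 elements
    acc1 += a[i] * b[i];
  }
  return acc1 + acc2 + acc3 + acc4;
}
```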
Also, it was previously just confusing to see things like vector benchmark results where 128-bit ARM vectors go 8x faster than 32-bit scalar floats, which makes no logical sense. E.g. with this change, the NEON-128 dot product is 3.95x faster than the 32-bit scalar float dot product.
Looks fine. Thanks for moving the detection method to the constants class. I was thinking about this too, but I was afraid that a non-local (and public) constant is not treated the same way by HotSpot.
One thing: we should benchmark this in Lucene 9.x with Java 11, too. I want to make sure we get the same or a similar speedup there, and that there isn't a regression.
Did you also benchmark this PR with Java 17? Your benchmark output does not say which version it uses.
Nice. I checked with an AMD Ryzen Threadripper 3970X. Note it's actually slightly faster when not using FMA...
@dweiss Are you sure the -XX:-UseFMA flag actually reaches JMH's forked JVMs?
JMH states that it forwards the runner JVM's input arguments to the forked JVMs, and I've always relied on that. Here's an explicit version though (on main, JDK 17); it is consistent with the previous output:
Out of curiosity I checked (by sysouting the status of the constant): -XX:+/-UseFMA is passed to the forked JVMs, so these incantations of JMH are equivalent:
Here's how JMH copies the runner VM's input arguments. You're right that it's kind of vague what that bean actually returns, but in this case it works.
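The bean in question is presumably RuntimeMXBean. A quick way to verify which flags a forked JVM actually received is a sketch like this (run it inside the fork; the class name is just illustrative):

```java
import java.lang.management.ManagementFactory;

public class ShowArgs {
  public static void main(String[] args) {
    // Prints the JVM arguments this process was started with
    // (e.g. -XX:-UseFMA), i.e. what JMH forwarded to the fork.
    System.out.println(ManagementFactory.getRuntimeMXBean().getInputArguments());
  }
}
```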
Thanks, great. By the way, if you run the incubator impl, it prints the FMA status as a log message; just the scalar one doesn't.
@dweiss thanks for testing. I expect this on AMD, but "slightly" is very, very slight; I think it is best to still use FMA here. https://www.reddit.com/r/cpp/comments/xtoj93/comment/iqrmmuo/
Yeah. Please keep it, as it gives more correct results.
Hmm, I think I read @dweiss's results in the wrong order... it seems like a fairly big difference? We actually regress scalar dotProduct for his CPU here. And I assume the vector case behaves the same way? If we want to avoid FMA on Zen 2, I am fine with that, if you tell me how to detect it. But I assume on Zen 3 things are fine; FMA has lower latency there.
I will check on my Zen on Policeman Jenkins later. You can't easily get the Zen version or detect Ryzen CPUs. Please do not add cpuinfo parsing, as this won't work on other platforms like Windows. I am fine with having a slowdown in exchange for more correctness.
We can just retrieve another flag to see it. OpenJDK already detects stuff like this and sets flags accordingly, so we can infer it. I will make OpenJDK "give up the goods" despite it trying to hide them :)
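One way to read the flags HotSpot set during its own CPU detection is via HotSpotDiagnosticMXBean. A sketch (the flag names are HotSpot-specific and may not exist on every platform; getVMOption throws IllegalArgumentException for unknown flags):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class CpuFlags {
  public static void main(String[] args) {
    HotSpotDiagnosticMXBean bean =
        ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
    // HotSpot sets these based on its own CPUID probing, so reading them
    // back tells us what it detected, without parsing /proc/cpuinfo.
    System.out.println("UseFMA        = " + bean.getVMOption("UseFMA").getValue());
    // A value of 64 implies 512-bit vectors, which (per the discussion)
    // can help distinguish newer Zen from older Zen.
    System.out.println("MaxVectorSize = " + bean.getVMOption("MaxVectorSize").getValue());
  }
}
```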
We can also detect newer Zen (e.g. Zen 4) by it having 512-bit vectors, so we just need to detect AMD vs. Intel. Anyway, I'm actually fine with removing FMA from this PR completely and revisiting its usage for vectors on AMD CPUs. I'm not trying to make the code go slower.
Please keep FMA for now and let us benchmark.
Thanks @rmuir! Results from the nightly benchy box (128-core Ryzen Threadripper 3990X) -- WOOPS, this is JDK 17. I'm running again with JDK 20:
Very nice speedups!
AMD EPYC 9R14:
main:
patch:
What are the results for the Vector API with this CPU?
Vector API results for this AMD CPU are unchanged by this PR. Float-relevant performance info from avxturbo:
Float:
Float (avoiding AVX-512 entirely by passing -XX:MaxVectorSize=32)
Binary-relevant performance info from avxturbo:
Binary:
Binary (512-bit vectors but disabling Intel-specific downclock-protection / doing 32-bit vpmul)
Binary (avoiding AVX-512 entirely by passing -XX:MaxVectorSize=32)
So you can see the difference in approach. Personally, I prefer how this AMD AVX-512 works: for some operations, the 512-bit variant just isn't any faster than the 256-bit variant, versus Intel's approach of slowing down other things on the computer :)
I tweaked the FMA logic for AMD CPUs to only avoid the high-latency scalar FMA where necessary. That should appease the Germans wanting that extra ulp or whatever. The sysprops default to "auto", so you can override however you want, without fear of involving BigDecimal :) I can test the Intel and ARM families in the same way and try to tighten it up tomorrow.
AMD Zen4: EPYC 9R14 (family 0x19)
AMD Zen3: EPYC 7R13 (family 0x19)
AMD Zen2: EPYC 7R32 (family 0x17)
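Roughly what a three-state knob like this looks like. The property name and the heuristic are placeholders here, not Lucene's exact implementation:

```java
// Illustrative sketch of a three-state FMA switch ("auto"/"true"/"false").
static boolean useScalarFma() {
  String prop = System.getProperty("example.useFMA", "auto");
  return switch (prop) {
    case "true" -> true;            // forced on, regardless of CPU
    case "false" -> false;          // forced off, regardless of CPU
    default -> hasFastScalarFma();  // "auto": decide from CPU detection
  };
}

// Placeholder for the detection heuristic discussed above, e.g. "FMA is
// supported, and this is not a family known for high-latency scalar FMA".
static boolean hasFastScalarFma() {
  return true; // stand-in
}
```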
Hi @rmuir,
I polished the documentation of VectorUtil a bit, so people know the knobs for enabling the incubation module and for tuning FMA. This now looks fine to me.
The good thing with the three-state sysprop is that you can easily run benchmarks to test newer CPUs without modifying the code.
@rmuir: It would be nice if you could follow the community standard and merge this long PR with the GitHub UI and squash it - thanks. I can do it for you if you like.
I tested on my now-ancient Zen2 beast3 (nightly benchmark) box. [An aside: strangely, to test the PR, I normally download and apply the .patch from GitHub, but this time it would not apply.]
PR:
There have been some bugs in GitHub since yesterday (PRs have also been 404ing for a while). Actually, the .patch is completely unusable, as it partly contains already-merged information; the .diff looks fine to me. My recommendation: you can merge the PR into your branch using the command line provided by GitHub, or - much easier - add Robert's repository as an "rmuir" upstream remote. I already have the common repos from Robert and Chris in my git config, so it's simple to check them out and work directly on them.
I am not done here yet; I want to benchmark and try to tighten the Intel and ARM models first too, and at least do the best I can to get the best performance out of all of them. Whether to squash or not is my decision. Just like maybe the community standard is IntelliJ: I use vim.
Benchmarks for the Intel CPUs. There is one place I'd fix, if we could detect Sapphire Rapids and avoid scalar FMA there, but I have no way to detect it based on what new features it has / what OpenJDK exposes at the moment. Otherwise performance is good. Sapphire Rapids:
Ice Lake:
Cascade Lake:
Haswell:
use a heuristic that may upset fanboys but is really practical and simple
Here are the ARMs. I had to tweak ARM to use FMA more aggressively to fully utilize the Gravitons. The problem there is just Apple Silicon; it is good we did not move forward with benchmarks based solely on some Macs. You may not like my detector, but I think it is quite practical and prevents slow execution. Graviton 3
Graviton 2
Mac M1
The detector is funny, but it won't detect slow Apple Silicon if you run Linux on the Mac. But I agree it is OK. It is good that we have the sysprops to enforce FMA or disable it, overriding the default detection if needed. So on Apple chips with Linux you can disable it. 👻
Exactly. We can't detect all cases perfectly or predict the future, but I don't want this to be a hassle: I want things to be fast by default everywhere if at all possible, without complex logic. Hence the simple heuristic. If there is a problem with it, there's a workaround (the sysprop).
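For reference, this is the kind of practical check being described. A sketch only (the real detector lives elsewhere in Lucene, and as noted above it won't fire for Linux-on-Mac):

```java
// Sketch of the "avoid FMA on Apple Silicon" heuristic discussed above.
// Deliberately simple: macOS + aarch64 is assumed to mean an Apple CPU.
static boolean isAppleSilicon() {
  String osName = System.getProperty("os.name", "");
  String osArch = System.getProperty("os.arch", "");
  return osName.startsWith("Mac") && osArch.equals("aarch64");
}
```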
For transparency, this was my testing procedure. I did lots of other things too, such as poking around and running experiments, but the basics of "running the benchmark across different instance types" can all be easily automated with tools like ansible and run in parallel. The question is: how to visualize the data?
and yeah, the ...
Sorry, pressed the wrong button. Reopened.
Speed up vectorutil float scalar methods, unroll properly, use fma where possible (#12737) Co-authored-by: Uwe Schindler <[email protected]>
Thanks for the hard benchmarking work! 🍻
Apply the same logic to the float scalar functions as we have for the vector functions. They are really doing the same instructions after all, just at different widths, so they shouldn't be so crazy different...
The scalar functions currently have various ad-hoc unrolling that doesn't keep the CPU's FPUs properly busy, and they have data dependencies that clog everything up and prevent parallelization: this fixes that.
This gives e.g. a 2x speedup to cosine on both x86 and ARM, and all the functions get speedups of some sort on both architectures. FMA is used on x86 where available, to really keep the CPU busy, the same as in the vector case. It is not used on ARM.
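Concretely, "FMA is used where available" means each multiply-add in an accumulator chain can be a single fused operation via Math.fma. A sketch, mirroring the plain unrolled version shown earlier (again assuming equal-length arrays):

```java
// Sketch: the same unrolled loop, but with each multiply-add fused via
// Math.fma (one rounding instead of two; under -XX:+UseFMA, HotSpot
// compiles this to a single fused multiply-add instruction).
static float dotProductFma(float[] a, float[] b) {
  float acc1 = 0, acc2 = 0, acc3 = 0, acc4 = 0;
  int upperBound = a.length & ~(4 - 1);
  int i = 0;
  for (; i < upperBound; i += 4) {
    acc1 = Math.fma(a[i], b[i], acc1);
    acc2 = Math.fma(a[i + 1], b[i + 1], acc2);
    acc3 = Math.fma(a[i + 2], b[i + 2], acc3);
    acc4 = Math.fma(a[i + 3], b[i + 3], acc4);
  }
  for (; i < a.length; i++) { // scalar tail
    acc1 = Math.fma(a[i], b[i], acc1);
  }
  return acc1 + acc2 + acc3 + acc4;
}
```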
main (skylake):
patch (skylake): Good speedup across the board, e.g. 2x for cosine
patch (skylake, -XX:-UseFMA): Still decent speedup even without using FMA, just not as much
Main (arm):
Patch (arm): Good speedups (2x for cosine). FMA is not used.