Sparse and AVX2 #172

Open
syzygy1 opened this issue Oct 30, 2020 · 33 comments

Comments

@syzygy1
Owner

syzygy1 commented Oct 30, 2020

On my AVX2 laptop, sparse multiplication now turns out to be slower than the non-sparse multiplication. I suspect that this is not the case on some other AVX2 CPUs, in particular Zen 1.

I have therefore added a compilation option.
To compile with sparse multiplication: make -j pgo sparse=yes
To compile without sparse multiplication: make -j pgo sparse=no

By default "sparse=yes" except for AVX2 targets (including BMI2, VNNI, AVX512).

If it is clear that "sparse=yes" is still faster on Zen 1 or on other CPUs with AVX2, I can make it the default on those CPUs. I cannot test this myself, so if anyone is willing to try sparse=yes/no on Zen 1 or other CPUs, that would be very welcome.

It would also be interesting to know if sparse=no is faster on any non-AVX2 CPUs.
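
For readers unfamiliar with the trade-off behind the sparse=yes/no option, here is a minimal Python sketch. It assumes that "sparse" means skipping input activations that are zero, and that "non-sparse" means multiplying every input regardless. The function names and the tiny matrix are hypothetical, and this is not Cfish's actual code, which is written in C with SIMD intrinsics.

```python
def dense_matvec(weights, x):
    """Non-sparse: multiply every input, zero or not (regular, SIMD-friendly)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def sparse_matvec(weights, x):
    """Sparse: find the non-zero inputs first, then only touch those columns."""
    nonzero = [i for i, xi in enumerate(x) if xi != 0]
    return [sum(row[i] * x[i] for i in nonzero) for row in weights]


# Hypothetical 2x4 weight matrix and an input where half the entries are zero.
weights = [[1, 2, 3, 4],
           [0, -1, 2, 1]]
x = [0, 5, 0, 2]

dense_result = dense_matvec(weights, x)
sparse_result = sparse_matvec(weights, x)
```

Both functions return the same vector. The sparse version does fewer multiply-adds but pays for the index scan and irregular memory access, which is one plausible reason why wider vector units could tip the balance toward the non-sparse path.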

@syzygy1
Owner Author

syzygy1 commented Oct 30, 2020

The number of search threads might also have an impact on which is faster...

@JavaMast

Screenshot_185

@JavaMast

311020_1 = Correctly display castling rights for Chess960.
311020_2 = Improve non-sparse multiplication.

@JavaMast

Screenshot_186

*Ryzen 3900X @3.8 GHz

@JavaMast

Screenshot_187

@syzygy1
Owner Author

syzygy1 commented Oct 31, 2020

Thanks, so sparse AVX2 is still clearly better on AMD. Were these all tested on Ryzen 3900X?

@JavaMast

Yes, all on my Ryzen 3900X.

I hope to get tests on other CPUs soon.

@JavaMast

Intel i5 760 (Nehalem), 2.95 GHz

bench 16 1 13 default depth NNUE
bench 16 1 13 default depth Pure

bench 16 3 13 default depth NNUE

@JavaMast

Athlon_x4_870K
Athlon_x4_870K

@syzygy1
Owner Author

syzygy1 commented Oct 31, 2020

Thanks again!

So on Nehalem, no_sparse is now better than sparse, which was the other way around before the improvement.
On my Sandybridge PC, no_sparse is improved, but sparse is still better.
So there is no clear Intel rule here.

The Athlon results have a pretty high variance, but seem to suggest sparse is better.

@JavaMast

Intel Core i5-7600K
Intel Core i5-7600K_NNUE
Intel Core i5-7600K_Pure

@JavaMast

JavaMast commented Nov 1, 2020

Intel 6800k

Intel 6800k 1
Intel 6800k 2
Intel 6800k 3
Intel 6800k 4
Intel 6800k 5
Intel 6800k 6

@JavaMast

JavaMast commented Nov 1, 2020

i7-7700HQ @2.80GHz

i7-7700HQ

@syzygy1
Owner Author

syzygy1 commented Nov 1, 2020

Thanks.
So sparse=no is now better on Intel AVX2.
For SSE2, sparse=yes is better. (I have now improved non-sparse for SSE2, but it still doesn't get close to sparse.)
For SSSE3/SSE41, there is no clear winner on Intel.

On AMD, sparse=yes seems better.

@JavaMast

JavaMast commented Nov 1, 2020

It looks like this.

I am very confused by the results on the Athlon 870K: more tests were carried out today and the variance has become even greater.

Athlon_x4_870K 2

Was tested with network nn-cb26f10b1fd9.nnue

@syzygy1
Owner Author

syzygy1 commented Nov 1, 2020

Maybe the CPU is overheating and then throttling down?

@AlexB123

AlexB123 commented Nov 2, 2020

> It looks like this.
>
> I am very confused by the results on Athlon 870K - today more tests were carried out and the variance has become even greater.
>
> Athlon_x4_870K 2
>
> Was tested with network nn-cb26f10b1fd9.nnue

Hello guys! The test above was run on my PC, as were the speed tests below. My brother recently made a small upgrade to my PC and didn't tell me that I now have Turbo Boost, so now I have to learn how to switch Turbo Boost off (lol). I've repeated the speed test with "Warm up CPU", and the speeds now look more or less correct.
Speed
Speed2
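
The warm-up and variance concerns in this part of the thread can be made concrete with a small helper that discards the first run(s) and reports the spread of the rest. This is only a sketch: the function names, the one-run warm-up default, and the nps figures are hypothetical and are not measurements from this thread.

```python
import statistics


def summarize(nps_runs, warmup=1):
    """Drop the first `warmup` runs, then return (mean, relative stdev)."""
    kept = nps_runs[warmup:]
    mean = statistics.mean(kept)
    rel_stdev = statistics.stdev(kept) / mean  # sample standard deviation
    return mean, rel_stdev


def gap_exceeds_noise(runs_a, runs_b, warmup=1):
    """Crude check: is the mean gap larger than the two spreads combined?"""
    mean_a, rel_a = summarize(runs_a, warmup)
    mean_b, rel_b = summarize(runs_b, warmup)
    noise = mean_a * rel_a + mean_b * rel_b
    return abs(mean_a - mean_b) > noise


# Hypothetical nodes-per-second figures; the first run of each is a cold start.
sparse_runs = [900_000, 1_000_000, 1_020_000, 980_000]
dense_runs = [950_000, 1_100_000, 1_110_000, 1_090_000]

mean_sparse, rel_sparse = summarize(sparse_runs)
differs = gap_exceeds_noise(sparse_runs, dense_runs)
```

If the relative spread stays large even after the warm-up run is dropped, thermal throttling or boost clocks are a likely suspect, and the comparison between builds should be treated with caution.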

@syzygy1
Owner Author

syzygy1 commented Nov 3, 2020

@AlexB123
Which CPU is that?
It seems non-sparse might be a little bit better with 1 thread (except for SSE2, which is expected) but loses to sparse with multiple threads.
Non-sparse probably uses a bit more power and therefore increases CPU temps more.

@JavaMast

JavaMast commented Nov 3, 2020

@syzygy1
This is the Athlon 870K.

@syzygy1
Owner Author

syzygy1 commented Nov 3, 2020

Ah, I see now.

@JavaMast

It looks like no_sparse is faster on new AMD CPUs.
AMD RYZEN 9 5950x
Screenshot 2020-11-16 12 40 12

==================
Hope to see BMI2 builds in speed test soon.

@JavaMast

JavaMast commented Nov 16, 2020

AMD RYZEN 9 5950x
Screenshot 2020-11-16 15 08 36

Screenshot 2020-11-16 15 19 21

@JavaMast

After the update "AVX512, AVX2 and SSSE3 speedups":
Ryzen 3900X

Screenshot_232

@syzygy1
Owner Author

syzygy1 commented Dec 17, 2020

What is the difference between SSSE3.exe and SSSE3_popcnt_mingw_10.exe ?

@syzygy1
Owner Author

syzygy1 commented Dec 17, 2020

I think the fact that no_sparse now beats sparse on Zen 3 shows that AMD has improved their AVX2 implementation in Zen 3.

@JavaMast

> What is the difference between SSSE3.exe and SSSE3_popcnt_mingw_10.exe ?

SSSE3 and SSSE3_sparse are 32-bit builds (compiled with MinGW i686-8.1.0-posix-dwarf-rt_v6-rev0).

@syzygy1
Owner Author

syzygy1 commented Dec 18, 2020

OK, so for 64-bit SSSE3 on Zen 2, sparse=yes is still faster than sparse=no.

But it seems sparse=no is now faster than sparse=yes for AVX2 on Zen 2. I thought sparse=yes was clearly faster before the AVX2 speedup. This suggests that sparse=no is now faster on all CPUs with AVX2.

@syzygy1
Owner Author

syzygy1 commented Dec 19, 2020

I just tested a Ryzen 4500U laptop and also found that sparse=yes was faster than sparse=no before the AVX2 speedup patch and is now slower.

@JavaMast

Hello!

On the Core i5-11400f, sparse=no is faster for all builds except SSE2.

AVX512_VNNI is the fastest.

Screenshot_350

@JavaMast

Just curious: on my i5 11400f, Cfish is faster in Pure mode:

Screenshot_369
Screenshot_371

This holds only for AVX2 builds and higher, not for SSE builds.
On the Ryzen 3900X, NNUE is still faster than Pure.

@syzygy1
Owner Author

syzygy1 commented Apr 24, 2021

Pure being faster is pretty nice. Is it also stronger?

@JavaMast

No, Hybrid is still stronger.

BMI2
10+0,1
concurrency 6

Score of Cfish_x64_120421_ELTO_BMI2 vs Cfish_x64_130421_ELTO_BMI2_Pure: 668 - 521 - 6564 [0.509]
... Cfish_x64_120421_ELTO_BMI2 playing White: 520 - 138 - 3219 [0.549] 3877
... Cfish_x64_120421_ELTO_BMI2 playing Black: 148 - 383 - 3345 [0.470] 3876
... White vs Black: 903 - 286 - 6564 [0.540] 7753
Elo difference: 6.6 +/- 3.0, LOS: 100.0 %, DrawRatio: 84.7 %
7758 of 20000 games finished.

AVX512_VNNI
10+0,1
concurrency 5

Score of Cfish_x64_120421_ELTO_AVX512___VNNI vs Cfish_x64_130421_ELTO_AVX512_VNNI_Pure: 527 - 507 - 6038 [0.501]
... Cfish_x64_120421_ELTO_AVX512___VNNI playing White: 406 - 119 - 3011 [0.541] 3536
... Cfish_x64_120421_ELTO_AVX512___VNNI playing Black: 121 - 388 - 3027 [0.462] 3536
... White vs Black: 794 - 240 - 6038 [0.539] 7072
Elo difference: 1.0 +/- 3.1, LOS: 73.3 %, DrawRatio: 85.4 %
7076 of 20000 games finished.
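
For anyone who wants to check the summary lines in these match reports, here is a sketch of how an Elo difference, an error margin and an LOS figure can be derived from a win/loss/draw record. It uses the standard logistic Elo formula and a normal approximation for the margin. Whether the match tool used here computes its numbers in exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, losses, draws, z=1.96):
    """Return (elo, margin), where margin is half of a ~95% interval."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0 or 0.5) around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + losses * (0.0 - score) ** 2
                + draws * (0.5 - score) ** 2) / games
    half_width = z * math.sqrt(variance / games)
    low = elo_from_score(score - half_width)
    high = elo_from_score(score + half_width)
    return elo_from_score(score), (high - low) / 2.0


def likelihood_of_superiority(wins, losses):
    """LOS from decisive games only (draws carry no information here)."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


# Win - loss - draw records copied from the two reports above.
bmi2_elo, bmi2_margin = elo_with_margin(668, 521, 6564)
vnni_elo, vnni_margin = elo_with_margin(527, 507, 6038)
vnni_los = likelihood_of_superiority(527, 507)
```

Under these assumptions the BMI2 record comes out clearly positive for Hybrid, while for the AVX512_VNNI record the margin is larger than the Elo difference itself, so that result does not separate the two builds.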

@JavaMast

@syzygy1
do you know how much faster Cfish is on old CPUs?
A friend of mine with a Phenom II X6 1100T (compatible with the SSE2 build) told me that Cfish is 2 times faster than Stockfish...
On my i5-11400f it is "only" 50% faster.
Screenshot_136

Even the x32 build is faster.

Screenshot_137
