When could you support AMD Zen4 arch? #770
Comments
Zen4 is already supported in AMD's fork of BLIS. We're in contact with AMD to coordinate how best to back-port these changes to BLIS master.
Hi. I've conducted some experiments using the scripts from https://github.com/flame/blis/blob/master/docs/Performance.md and AMD's fork of BLIS. I tested only GEMM and only in multithreaded mode, because the output of https://github.com/amd/blis/tree/master/test/3 is not compatible with that of https://github.com/flame/blis/tree/master/test/3 , but this test was enough for my initial needs. My setup:
Commands executed:
Comments: On an AMD Ryzen 9 7950X3D, the AMD fork of BLIS with Zen4 kernels significantly outperforms all the other libraries (by up to 2x). Vanilla BLIS is on par with OpenBLAS but slower than MKL. MKL shows a performance drop for some sizes, but it looks like a fluke (it disappears for larger sizes). For larger matrices (such as 6000×6000), gemm performance was the same for all 4 libraries (presumably due to a memory bottleneck on my system).
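For readers comparing numbers like these, here is a minimal sketch of how a GEMM timing is usually converted into GFLOPS, using the standard count of 2·M·N·K floating-point operations per GEMM call. The helper names (`gemm_flops`, `gflops`, `best_time`) are hypothetical and are not taken from the BLIS test drivers; the actual scripts in `test/3` may measure and report things differently.

```python
import time


def gemm_flops(m, n, k):
    """Approximate flop count of C := alpha*A*B + beta*C (one multiply and one add per inner step)."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Convert a measured wall-clock time for one GEMM call into GFLOPS."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, n, k) / seconds / 1e9


def best_time(fn, repeats=3):
    """Run fn() several times and return the shortest wall-clock time in seconds.

    Taking the minimum reduces noise from other processes; this is an
    assumption about methodology, not a description of the BLIS scripts.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

As a usage example, a 6000×6000×6000 GEMM that takes 2 seconds corresponds to `gflops(6000, 6000, 6000, 2.0)`, i.e. 216 GFLOPS. Comparing libraries at equal problem sizes with this metric is what makes a statement like "up to 2x" well defined.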
@AngryLoki Thank you for taking the time to gather, visualize, and share these performance results! Don't worry; a proper
PS: Please feel free to keep up with us in our Discord server, if you haven't already joined! 😄
@AngryLoki thank you for this information. I am curious, did you also test AMD/blis compiled with AOCC? I've been experimenting with it on my system (Gentoo AMD 7840U) and it's performing well on certain tasks.
@HaukurPall, I checked sgemm (M=N=K) with gcc 13.2.1 (+full LTO), clang 17.0.6, AOCC and rocm-llvm-alt. The results are almost the same. I checked the code of AOCC and unfortunately I don't see any specific optimizations... AMD just shipped a vanilla precompiled Clang and included some ROCm-related fixes (to make it work, not for optimization). They also added ROCm/llvm-project@0272bec - if you attempt to use
Regarding my previous tests, I checked my approach more carefully and found a few misses on my side:
BLIS is usually pretty insensitive to the compiler, since most of the work happens in the inline assembly kernels.
I consider this a good thing, since LLVM (and, to be fair, other compilers too) really makes a hash of C or intrinsics kernels due to a combination of poor register allocation and instruction ordering. Glad to see that AOCL-BLIS is performing well for you though. As we work with AMD to backport their changes, BLIS will catch up.
@AngryLoki thank you so much for this, this answers a lot of questions!