OpenCL codegen regression in ROCm 6.2.0 vs. 6.1.3 #151
Comments
How to repro:
Lines containing "EE" indicate a detected error. Example error run: (ROCm 6.2.0 or 6.2.1)
And correct run: (ROCm 6.1.3)
(the run was stopped with Ctrl-C from the terminal). To reset state between runs, simply remove the folder ./1257787 where progress is stored. The above "black box" reproducer could be used to git bisect between the two versions, good 669db88 and bad 26466ce, which is in fact what I would like to do, if only I knew how to build ROCm/llvm-project (could somebody point me to build instructions, please?). A sketch of the bisect loop follows below.
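For concreteness, a minimal sketch of that bisect loop, assuming the reproducer is wrapped in a hypothetical ./repro.sh script whose output contains "EE" on failure (the script name and the rebuild step are illustrative, not part of GpuOwl):

```sh
# In a ROCm/llvm-project checkout: 669db88 is the known-good (roc-6.1.3)
# commit, 26466ce the known-bad (roc-6.2.0) one.
git bisect start
git bisect bad 26466ce
git bisect good 669db88

# At each bisect step: rebuild the compiler stack, reset GpuOwl's saved
# state, rerun the reproducer, and report the verdict back to git.
rm -rf ./1257787                      # discard stored progress between runs
if ./repro.sh 2>&1 | grep -q "EE"; then
  git bisect bad
else
  git bisect good
fi
```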
I attach the ISA produced for the tailMul kernel by ROCm 6.1.3 and 6.2.0. The tailMul kernel is quite similar to the tailSquare kernel, which is compiled correctly by both ROCm 6.1.3 and 6.2.0; this can be used to rule out some suspected programmer errors, which, if present, should affect both tailMul and tailSquare in the 6.2.0 compilation. These dumps were produced from this GpuOwl commit: https://github.com/preda/gpuowl/tree/6121cc7d5eff87e4d21faac287c1d52122962d59
I have also received reports of the same issue happening on MI300, so it is not specific to the Radeon Pro VII either.
I attempted a git bisect. I understand that I need to build three things, in order: llvm, device-libs, comgr. The bisect is made really difficult because it is hard to pin down one consistent point across the three projects, given that the latter two (device-libs, comgr) were only recently integrated into the same git repository as ROCm llvm. So it is really hard to even check out a set of the three components that compile together. This is what I have thus far:
And at this point I'm stuck: I can't build df06594. The build sequence I attempted is sketched below.
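For reference, a rough sketch of the three-stage build, assuming a post-merge checkout of ROCm/llvm-project where device-libs and comgr live under amd/ (the install prefix is illustrative; on older commits the latter two are separate repositories and the paths differ):

```sh
# 1) Build clang/lld with the AMDGPU backend enabled.
cd llvm-project
cmake -S llvm -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX="$HOME/rocm-bisect" \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_TARGETS_TO_BUILD="AMDGPU;X86"
ninja -C build install

# 2) Build the device libraries against that freshly built clang.
cmake -S amd/device-libs -B build-devlibs -G Ninja \
  -DCMAKE_PREFIX_PATH="$HOME/rocm-bisect" \
  -DCMAKE_INSTALL_PREFIX="$HOME/rocm-bisect"
ninja -C build-devlibs install

# 3) Build comgr against both the compiler and the device libraries.
cmake -S amd/comgr -B build-comgr -G Ninja \
  -DCMAKE_PREFIX_PATH="$HOME/rocm-bisect" \
  -DCMAKE_INSTALL_PREFIX="$HOME/rocm-bisect"
ninja -C build-comgr install
```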
The puzzle has been cleared! I found the problem. I had declared in my own OpenCL source a function called mad() that was computing a complex multiply-add. At some point between ROCm 6.1.3 and ROCm 6.2.0, it appears a function with the same name and signature was added and made available "implicitly" in OpenCL source code. Not only that, but it was made available with higher priority than my own function with the same signature, all without so much as a warning at OpenCL compilation time. Basically, it changed the behavior of my own complex mad() behind my back and without warning! This is the fix in my own code: stop using the name mad (an illustrative sketch follows below). I think this bug should be moved to the appropriate sub-project (amd/device-libs) and tracked there.
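To make the collision concrete, here is an illustrative OpenCL sketch (not the actual GpuOwl diff; the replacement name cmad is made up). Note that the standard mad(a, b, c) builtin computes a component-wise a * b + c, which silently differs from the complex semantics below:

```c
// User-defined complex multiply-add, with double2 holding (re, im).
// Under ROCm 6.2.0 an implicitly declared mad() with this exact
// signature won out over this definition, without any warning,
// replacing complex semantics with component-wise a * b + c.
double2 mad(double2 a, double2 b, double2 c) {
  return (double2) (a.x * b.x - a.y * b.y + c.x,
                    a.x * b.y + a.y * b.x + c.y);
}

// The fix: rename to something that cannot collide with a builtin.
double2 cmad(double2 a, double2 b, double2 c) {
  return (double2) (a.x * b.x - a.y * b.y + c.x,
                    a.x * b.y + a.y * b.x + c.y);
}
```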
Extracted as a cleaned-up issue report at #152. Closing here.
Using a Radeon Pro VII GPU on Linux kernel 6.8.12, with the GpuOwl OpenCL project:
https://github.com/preda/gpuowl
https://github.com/preda/gpuowl/releases/tag/v%2Fprpll%2F0.13
I see what looks like a codegen regression manifesting for the first time in ROCm 6.2.0.
Specifically, compiling with [any version up to] ROCm 6.1.3 works fine:
"AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.3 24193 669db88)"
Compiling with ROCm 6.2.0 and 6.2.1 is broken:
"AMD clang version 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.2.0 24292 26466ce)"
"AMD clang version 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.2.1 24332 8fc3143)"
The regression is deterministic: it manifests itself every time, in exactly the same way, 100% reproducible.
The regression seems rather tricky to trigger: among the many OpenCL kernels compiled by the GpuOwl application, only one is affected (tailMul, https://github.com/preda/gpuowl/blob/master/src/cl/tailmul.cl), and it is a rather complex kernel.
For a while now I have been looking for programming errors (i.e. "my-side issues") in this kernel, but I could not find a "programmer error" explanation for the regression. I am still investigating and trying to get more info. At this point, though, the evidence points towards a codegen regression, which is why I am opening this initial brief issue report; I plan to add more info as I get it.
Here are some bits of info:
The affected kernel, https://github.com/preda/gpuowl/blob/master/src/cl/tailmul.cl, works correctly in ROCm 6.1.3. There is a similar kernel, https://github.com/preda/gpuowl/blob/master/src/cl/tailsquare.cl, which works correctly in 6.1.3, 6.2.0, and 6.2.1. There is no variation in the behavior of the "broken" kernel, which seems to indicate the problem is not about reading uninitialized data, LDS races, global access races, or other non-deterministic behavior.
I plan to add more info here, in particular a detailed reproducer and ISA dumps for the good/bad kernel compilations (one way to extract such dumps is sketched below).
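One way to obtain those dumps, sketched under the assumption that ROCm's CL_PROGRAM_BINARIES returns an ELF code object that llvm-objdump can disassemble (the helper name and output path are made up):

```c
/* Dump the compiled binary of a cl_program for offline disassembly,
 * e.g. with: llvm-objdump -d --mcpu=gfx906 tailmul.bin
 * Assumes the program was built for a single device. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

void dumpBinary(cl_program prog, const char *path) {
  size_t size = 0;
  clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);

  unsigned char *bin = malloc(size);
  unsigned char *bins[1] = { bin };
  clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);

  FILE *f = fopen(path, "wb");
  fwrite(bin, 1, size, f);
  fclose(f);
  free(bin);
}
```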