
OpenCL codegen regression in ROCm 6.2.0 vs. 6.1.3 #151

Closed
preda opened this issue Aug 27, 2024 · 6 comments

preda commented Aug 27, 2024

Using a Radeon Pro VII GPU on Linux kernel 6.8.12.

Using the GpuOwl OpenCL project
https://github.com/preda/gpuowl
https://github.com/preda/gpuowl/releases/tag/v%2Fprpll%2F0.13

I see what looks like a codegen regression manifesting itself for the first time in ROCm 6.2.0.

Specifically, compiling with any version up to ROCm 6.1.3 works fine:
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.3 24193 669db88)

Compiling with ROCm 6.2.0 and 6.2.1 is broken:
AMD clang version 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.2.0 24292 26466ce)
AMD clang version 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.2.1 24332 8fc3143)

The regression is deterministic: it manifests itself every time, in exactly the same way, and is 100% reproducible.

The regression seems rather tricky to trigger: among the many OpenCL kernels compiled by the GpuOwl application, only one is affected (tailMul, https://github.com/preda/gpuowl/blob/master/src/cl/tailmul.cl), and it is a rather complex kernel.

For a while now I've been looking for programming errors (i.e. "my-side issues") in this kernel, but I couldn't find a programmer-error explanation for the regression. I'm still investigating and trying to get more information, but at this point the evidence seems to point towards a codegen regression, which is why I'm opening this initial, brief issue report; I plan to add more info as I get it.

Here are some bits of info:
The affected kernel https://github.com/preda/gpuowl/blob/master/src/cl/tailmul.cl
works correctly with ROCm 6.1.3. There is a similar kernel https://github.com/preda/gpuowl/blob/master/src/cl/tailsquare.cl
which works correctly with 6.1.3, 6.2.0, and 6.2.1. There is no variation in the behavior of the broken kernel, which suggests the problem is not about reading uninitialized data, LDS races, global-access races, or other non-deterministic behavior.

I plan to add more info here, in particular detailed reproducer and ISA dumps for the good/bad kernel compilations.


preda commented Aug 27, 2024

How to repro:

  1. check out the GpuOwl source code: https://github.com/preda/gpuowl
  2. build by running the ./m.sh script in the source folder (or using make DEBUG=1)
  3. run ./prpll-debug -prp 1257787

Lines marked "EE" indicate a detected error. Example error run (ROCm 6.2.0 or 6.2.1):

~/gpuowl-master$ ./prpll-debug -prp 1257787
20240827 16:52:09  PRPLL 0.13-5-gf78800b starting
20240827 16:52:09  config: -prp 1257787 
20240827 16:52:09  device 0, OpenCL 3625.0 (HSA1.1,LC), unique id '6fac68e1732c7315'
20240827 16:52:09 1257787 config: 
20240827 16:52:09 1257787 FFT: 256K 256:2:256:2:0 (4.80 bpw)
20240827 16:52:09 1257787 Using long carry!
20240827 16:52:13 1257787 OK         0 on-load: blockSize 1000, 0000000000000003
20240827 16:52:13 1257787 Proof of power 7 requires about 0.0GB of disk space
20240827 16:52:16 1257787 EE      2000 46c7ab6803e1a365 1181 ETA 00:25; Z=22818791611 (avg 22818791610.6) 1 errors
20240827 16:52:18 1257787 OK         0 on-load: blockSize 1000, 0000000000000003
20240827 16:52:18 1257787 Proof of power 7 requires about 0.0GB of disk space
20240827 16:52:21 1257787 EE      2000 46c7ab6803e1a365 1266 ETA 00:26; Z=5706127328 (avg 14362102830.7) 2 errors
20240827 16:52:21 1257787 Consistent error 46c7ab6803e1a365, will stop.
20240827 16:52:21  Exception "consistent error"
20240827 16:52:21  Bye

And a correct run (ROCm 6.1.3):

~/gpuowl-master$ ./prpll-debug -prp 1257787
20240827 16:59:11  PRPLL 0.13-5-gf78800b starting
20240827 16:59:11  config: -prp 1257787 
20240827 16:59:11  device 0, OpenCL 3614.0 (HSA1.1,LC), unique id '6fac68e1732c7315'
20240827 16:59:12 1257787 config: 
20240827 16:59:12 1257787 FFT: 256K 256:2:256:2:0 (4.80 bpw)
20240827 16:59:12 1257787 Using long carry!
20240827 16:59:15 1257787 OK         0 on-load: blockSize 1000, 0000000000000003
20240827 16:59:15 1257787 Proof of power 7 requires about 0.0GB of disk space
20240827 16:59:19 1257787 OK      2000 46c7ab6803e1a365 1356 ETA 00:28; Z=26585155752 (avg 26585155751.5)
^C20240827 16:59:22 1257787 Stopping, please wait..
20240827 16:59:23 1257787 OK      5000 cbcf6fb08d826fd6 1124 ETA 00:23; Z=27512112379 (avg 26739777374.1)
20240827 16:59:23  Exception "stop requested"
20240827 16:59:23  Bye

(the run was stopped with Ctrl-C from the terminal)

To reset state between runs, simply remove the folder ./1257787 where progress is stored.
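
Put together, a full repro cycle looks roughly like this (a sketch assuming a bash shell; the build commands are the ones from step 2 above):

  # fetch and build GpuOwl (adjust to your environment)
  git clone https://github.com/preda/gpuowl
  cd gpuowl
  ./m.sh                         # or: make DEBUG=1

  # reset any saved state, then run the reproducer
  rm -rf ./1257787
  ./prpll-debug -prp 1257787     # lines marked "EE" indicate the regression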

The above "black box" reproducer could be used to git bisect between the two versions, good 669db88 and bad 26466ce, which is in fact what I would like to do, if only I knew how to build ROCm/llvm-project (could somebody point me to build instructions, please?).


preda commented Aug 27, 2024

I attach the ISA produced for the tailMul kernel by ROCm 6.1.3 and 6.2.0:
tailmul-bad.txt
tailmul-good.txt

The tailMul kernel is quite similar to the tailSquare kernel, which is compiled correctly by both ROCm 6.1.3 and 6.2.0. This can be used to rule out some suspected programmer errors, which, if present, should affect both tailMul and tailSquare in the 6.2.0 compilation.
tailsquare-6.1.3.txt
tailsquare-6.2.0.txt

These dumps were produced based on this GpuOwl commit: https://github.com/preda/gpuowl/tree/6121cc7d5eff87e4d21faac287c1d52122962d59
(the most recent commit as of now)


preda commented Aug 27, 2024

I have also received reports of the same issue happening on MI300, so it's not Radeon Pro VII specific either.


preda commented Aug 29, 2024

I attempted a git bisect.

I understand that I need to build three things, in order: llvm, device-libs, comgr. The bisect is made really difficult by how hard it is to get a consistent point across the three projects, given that the latter two (device-libs, comgr) have only recently been integrated into the same git repository as the ROCm llvm. So basically it's hard even to check out a set of the three components that compile together.
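
For the llvm component itself, I'm assuming the standard upstream LLVM CMake flow also applies to the ROCm fork; a rough sketch (not verified against the official ROCm packaging, and it does not cover device-libs or comgr):

  git clone https://github.com/ROCm/llvm-project
  cd llvm-project && mkdir build && cd build
  cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
        -DLLVM_ENABLE_PROJECTS="clang;lld" \
        -DLLVM_TARGETS_TO_BUILD="AMDGPU;X86" \
        ../llvm
  ninja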

This is what I have thus far:

# bad: [26466ce804ac523b398608f17388eb6d605a3f09] [SLP]Improve/fix extracts calculations for non-power-of-2 elements.
# good: [669db884972e769450470020c06a6f132a8a065b] Revert "[Comgr] Add -relink-builtin-bitcode-postop to device library linking"
git bisect start '26466ce804ac523b398608f17388eb6d605a3f09' '669db884972e769450470020c06a6f132a8a065b'
# good: [59cf9c8632752d974b1f8f072666001d919a97a3] AMDGPU: Set max supported div/rem size to 64 (#80669)
git bisect good 59cf9c8632752d974b1f8f072666001d919a97a3
# bad: [b238bd62d3b4cd4c50c15b06ee9274b56733556c] merge main into amd-stg-open

And at this point I'm stuck: I can't build df06594.


preda commented Aug 30, 2024

The puzzle has been solved! I found the problem.

In my own OpenCL source I had declared a function
double2 mad(double2 a, double2 b, double2 c);
that computes a complex multiply-add.

At some point between ROCm 6.1.3 and ROCm 6.2.0, it appears a function with the same name and signature was added and made available "implicitly" to OpenCL source code. Not only that, it is resolved with higher priority than my own function with the same signature, all without so much as a warning at OpenCL compilation. Basically, it changed the behavior of my own complex mad() behind my back and without warning!
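
Roughly, the clash looks like this (a simplified sketch for illustration, not the exact GpuOwl code):

  // My overload: double2 treated as a complex number (x = re, y = im),
  // so mad(a, b, c) is meant to compute a*b + c with complex semantics.
  double2 mad(double2 a, double2 b, double2 c) {
    return (double2) (a.x * b.x - a.y * b.y + c.x,
                      a.x * b.y + a.y * b.x + c.y);
  }

  // In ROCm 6.2.0 an implicit built-in mad(double2, double2, double2),
  // doing a component-wise a * b + c, appears to take precedence and
  // silently changes the semantics of every call site -- no warning.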

This is the fix in my own code:
preda/gpuowl@af8db20#diff-faa0ddf7b0583d204176e0df48df07a9bf060d06ed803d666c03cdb15bbf550d
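
The shape of the fix is simply a rename of my own function so it no longer collides with the implicit built-in; illustratively (the name cmadd is hypothetical here, the real rename is in the commit above):

  // renamed user function (body unchanged); call sites updated accordingly
  double2 cmadd(double2 a, double2 b, double2 c);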

I think this bug should be moved to the appropriate sub-project (amd/device-libs) and tracked there.


preda commented Aug 30, 2024

Extracted as a cleaned-up issue report at #152. Closing here.

preda closed this as completed Aug 30, 2024