This repository has been archived by the owner on Jan 26, 2024. It is now read-only.

clinfo hangs on configurations with two AMD GPU and open source rocm #148

Open
NTMan opened this issue Jul 21, 2022 · 4 comments
Comments

@NTMan

NTMan commented Jul 21, 2022

clinfo hangs in a loop that fully occupies one processor core. I observed the same symptoms when launching "DaVinci Resolve". On a desktop with a single Radeon 6900XT GPU, this problem does not occur.

My configuration:
One GPU is integrated into the RENOIR processor, and the other is a discrete AMD Radeon 6800M (the laptop is an ASUS G513QY).
The BIOS has no option to turn off the integrated GPU, so I cannot check this configuration with each GPU separately.
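
One possible way to exercise each GPU on its own without a BIOS switch is the `ROCR_VISIBLE_DEVICES` environment variable, which the ROCm runtime reads to restrict which devices a process sees. The sketch below is an illustration under assumptions: the helper name `run_with_visible_device`, the device indices 0 and 1, and the 30-second timeout are hypothetical, and it has not been verified that this Fedora build of ROCm 5.2 honors the variable.

```python
import os
import subprocess


def run_with_visible_device(cmd, device_index, timeout_s=30.0):
    """Run cmd with ROCR_VISIBLE_DEVICES set and report whether it timed out.

    Assumption: the ROCm runtime in use honors ROCR_VISIBLE_DEVICES.
    """
    env = dict(os.environ)
    env["ROCR_VISIBLE_DEVICES"] = str(device_index)
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a busy loop
        # like the one described above is reported as a hang.
        return {"device": device_index, "hung": True, "returncode": None}
    return {"device": device_index, "hung": False, "returncode": proc.returncode}


# Hypothetical usage on the affected laptop (indices are assumptions):
#   for idx in (0, 1):
#       print(run_with_visible_device(["clinfo"], idx))
```

If the hang appears with only one index, that would point at a single device; if it appears only when both are visible, the multi-GPU enumeration path is the more likely suspect. Note that a process stuck inside the kernel may not exit promptly even after being killed, so the call itself could block longer than the timeout.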

The kernel log shows no errors, so this is most likely a user-space issue, but I am not sure about that.

However, when I forcibly terminate clinfo (pressing <Ctrl + C> until the terminal accepts input again), the following messages appear in the kernel log:
[ 1962.000909] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1962.000912] amdgpu: Failed to evict process queues
[ 1962.000918] amdgpu: Failed to quiesce KFD
[ 1966.010395] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1966.010406] amdgpu: Resetting wave fronts (cpsch) on dev 00000000b40e7982
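
To compare runs, the amdgpu messages above can be pulled out of a saved kernel log and the preemption timeouts counted. The following is a minimal sketch, assuming the log uses the `[ seconds.micros] subsystem: message` layout shown above; the names `LOG_LINE`, `parse_amdgpu_lines`, and `preemption_timeouts` are hypothetical, and `SAMPLE` simply repeats the lines from this report.

```python
import re

# Matches lines such as "[ 1962.000909] amdgpu: Failed to quiesce KFD".
LOG_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<subsystem>\w+):\s+(?P<msg>.*)$"
)

SAMPLE = """\
[ 1962.000909] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1962.000912] amdgpu: Failed to evict process queues
[ 1962.000918] amdgpu: Failed to quiesce KFD
[ 1966.010395] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1966.010406] amdgpu: Resetting wave fronts (cpsch) on dev 00000000b40e7982
"""


def parse_amdgpu_lines(text):
    """Return (timestamp, message) tuples for lines logged by amdgpu."""
    events = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("subsystem") == "amdgpu":
            events.append((float(match.group("ts")), match.group("msg")))
    return events


def preemption_timeouts(events):
    """Return the timestamps of queue preemption timeout messages."""
    return [ts for ts, msg in events if "Queue preemption time out" in msg]
```

Calling `preemption_timeouts(parse_amdgpu_lines(SAMPLE))` gives the two timestamps at which the timeout was logged, roughly four seconds apart in this report.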

I am using the open source ROCm stack from the rocm-opencl package [1], which has passed review and has already been pushed to the official Fedora repository [2].

The clinfo output ends with the line:
Max work group size (AMD) 1024
The full clinfo output is available here [3].
The clinfo backtrace is available here [4].

The clinfo developer says that the problem lies deeper, in ROCm or the kernel [5].

Versions:

# rpm -qa | grep clinfo
clinfo-3.0.21.02.21-3.fc36.x86_64

# rpm -qa | grep rocm
rocm-comgr-5.2.0-1.fc37.x86_64
hsakmt-1.0.6-23.rocm5.2.0.fc37.x86_64
rocm-runtime-5.2.0-1.fc37.x86_64
rocm-opencl-5.2.0-1.fc37.x86_64

[1] https://copr.fedorainfracloud.org/coprs/mystro256/rocm-opencl/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2090823
[3] https://pastebin.com/TR5zy30Z
[4] https://pastebin.com/wv5iGibi
[5] Oblomov/clinfo#81

@b-sumner

Does the same happen with /opt/rocm/bin/clinfo?

@NTMan
Author

NTMan commented Jul 21, 2022

Does the same happen with /opt/rocm/bin/clinfo?

Excuse me, but why should clinfo be located in /opt/rocm/bin?

$ whereis clinfo 
clinfo: /usr/bin/clinfo /usr/share/man/man1/clinfo.1.gz

$ locate clinfo
/home/mikhail/clinfo-backtrace.txt
/usr/bin/clinfo
/usr/lib/debug/usr/bin/clinfo-3.0.21.02.21-3.fc36.x86_64.debug
/usr/share/doc/clinfo
/usr/share/doc/clinfo/README.md
/usr/share/licenses/clinfo
/usr/share/licenses/clinfo/LICENSE
/usr/share/licenses/clinfo/legalcode.txt
/usr/share/man/man1/clinfo.1.gz
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/clinfo.c
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/error.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/ext.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/info_loc.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/info_ret.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/opt_out.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/strbuf.h
/usr/src/debug/rocm-opencl-5.2.0-1.fc37.x86_64/redhat-linux-build/tools/clinfo
/usr/src/debug/rocm-opencl-5.2.0-1.fc37.x86_64/tools/clinfo
/usr/src/debug/rocm-opencl-5.2.0-1.fc37.x86_64/tools/clinfo/clinfo.cpp

@b-sumner

One reason is that AMD wrote its own clinfo back in the days of OpenCL 1.0, long before any other implementations appeared on github and were picked up by the distros, and has maintained it since.
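
Since two different clinfo implementations may be present, it can help to list every candidate binary before comparing their behavior. This is a rough sketch only: the helper name `find_clinfo_binaries` is hypothetical, and `/opt/rocm/bin/clinfo` is just the path mentioned in this thread, which a distribution package may not install.

```python
import os
import shutil


def find_clinfo_binaries(extra_paths=("/opt/rocm/bin/clinfo",), name="clinfo"):
    """Return executable paths for `name`: the one on PATH plus extra candidates."""
    found = []
    on_path = shutil.which(name)
    if on_path:
        found.append(os.path.realpath(on_path))
    for path in extra_paths:
        real = os.path.realpath(path)
        # Keep only existing, executable files that are not already listed.
        if os.path.isfile(real) and os.access(real, os.X_OK) and real not in found:
            found.append(real)
    return found
```

An empty or single-entry result would suggest that only one implementation is installed, which matches the `whereis` and `locate` output shown in this thread.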

@NTMan
Author

NTMan commented Jul 21, 2022

One reason is that AMD wrote its own clinfo back in the days of OpenCL 1.0, long before any other implementations appeared on github and were picked up by the distros, and has maintained it since.

"DaVinci Resolve" shows the same symptoms (it looks like an infinite loop that consumes 100% CPU).
