Sudden machine hang / GUARD event when using amdgpu under Linux 5.1.16 #180

Open
madscientist159 opened this issue Jul 7, 2019 · 5 comments

madscientist159 commented Jul 7, 2019

When using QEMU (kernel 5.1, QEMU 4 git master) with VFIO PCIe passthrough of an AMD GPU (amdgpu driver), the test machine hard locks, and on each subsequent reboot GUARD entries are created for various slices (e.g. EQ0, EQ1), gradually removing all functional slices from the system:

 10.82404|ISTEP  6. 9 - host_gard
 12.42766|================================================
 12.44729|Error reported by prdf (0xE500) PLID 0x9000009E
 12.44730|  PRD Signature            : 0x60000 0xC6D10010
 12.47015|  Signature Description    : pu.ex:k0:n0:s0:p00:c0 (L2FIR[16]) Cache line inhibited hit cacheable space
 12.49825|  UserData1   : 0x0006000000000101
 12.49826|  UserData2   : 0xc6d1001000000000
 12.49827|------------------------------------------------
 12.51947|  Callout type             : Hardware Callout
 12.51948|  CPU id                   : 4
 12.51950|  Target                   : Physical:/Sys0/Node0/Proc0/EQ0/EX0
 12.51951|  Deconfig State           : NO_DECONFIG
 12.51952|  GARD Error Type          : GARD_Fatal
 12.51952|  Priority                 : SRCI_PRIORITY_MED
 12.51953|------------------------------------------------
 12.51954|
 12.51954|------------------------------------------------
 12.51955|  System checkstop occurred during runtime on previous boot
 12.51956|------------------------------------------------
 12.51956|
 12.51957|------------------------------------------------
 12.51958|  Hostboot Build ID: hostboot-3beba24/hbicore.bin
 12.51959|================================================

The PCIe device being passed through is on Proc1, not Proc0, and the QEMU guest kernel panics if QEMU is not pinned to Proc1.
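
For anyone trying to reproduce the pinning mentioned above, here is a minimal sketch of restricting a process to the CPUs of one NUMA node through the standard Linux sysfs `cpulist` file. The helper names `parse_cpulist` and `pin_to_node` are hypothetical, and which node number corresponds to Proc1 is an assumption to verify on the machine (for example with `numactl --hardware`).

```python
import os


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as "0-3,8" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list yields an empty set
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(pid, node):
    """Restrict pid (0 means the calling process) to the CPUs of one NUMA node.

    The node id is an assumption supplied by the caller; it is not taken
    from the report above.
    """
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(pid, cpus)  # Linux-only call
    return cpus
```

Calling `pin_to_node(qemu_pid, node)` on an already running QEMU process only changes the affinity of that one thread; setting the affinity before launching QEMU (or using `numactl`/`taskset`) lets every thread inherit it.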

Is this likely to indicate a defective CPU or a hostboot / kernel bug?

EDIT: Apparently it's related to just the amdgpu driver on 5.1.16 -- still, hostboot shouldn't GUARD out slices based on a kernel bug like this. It makes recovery a lot harder and gives a false impression of a failing CPU.

madscientist159 changed the title from "Sudden machine hang / GUARD event when QEMU vfio used" to "Sudden machine hang / GUARD event when using amdgpu under Linux 5.1.16" on Jul 7, 2019
@shawnanastasio

I've encountered this issue using an AMD GPU connected directly to a P9 host (no QEMU/VFIO) running kernel 5.1.15. Playing videos occasionally hard resets the system and GUARDs out cores.

The issue does not seem to be present with kernels 4.19.52 and 5.0.9 (the two other versions I happen to have on my system).

@dcrowell77
Collaborator

FYI - This is being discussed by our RAS team this week. Basically, the reason for the guard action is that many decisions were made assuming a solid hypervisor rather than a development kernel. Not all of the error paths have been scrubbed to account for a buggy kernel rather than a hardware failure.

q66 commented Jul 14, 2019

I was able to get the machine to hang, apparently because of the same amdgpu bug, on 5.1.17 with a Radeon Pro WX 5100. It happened randomly after about 2 days of uptime; I didn't need to do anything like play video or use QEMU. It didn't seem to GUARD out anything, though. I have a Talos 2 Lite with a single 18-core CPU and the latest stable firmware package, 1.06. This is the serial console output after I attempt a restart: https://gist.github.com/q66/e26e10e3cd12ffc991d627351ce9d6a1

@madscientist159
Author

@q66 @shawnanastasio is there an upstream kernel bug report tracking this issue yet?

q66 commented Jul 15, 2019
