AMD Radeon Instinct MI50 in a KVM/QEMU fails with "atombios stuck in loop", Fatal error during GPU init #157

Open
JustGitting opened this issue Jan 4, 2024 · 33 comments

@JustGitting

JustGitting commented Jan 4, 2024

Hi everyone, happy new year!

I'm stumped trying to get an AMD MI50 GPU to work in a KVM/QEMU virtual environment.

Problem

On bare metal, both Ubuntu 22.04 and Alpine 3.19 detect and initialise the AMD Radeon MI50, provided the proprietary linux-firmware package is installed. Inside a KVM/QEMU guest, however, amdgpu hits a fatal error during GPU init.

Hardware

Dell R720 server
Running latest firmware: 2.9.0
CPU 1 and 2: E5-2670 v2
2 x 1100W PSUs
GPU installed in x16 PCI Riser 2.

I tried to disable the integrated graphics option in the R720's BIOS, but it is greyed out,
presumably because Dell only officially supported a small list of GPUs on the R720.

Supported GPU cards listed on Pages 36-37 of Technical Guide https://downloads.dell.com/manuals/all-products/esuprt_ser_stor_net/esuprt_poweredge/poweredge-r720_reference-guide_en-us.pdf

Below are the results with Ubuntu and Alpine running on bare metal.

Ubuntu 22.04 (Live OS)
Kernel: 6.2.0-26-generic
linux-firmware: 20220329

Default settings used.

 ACPI: bus type drm_connector registered
 [drm] Initialized mgag200 1.0.0 20110418 for 0000:0c:00.0 on minor 0
 fbcon: mgag200drmfb (fb0) is primary device
 mgag200 0000:0c:00.0: [drm] fb0: mgag200drmfb frame buffer device
 [drm] amdgpu kernel modesetting enabled.
 amdgpu: CRAT table not found
 amdgpu: Virtual CRAT table created for CPU
 amdgpu: Topology: Add CPU node
 [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x02).
 [drm] register mmio base: 0xD4F80000
 [drm] register mmio size: 524288
 [drm] add ip block number 0 <soc15_common>
 [drm] add ip block number 1 <gmc_v9_0>
 [drm] add ip block number 2 <vega20_ih>
 [drm] add ip block number 3 <psp>
 [drm] add ip block number 4 <powerplay>
 [drm] add ip block number 5 <dm>
 [drm] add ip block number 6 <gfx_v9_0>
 [drm] add ip block number 7 <sdma_v4_0>
 [drm] add ip block number 8 <uvd_v7_0>
 [drm] add ip block number 9 <vce_v4_0>
 amdgpu 0000:44:00.0: amdgpu: Fetched VBIOS from ROM BAR
 amdgpu: ATOM BIOS: 113-D1631700-111
 [drm] UVD(0) is enabled in VM mode
 [drm] UVD(1) is enabled in VM mode
 [drm] UVD(0) ENC is enabled in VM mode
 [drm] UVD(1) ENC is enabled in VM mode
 [drm] VCE enabled in VM mode
 amdgpu 0000:44:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
 [drm] GPU posting now...
 amdgpu 0000:44:00.0: amdgpu: MEM ECC is active.
 amdgpu 0000:44:00.0: amdgpu: SRAM ECC is active.
 amdgpu 0000:44:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
 [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
 amdgpu 0000:44:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
 amdgpu 0000:44:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
 amdgpu 0000:44:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
 [drm] Detected VRAM RAM=32752M, BAR=32768M
 [drm] RAM width 4096bits HBM
 [drm] amdgpu: 32752M of VRAM memory ready
 [drm] amdgpu: 257959M of GTT memory ready.
 [drm] GART: num cpu pages 131072, num gpu pages 131072
 [drm] PCIE GART of 512M enabled.
 [drm] PTB located at 0x00000087FEF00000
 amdgpu 0000:44:00.0: amdgpu: PSP runtime database doesn't exist
 amdgpu 0000:44:00.0: amdgpu: PSP runtime database doesn't exist
 amdgpu: hwmgr_sw_init smu backed is vega20_smu
 [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
 [drm] PSP loading UVD firmware
 [drm] Found VCE firmware Version: 57.6 Binary ID: 4
 [drm] PSP loading VCE firmware
 [drm] reserve 0x400000 from 0x87fe000000 for PSP TMR
 amdgpu 0000:44:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: DTM: optional dtm ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: RAP: optional rap ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
 [drm] Display Core initialized with v3.2.215!
 [drm] kiq ring mec 2 pipe 1 q 0
 [drm] UVD and UVD ENC initialized successfully.
 [drm] VCE initialized successfully.
 kfd kfd: amdgpu: Allocated 3969056 bytes on gart
 amdgpu: sdma_bitmap: ffff
 amdgpu: HMM registered 32752MB device memory
 amdgpu: Virtual CRAT table created for GPU
 amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
 kfd kfd: amdgpu: added device 1002:66a1
 amdgpu 0000:44:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 60
 amdgpu 0000:44:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 13 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 1
 amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
 amdgpu: Detected AMDGPU 2 Perf Events.
 [drm] Initialized amdgpu 3.49.0 20150101 for 0000:44:00.0 on minor 1
 systemd[1]: Starting Load Kernel Module drm...
 systemd[1]: modprobe@drm.service: Deactivated successfully.
 systemd[1]: Finished Load Kernel Module drm.

Alpine 3.19 (installed)
Kernel: 6.6.9
linux-firmware: 20231111

 ACPI: bus type drm_connector registered
 [drm] amdgpu kernel modesetting enabled.
 amdgpu: Virtual CRAT table created for CPU
 amdgpu: Topology: Add CPU node
 [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x02).
 [drm] register mmio base: 0xD4F80000
 [drm] register mmio size: 524288
 [drm] add ip block number 0 <soc15_common>
 [drm] add ip block number 1 <gmc_v9_0>
 [drm] add ip block number 2 <vega20_ih>
 [drm] add ip block number 3 <psp>
 [drm] add ip block number 4 <powerplay>
 [drm] add ip block number 5 <dm>
 [drm] add ip block number 6 <gfx_v9_0>
 [drm] add ip block number 7 <sdma_v4_0>
 [drm] add ip block number 8 <uvd_v7_0>
 [drm] add ip block number 9 <vce_v4_0>
 amdgpu 0000:44:00.0: amdgpu: Fetched VBIOS from ROM BAR
 amdgpu: ATOM BIOS: 113-D1631700-111
 [drm] UVD(0) is enabled in VM mode
 [drm] UVD(1) is enabled in VM mode
 [drm] UVD(0) ENC is enabled in VM mode
 [drm] UVD(1) ENC is enabled in VM mode
 [drm] VCE enabled in VM mode
 amdgpu 0000:44:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
 [drm] GPU posting now...
 amdgpu 0000:44:00.0: amdgpu: MEM ECC is active.
 amdgpu 0000:44:00.0: amdgpu: SRAM ECC is active.
 amdgpu 0000:44:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7f7f] ras_mask[7f7f]
 [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
 amdgpu 0000:44:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
 amdgpu 0000:44:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
 amdgpu 0000:44:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
 [drm] Detected VRAM RAM=32752M, BAR=32768M
 [drm] RAM width 4096bits HBM
 [drm] amdgpu: 32752M of VRAM memory ready
 [drm] amdgpu: 257967M of GTT memory ready.
 [drm] GART: num cpu pages 131072, num gpu pages 131072
 [drm] PCIE GART of 512M enabled.
 [drm] PTB located at 0x00000087FEF00000
 amdgpu: hwmgr_sw_init smu backed is vega20_smu
 [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
 [drm] PSP loading UVD firmware
 [drm] Found VCE firmware Version: 57.6 Binary ID: 4
 [drm] PSP loading VCE firmware
 [drm] reserve 0x400000 from 0x87fe000000 for PSP TMR
 amdgpu 0000:44:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: DTM: optional dtm ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: RAP: optional rap ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
 [drm] Display Core v3.2.247 initialized on DCE 12.1
 [drm] kiq ring mec 2 pipe 1 q 0
 [drm] UVD and UVD ENC initialized successfully.
 [drm] VCE initialized successfully.
 kfd kfd: amdgpu: Allocated 3969056 bytes on gart
 kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
 amdgpu: Virtual CRAT table created for GPU
 amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
 kfd kfd: amdgpu: added device 1002:66a1
 amdgpu 0000:44:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 60
 amdgpu 0000:44:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 13 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 8
 amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
 amdgpu: Detected AMDGPU 2 Perf Events.
 [drm] Initialized amdgpu 3.54.0 20150101 for 0000:44:00.0 on minor 0
 [drm] Initialized mgag200 1.0.0 20110418 for 0000:0c:00.0 on minor 1
 fbcon: mgag200drmfb (fb0) is primary device
 mgag200 0000:0c:00.0: [drm] fb0: mgag200drmfb frame buffer device

The correct driver is assigned according to lspci -nnv

44:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] [1002:66a1] (rev 02)
	IOMMU group: 16
    Kernel driver in use: amdgpu
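
The same driver check can be scripted instead of reading lspci output by eye. The following is a minimal sketch that reads the `driver` symlink sysfs exposes for each PCI device; the function name `pci_driver` and the optional sysfs-root argument are illustrative assumptions, not part of any tool mentioned in this thread.

```shell
# pci_driver ADDR [SYSFS_ROOT]
# Print the kernel driver bound to a PCI device, or "none" if it is unbound.
# SYSFS_ROOT defaults to /sys; it is only a parameter so the function can be
# exercised against a fake directory tree.
pci_driver() {
    addr=$1
    root=${2:-/sys}
    link="$root/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/amdgpu
        basename "$(readlink "$link")"
    else
        echo none
    fi
}

# Example (hypothetical invocation on the host): pci_driver 0000:44:00.0
```

On bare metal this should report amdgpu for 0000:44:00.0, and vfio-pci once the host has been set up for passthrough.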

KVM/QEMU

I set up Alpine 3.19 as the host with PCI passthrough using VFIO, per the instructions at https://wiki.alpinelinux.org/wiki/KVM

After rebooting the host, the MI50 has the vfio-pci driver attached.

Alpine dmesg:

host $ sudo dmesg
	...
 modules=sd-mod,usb-storage,ext4,vfio,vfio-pci,vfio_iommu_type1,vfio_virqfd
	...
 ACPI: bus type drm_connector registered
 vfio_pci: add [1002:66a1[ffffffff:ffffffff]] class 0x000000/00000000
 [drm] amdgpu kernel modesetting enabled.
 amdgpu: Virtual CRAT table created for CPU
 amdgpu: Topology: Add CPU node
 [drm] Initialized mgag200 1.0.0 20110418 for 0000:0c:00.0 on minor 0
 fbcon: mgag200drmfb (fb0) is primary device
 mgag200 0000:0c:00.0: [drm] fb0: mgag200drmfb frame buffer device

host $ lspci -nnv

44:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] [1002:66a1] (rev 02)
..
        Kernel driver in use: vfio-pci
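
For passthrough it is generally necessary that every device sharing the GPU's IOMMU group is handed to vfio-pci (or is a bridge), so it can help to list the group members. A minimal sketch, assuming the usual sysfs layout; the function name and the optional sysfs-root argument are hypothetical.

```shell
# iommu_group_devices ADDR [SYSFS_ROOT]
# List the PCI addresses that share an IOMMU group with ADDR.
# Returns 1 if the device has no IOMMU group (for example, IOMMU disabled).
iommu_group_devices() {
    addr=$1
    root=${2:-/sys}
    group="$root/bus/pci/devices/$addr/iommu_group"
    if [ ! -d "$group/devices" ]; then
        echo "no IOMMU group for $addr" >&2
        return 1
    fi
    ls "$group/devices"
}

# Example (hypothetical invocation on the host): iommu_group_devices 0000:44:00.0
```

If the MI50 is the only entry, as one would hope for a card in its own riser slot, group isolation is unlikely to be the cause of the guest failure.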

Using virt-manager on the host, I set up Ubuntu 22.04 as the guest OS with a Q35 chipset and UEFI firmware.
I also added the MI50 PCI card during the hardware setup stage so the guest OS can access it.
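
Adding the card in virt-manager results in a `<hostdev>` entry in the domain XML. As a rough sketch of what that entry looks like for a given host address, the helper below prints one; the function name `hostdev_xml` is made up for illustration, and the exact XML virt-manager writes may carry extra attributes.

```shell
# hostdev_xml DDDD:BB:SS.F
# Print a libvirt <hostdev> element for a host PCI address such as 0000:44:00.0.
hostdev_xml() {
    addr=$1
    domain=${addr%%:*}   # "0000"
    rest=${addr#*:}      # "44:00.0"
    bus=${rest%%:*}      # "44"
    devfn=${rest#*:}     # "00.0"
    slot=${devfn%%.*}    # "00"
    func=${devfn#*.}     # "0"
    cat <<EOF
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x$domain' bus='0x$bus' slot='0x$slot' function='0x$func'/>
  </source>
</hostdev>
EOF
}

# Example: hostdev_xml 0000:44:00.0 > mi50-hostdev.xml
```

The resulting file could then be attached with `virsh attach-device <domain> mi50-hostdev.xml --config`, where the domain name and file name are placeholders.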

After installing the OS and rebooting the Ubuntu VM, I get the following error about initialising the card.

vm $ dmesg 
	...
systemd[1]: Starting Load Kernel Module drm...
 ACPI: bus type drm_connector registered
 systemd[1]: modprobe@drm.service: Deactivated successfully.
 systemd[1]: Finished Load Kernel Module drm.
 [drm] Device Version 0.0
 [drm] Compression level 0 log level 0
 [drm] 12286 io pages at offset 0x1000000
 [drm] 16777216 byte draw area at offset 0x0
 [drm] RAM header offset: 0x3ffe000
 [drm] qxl: 16M of VRAM memory size
 [drm] qxl: 63M of IO pages memory ready (VRAM domain)
 [drm] qxl: 64M of Surface memory size
 [drm] slot 0 (main): base 0xc4000000, size 0x03ffe000
 [drm] slot 1 (surfaces): base 0xc0000000, size 0x04000000
 [drm] Initialized qxl 0.1.0 20120117 for 0000:00:01.0 on minor 0
 fbcon: qxldrmfb (fb0) is primary device
 qxl 0000:00:01.0: [drm] fb0: qxldrmfb frame buffer device
 [drm] amdgpu kernel modesetting enabled.
 amdgpu: CRAT table not found
 amdgpu: Virtual CRAT table created for CPU
 amdgpu: Topology: Add CPU node
 [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x02).
 [drm] register mmio base: 0xC9200000
 [drm] register mmio size: 524288
 [drm] add ip block number 0 <soc15_common>
 [drm] add ip block number 1 <gmc_v9_0>
 [drm] add ip block number 2 <vega20_ih>
 [drm] add ip block number 3 <psp>
 [drm] add ip block number 4 <powerplay>
 [drm] add ip block number 5 <dm>
 [drm] add ip block number 6 <gfx_v9_0>
 [drm] add ip block number 7 <sdma_v4_0>
 [drm] add ip block number 8 <uvd_v7_0>
 [drm] add ip block number 9 <vce_v4_0>
 amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from ROM BAR
 amdgpu: ATOM BIOS: 113-D1631700-111
 [drm] UVD(0) is enabled in VM mode
 [drm] UVD(1) is enabled in VM mode
 [drm] UVD(0) ENC is enabled in VM mode
 [drm] UVD(1) ENC is enabled in VM mode
 [drm] VCE enabled in VM mode
 amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
 [drm] GPU posting now...
 [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
 [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing 4EC8 (len 74, WS 0, PS 8) @ 0x4EE0
 amdgpu 0000:05:00.0: amdgpu: gpu post error!
 amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
 amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
 amdgpu: probe of 0000:05:00.0 failed with error -22
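
When trying many guest configurations, it is convenient to classify each boot automatically rather than scanning dmesg by hand. This is a small sketch that keys on the message strings seen in the logs in this report; the function name and the status labels are arbitrary choices, and other kernel versions may word these messages differently.

```shell
# amdgpu_post_status
# Read kernel log text on stdin and print one of:
#   atombios-loop  the VBIOS POST got stuck (the failure reported here)
#   init-failed    some other fatal amdgpu init error
#   ok             amdgpu finished initialising
#   unknown        none of the known messages were found
amdgpu_post_status() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'atombios stuck in loop'; then
        echo atombios-loop
    elif printf '%s\n' "$log" | grep -q 'Fatal error during GPU init'; then
        echo init-failed
    elif printf '%s\n' "$log" | grep -q '\[drm\] Initialized amdgpu'; then
        echo ok
    else
        echo unknown
    fi
}

# Example (inside the guest): dmesg | amdgpu_post_status
```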

Ubuntu has the necessary linux-firmware installed.

I've searched extensively for this problem, but mostly found dead ends or problems that were not the same.
A few people have reported the same error, but there are no definitive answers.

Help with RadeonVII error "atombios stuck in a loop" (not a ROCm issue)
ROCm/ROCm#1320

What does this mean? (MI60)
https://community.amd.com/t5/graphics-cards/what-does-this-mean/td-p/599894

I would appreciate any help solving this problem so I can actually use rocm (https://github.com/ROCm/ROCm) as I'm stumped with no leads.

I'm not sure if it's a Dell server issue, or the MI50 doesn't like being in a VM (...like Nvidia, who charge extra fees for the privilege), or I've not set up KVM/QEMU correctly.

@kentrussell
Contributor

I believe that we saw this internally as well, and it was resolved with a different VBIOS. It was a quirk with MI50 specifically, since other VG20 SKUs were fine. I'm trying to find out if/where we published the solution (setting the GPU as a native PCIe endpoint in the VBIOS)

@JustGitting
Author

Hi @kentrussell,

Thanks for looking into this. Any luck finding the documentation or solution for this problem?

I'm keen to use ROCm for learning MLOps, at least until AMD drops support for the MI50 😢 🤔

@kentrussell
Contributor

I hadn't heard back from the guys yet, so I'll try to ping them again. It seems like a bit of a surprise that we'd identify something and not release a fix for it. But maybe the internal issue didn't document the whole process. I'll hopefully have something within the week

@JustGitting
Author

Thank you @kentrussell, appreciate your efforts!

@kentrussell
Contributor

So I got in touch with a virtualization expert. He was asking about why vfio-pci was enabled when it's already on passthrough. If you disable vfio-pci and run the KVM without it, does it throw the same error?

@JustGitting
Author

"He was asking about why vfio-pci was enabled when it's already on passthrough."

Sorry, I don't understand the question. From what I've read, vfio enables pci-passthrough on the host to allow a VM to use the hardware.

On my setup, vfio-pci is enabled only on the host side, not the VM/client side. Just to clarify, the message about the vfio-pci kernel driver is from the host.

host $ lspci -nnv

44:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] [1002:66a1] (rev 02)
..
        Kernel driver in use: vfio-pci

"If you disable vfio-pci and run the KVM without it, does it throw the same error?"

If I don't set up and enable vfio on the host, then I cannot add the GPU (PCI device 44:00.0) in the VM hardware setup. It complains that the device has not been set up for PCI passthrough.

Are you able to clarify with your colleague the meaning of "... why vfio-pci was enabled when it's already on passthrough."?

Is there another way to get passthrough?

@JustGitting
Author

Hi @kentrussell

Just wondering if you've had any luck from your colleagues or other teams?

@kentrussell
Contributor

Sorry for the delays on it. I keep poking around and am getting passed around to different contacts, but haven't found anything useful yet. Currently there isn't a new VBIOS to fix it. I have been trying to find out if there's a way to manually force the change through other tools to no avail. I'm still hopeful though.

@JustGitting
Author

Hi @kentrussell,

I hope you have been well. Any progress with this issue? Any hacks 🔨 around the problem?

... or could you convince the AMD gods to release all GPU VBIOSes as free and open source? 😃 It would really help the compute ecosystem and help users all over the world. 🚀

@kentrussell
Contributor

kentrussell commented Feb 12, 2024

@nartmada I haven't had any luck trying through all of my unofficial channels and contacts. Think you can make a proper JIRA and assign it to the MI50 program to see if we can get this addressed? I'll ping you with the old JIRA so we can link against it. I just don't know where else we can get a workaround/fix

@JustGitting
Author

Thanks @kentrussell for your efforts!

@JustGitting
Author

Hi @nartmada, @kentrussell,

Any luck by creating the JIRA ticket?

@nartmada

@JustGitting. Apologies for my slow progress. JIRA ticket has been created and I am following up with the MI50 folks.

@JustGitting
Author

Great, thank you @nartmada! 🚀

@JustGitting
Author

Hi @nartmada @kentrussell,

Just pinging you both, hoping to hear some good news 😄

I'm happy to test any procedures, hacks or ideas that have come up so far in solving this problem. 🪛

@JustGitting
Author

Hi @nartmada @kentrussell,

Any progress with the JIRA ticket?

@kentrussell
Contributor

Sorry, at last check it was still bouncing around. Trying to find who owns it is harder, since some teams work on the latest HW and transition support for specific HW to other teams. Adam might have an update if I've missed it though

@JustGitting
Author

Thank you for chasing this up @kentrussell

Getting support/fixes for old(er) hardware has always been hard (or non-existent) in the computer industry. Getting customers to upgrade all the time is more profitable than supporting products long term. Lack of support for products is a form of "designed obsolescence". /rant

Hence, I appreciate your efforts.

I've started trying to debug this again. If I find anything I'll share here.

@JustGitting
Author

Minor update

I've tried various QEMU XML options for the VM and changed host kernel options, without success.

I thought my only option was to try to pass the vbios to the VM, as suggested by https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF, among others.

<hostdev>
     ...
     <rom file='/path/to/your/gpu/bios.bin'/>
     ...
</hostdev>

I've searched techpowerup.com for the MI50 vbios, but nothing turns up.
Unfortunately, neither documentation nor the vbios turns up on AMD's website either.

Hence, I've tried to manually dump the GPU vbios, but have also been unsuccessful.

On the host running Alpine, I've disabled all vfio settings and blacklisted amdgpu so that the card is not initialised,
which I understand can otherwise block subsequent reads of the ROM [1].

Kernel options:
intel_iommu=on iommu=pt module_blacklist=amdgpu,radeon

For good measure:

# cat /etc/modules-load.d/blacklist.conf 
blacklist amdgpu
blacklist radeon

I've rebuilt the initramfs and rebooted.

Log in as root:
$ sudo su -

No amdgpu module is loaded.
# lsmod | grep amdgpu

No drivers are associated with the GPU.

# lspci -k -s 44:00
44:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] (rev 02)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0834

The ROM file is 128KB, indicating it's a UEFI vbios.

# ls -hl /sys/bus/pci/devices/0000:44:00.0/
...
-rw------- 1 root root 128K Apr  6 11:00 rom
...

Dump vbios.

# echo 1 | tee -a /sys/bus/pci/devices/0000:44:00.0/rom
# cat /sys/bus/pci/devices/0000:44:00.0/rom > amd_mi50_vbios.bin

However, amd_mi50_vbios.bin is only 41KB.

Q. How do I dump the vbios from the AMD GPU? Or has it been disabled?

Or can AMD please release the latest vbios for their enterprise GPUs?

  1. https://github.com/SpaceinvaderOne/Dump_GPU_vBIOS
    "However if the vbios looks incorrect and the vbios is under 70kb then it was probably dumped from a primary GPU. This is because the vbios was shadowed during the boot process and so the resulting vbios is a small file which is not correct. So the script will now disconnect the GPU then put the server to sleep. Next it will prompt you to press the power button to resume the server from its sleep state. Once server is woken the script will rescan the pci bus reconnecting the GPU. This now allows the primary gpu to be able to have the vbios dumped correctly. Script will then redump the vbios again putting the vbios in the loaction specified in the script (defualt /mnt/user/isos/vbios)"

@JustGitting
Author

Hi @kentrussell, @nartmada,

Any suggestions for how to dump the vbios on the MI50?
Or will AMD release the vbios?

@nartmada

@JustGitting, sorry for the slow response.

I will get the answers to your 2 questions:
Q. How do I dump the vbios from the AMD GPU? Or has it been disabled?
Or can AMD please release the latest vbios for their enterprise GPUs?

@JustGitting
Author

Thank you @nartmada.

@kentrussell
Contributor

Thankfully dumping the VBIOS is easy on the terminal:
sudo cat /sys/kernel/debug/dri/X/amdgpu_vbios > vbios.rom
where X is the GPU that you want to dump
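
On a machine with more than one display device (here the onboard Matrox and the MI50), X is not always 0. A minimal sketch for finding it, assuming the debugfs index matches the cardN number under /sys/class/drm; the function name and the optional directory argument are illustrative only.

```shell
# dri_index_for_pci ADDR [DRM_CLASS_DIR]
# Print N such that /sys/class/drm/cardN belongs to PCI device ADDR.
# Returns 1 if no card matches.
dri_index_for_pci() {
    addr=$1
    class=${2:-/sys/class/drm}
    for dev in "$class"/card[0-9]*/device; do
        [ -e "$dev" ] || continue
        card=${dev%/device}
        name=${card##*/}
        # Skip connector entries such as card0-DP-1.
        case $name in *-*) continue ;; esac
        if [ "$(basename "$(readlink -f "$dev")")" = "$addr" ]; then
            echo "${name#card}"
            return 0
        fi
    done
    return 1
}

# Example: X=$(dri_index_for_pci 0000:44:00.0)
#          sudo cat /sys/kernel/debug/dri/$X/amdgpu_vbios > vbios.rom
```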

@JustGitting
Author

@kentrussell, @nartmada

I've booted to a live Ubuntu which found and initialised the AMD gpu without error.

I did the following as suggested:

$ sudo su -
# cat /sys/kernel/debug/dri/1/amdgpu_vbios > vbios.rom

However, the vbios.rom file is still only 41KB... this is frustrating 🆘 ☹️

Q1. What size should the vbios.rom file be?

Q2. How do I check if the vbios.rom is valid?

Q3. Any other approaches?
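
On Q2, a basic sanity check is possible with standard tools. The sketch below inspects only the generic PCI expansion ROM header: the 55 AA signature, the pointer at offset 0x18 to the "PCIR" structure, and the vendor ID, device ID and image length stored there. It does not verify checksums or any AMD-specific content, so a file can pass and still be incomplete; the function names are made up for illustration.

```shell
# byte_at FILE OFFSET: print the byte at OFFSET as a decimal number.
byte_at() { od -An -tu1 -j "$2" -N1 "$1" | tr -d ' \n'; }

# u16le_at FILE OFFSET: print the little-endian 16-bit value at OFFSET.
u16le_at() {
    lo=$(byte_at "$1" "$2")
    hi=$(byte_at "$1" $(( $2 + 1 )))
    echo $(( lo + hi * 256 ))
}

# rom_check FILE: report header fields of the first image in a PCI option ROM.
rom_check() {
    f=$1
    if [ "$(byte_at "$f" 0)" != 85 ] || [ "$(byte_at "$f" 1)" != 170 ]; then
        echo "bad signature (expected 55 AA)"
        return 1
    fi
    pcir=$(u16le_at "$f" 24)                      # pointer at offset 0x18
    sig=$(dd if="$f" bs=1 skip="$pcir" count=4 2>/dev/null)
    if [ "$sig" != PCIR ]; then
        echo "no PCIR structure at offset $pcir"
        return 1
    fi
    vendor=$(u16le_at "$f" $(( pcir + 4 )))
    device=$(u16le_at "$f" $(( pcir + 6 )))
    blocks=$(u16le_at "$f" $(( pcir + 16 )))      # image length, 512-byte units
    printf 'vendor=%04x device=%04x first_image_bytes=%d file_bytes=%d\n' \
        "$vendor" "$device" $(( blocks * 512 )) "$(wc -c < "$f")"
}

# Example: rom_check vbios.rom
```

For this card one would expect vendor=1002 and device=66a1, matching the lspci output. A first image length larger than the file size would suggest a truncated dump.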

@JustGitting
Author

JustGitting commented May 5, 2024

@kentrussell, @nartmada

Great news, I've found out how to initialise the GPU in the VM.

1. Dumping Video bios ROM (not needed for VM initialisation, but documenting the method for others)

I've found out how to dump the AMD GPU firmware ROM. It turns out that the ROM is not directly accessible via the standard methods detailed previously.
According to the discussion at https://www.reddit.com/r/Amd/comments/16oiecw/mi50_bios_flash/, the AMD Instinct GPU cards have three (3) firmware ROMs and a security engine that controls both the initialisation process and access to the VBIOS.
The ROM can only be read with the proprietary amdvbflash tool, which knows how to do the secret handshake with the card.

I downloaded the GNU/Linux version, amdvbflash_linux_4.71.zip, from https://www.techpowerup.com/download/ati-atiflash/.

I installed the GNU/Linux amdvbflash tool and executed the following commands:

@host $ sudo ./amdvbflash
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.

--- amdvbflash v4.71 ---
-h, -?, /h, /?          Help (this screen)

Format: amdvbflash [command] [parameter1] [parameter2] [parameter3] <option/s>
[command]:
-i [Num]                Display information of AMD adapters in the system.
                        Display information of adapter [Num] if specified.
-ai [Num]               Display advanced information of AMD adapters on system.
                        Display advanced information of adapter [Num]
                        if specified.
-biosfileinfo <File>    Displays the Bios info in file <file>
-p <Num> <File>         Write BIOS image in file <File> to flash ROM in Adapter
                        <Num>.
-pa [-padevid=] <File>  Write BIOS image <File> to all appropriate adapters.
                        Use with -padevid or -passid or -pasvid or -pavbpn or -fp.
                        Command example:
                        command = -pa -padevid=0xXXXX a123.bin 
-s <Num> <File> [Size]  Save BIOS image from adapter <Num> to file <File>.
                        First [Size] kbytes (except for Theater in bytes) of ROM
                        content is saved if [Size] is specified.
-cf <File> [Sum]        Calculate 16-bit checksum for file <File>.
                        Checksum for the file is compared to [Sum] which is
                        the expected checksum 
-cb <Num> [Sum]         Calculate 16-bit BIOS image checksum for adapter <Num>.
                        Checksum for the BIOS image is compared to [Sum] which is
                        the expected checksum 
-cr <Num> [Size] [Sum]  Calculate 16-bit ROM checksum for adapter <Num> and
                        compare it to the [Sum] specified.  This command is
                        the same as -cb if [Size] is specified.
-t <Num>                Test ROM access of adapter <Num>
-v <Num> <File>         Compare ROM content of adapter <Num> to <File>
-mi <Num> [ID]          Modify SSID & SVID in BIOS image of adapter <Num> to
                        <ID>.  SSID & SVID in BIOS image of adapter <Num> is
                        displayed if [ID] is not specified.
-mb <Num> <File>        Modify SSID, SVID, BIOS Pin Number, & Boot Message in
                        BIOS image of adapter <Num> to values in <filename>.
                        Input file example:
                           ssid = 715B
                           svid = 1002
                           biospn = "113-xxxxxx-xx"
                           bootmsg = "AMD graphic board"
-pak <File>             Package an executable for BIOS update according to
                        the commands in <File>.
                        Config file example:
                           outfile = update.exe
                           banner = "Update v1.0"
                           infile = a123.bin
                           command = -pa -padevid=715B infile
-isr <Num> <Build Number> <Board Number>  Set ISR number based on the given
                                          build and board number
                                          if not specified, print out ISR Number
-prod <Num> <12 digit serial number>      Set PROD number based
                                          on the given serial number
                                          if not specified, print out SN Number
-checkprodsn <Num> <12 Digit Serial Number>  Comparing the Prod SN based on 
                                             existing prod sn saved in ROM 

<option/s>:
-f              Force flashing regardless of security checkings (e.g. AsicID &
                BIOS file info check OR boot-up card).
-fm             Force flashing bypassing BIOS memory config check.
-fs             Force flashing bypassing BIOS SSID check.
-fp             Force flashing bypassing BIOS P/N check.
-fa             Force flashing bypassing already-programmed check.
-fv             Force flashing bypassing newer BIOS version check.
-nw             No user interaction on test failure. 
-sst            Use SST25VFxxx flashing algorithm regardless of ROMID straps.
-st             Use ST M25Pxx flashing algorithm regardless of ROMID straps.
-atmel          Use AT25Fxxx flashing algorithm regardless of ROMID straps.
-nopci          Do not enumerate PCI adapters, i.e. enumerate only AGP and
                PCIe adpaters
-pcionly        Enumerate only PCI adapters, i.e. do not enumerate AGP and
                PCIe adapters
-agp            Enumerate only AGP adapters, i.e. do not enumerate PCI and
                PCIe adapters unless used with -pcie or -pci
-noagp          Do not enumerate AGP adapters, i.e. enumerate only PCI and
                PCIe adpaters
-pcie           Enumerate only PCIe adapters, i.e. do not enumerate AGP and
                PCI adapters unless used with -agp or -pci
-nopcie         Do not enumerate PCIe adapters, i.e. enumerate only AGP and
                PCI adpaters

-pci            Enumerate only PCI adapters, i.e. do not enumerate AGP and
                PCIe adapters unless used with -agp or -pcie
-maxsegtoscan=# Limits PCI segment group number to be scanned for devices to the specified value.
-maxbustoscan=# Limits PCI bus number to be scanned for devices to the specified value.
-reboot         Force a reboot of the system after successfully completing the
                specified operation
-keepisrsn      keep the ISR Number on the adapter when flashing a new VBIOS
-keepprodsn     keep the Prod SN on the adapter when flashing a new VBIOS
-siireset       Specifies the GPIO Pin to be used as the Reset when updating
                SiI1930 microcontroller firmware
                Input example:
                   -siireset=7 <No Spaces>
-siiuprog       Specifies the GPIO Pin to be used as the uprog when updating
                SiI1930 microcontroller firmware
                Input example:
                   -siiuprog=14 <No Spaces>
-scansii        Overrides normal adapter detection to enable detecting SSI
                roms with/without TPI firmware
-log            Logs output to amdvbflash.log, overrides existing file
-logappend      Logs and appends output to amdvbflash.log
-ddc            Enable DDC support
-padevid=<ID>   Use with -pa command to update adapters of specific device ID.
-passid=<ID>    Use with -pa command to update adapters of specific SSID.
-pasvid=<ID>    Use with -pa command to update adapters of specific SVID.
-pavbpn=<VBPN>  Use with -pa command to update adapters of specific VBIOS PN.
-excl_memtrain_dtable     When flashing on new VBIOS, a pre-determined memory
                          training data table in the old VBIOS will not be
                          overwritten.
-isr <adapter num> [build num] [board num]     If build number and board number
                                               are specified, sets the ISR Number
                                               value in specified adapter.
                                               If only adapter is specified, the current
                                               ISR Number is displayed
-checkpn <adapter num> <filename>              Checks PN of the current product
                                               and compares it to external file
-rsa <filename>                                Verify VBIOS immage file RSA signature
*<Num> = adapter number, <File> = filename
*[Size] = data block size in KBytes, except for Theater Pro in Bytes
*Use command -i to see the adapter numbers in the system.

Check what AMD cards are available:

@host $ sudo ./amdvbflash -i
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.

adapter seg  bn dn dID       asic           flash      romsize test    bios p/n    
======= ==== == == ==== =============== ============== ======= ==== ================
   0    0000 44 00 66A1 Vega20          GD25Q80C        100000 pass 113-D1631700-111

Dump the VBIOS ROM from the MI50 using the proprietary amdvbflash tool:

@host $ sudo ./amdvbflash -s 0 amd_mi50_vbios_113-D1631700-111.rom

Check the information in the ROM file:

@host $ sudo ./amdvbflash -biosfileinfo amd_mi50_vbios_113-D1631700-111.rom 
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.

    Product Name is :    Vega20 A1 SERVER XL D16317 Hynix/Samsung 32GB 8HI 
    Device ID is    :    66A1
    Bios Version    :    016.004.000.056.013522
    Bios P/N is     :    113-D1631700-111
    Bios SSID       :    0834
    Bios SVID       :    1002
    Bios Date is    :    01/16/20 21:38 

The VBIOS ROM file is large, at 1MB:

@host $ ls -lh amd_mi50_vbios_v016.004.000.056.013522.rom
...
-rw-r--r-- 1 user users 1.0M Apr 20 08:01 amd_mi50_vbios_v016.004.000.056.013522.rom
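
A quick sanity check on a dumped ROM is to confirm that it starts with the PCI expansion ROM signature bytes 0x55 0xAA. The sketch below is only an illustration: the function name `rom_has_signature` is made up here, and it checks nothing beyond the first two bytes.

```shell
# Hypothetical helper: return 0 if the file begins with the PCI expansion
# ROM signature (bytes 0x55 0xAA), non-zero otherwise.
rom_has_signature() {
    # od prints the first two bytes as hex, e.g. " 55 aa"; strip the spacing.
    sig=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Example (file name taken from the dump step above):
#   rom_has_signature amd_mi50_vbios_113-D1631700-111.rom && echo "signature OK"
```

A matching signature does not prove the image is complete or valid for the card; it only catches an obviously empty or truncated dump.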

2. Fixing "Atombios stuck in loop"

It turns out that dumping the ROM file and passing it to the VM is not necessary.

The card needs to be reset (re-initialised) by the vendor-reset module (https://github.com/gnif/vendor-reset), because the standard kernel reset methods (in pci_quirks?) cannot reset it before the VM accesses it.

I had found someone with the same "atombios stuck in loop" problem in the discussion www.reddit.com/r/VFIO/comments/oxsku7/vfio_amd_vega20_gpu_passthrough_issues/.
The OP said they had resolved the problem: "So I somewhat got it with working with your suggestion. I had to upgrade the kernel to 5.4.0-66-generic as I could not get the reset module installed on 4.15.0-144". I assumed the fix was the kernel upgrade from 4.15 to 5.4, a big jump, and that vendor-reset was secondary, because everything I had read about it described people using it to solve the "reset bug", where the GPU is unavailable after the VM shuts down (e.g. https://www.nicksherlock.com/2020/11/working-around-the-amd-gpu-reset-bug-on-proxmox/).

Eventually I tried vendor-reset out of desperation, and it solved the "atombios stuck in loop" problem.

Check that the kernel has the features the module requires (all should be 'y'):

@host $ sudo grep -E "CONFIG_FTRACE=|CONFIG_KPROBES=|CONFIG_PCI_QUIRKS=|CONFIG_KALLSYMS=|CONFIG_KALLSYMS_ALL=|CONFIG_FUNCTION_TRACER=" /boot/config-`uname -r`
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KPROBES=y
CONFIG_PCI_QUIRKS=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
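
The grep above lists the options that are set, so a missing one is easy to overlook. A small sketch that instead reports any option not set to 'y' (the function name `missing_kernel_options` is made up for illustration; the option names are the ones from the grep above):

```shell
# Hypothetical helper: print each required option that is not "=y" in the
# given kernel config file; return non-zero if at least one is missing.
missing_kernel_options() {
    config=$1
    shift
    missing=0
    for opt in "$@"; do
        # Match the exact line "OPTION=y"; "=m" or an absent line counts as missing.
        if ! grep -q "^${opt}=y\$" "$config"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   missing_kernel_options "/boot/config-$(uname -r)" \
#       CONFIG_FTRACE CONFIG_KPROBES CONFIG_PCI_QUIRKS \
#       CONFIG_KALLSYMS CONFIG_KALLSYMS_ALL CONFIG_FUNCTION_TRACER
```

No output and a zero exit status mean every listed option is enabled.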

Clone the module source:

@host $ git clone https://github.com/gnif/vendor-reset.git
@host $ cd vendor-reset

Build and install the module with DKMS:

@host $ sudo dkms install .

Add the vendor-reset module name to /etc/modules so it is loaded early in the boot process.

@host $ sudo cat /etc/modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.

vendor-reset

I've also added the module name to /etc/modules-load.d/modules.conf (possibly unnecessarily):

@host $ sudo cat /etc/modules-load.d/modules.conf
vendor-reset

Update initramfs:

@host $ sudo update-initramfs -k all -u
update-initramfs: Generating /boot/initrd.img-6.5.0-28-generic
update-initramfs: Generating /boot/initrd.img-6.2.0-26-generic

I assumed the module would be included in the initramfs, but it is not, so updating the initramfs is probably unnecessary.

@host $ sudo lsinitramfs /boot/initrd.img-`uname -r` | grep -i vendor
nothing...

Reboot machine.
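
After the reboot it may be worth confirming that the module was actually loaded. A minimal sketch, assuming the usual /proc/modules format of one module per line with the name first; the kernel lists module names with underscores, so "vendor-reset" shows up as "vendor_reset". The function name `module_loaded` is made up for illustration.

```shell
# Hypothetical helper: return 0 if the named module is in a /proc/modules-style
# listing. The second argument is optional and defaults to /proc/modules.
module_loaded() {
    # The kernel reports names with underscores, so normalise hyphens first.
    name=$(printf '%s' "$1" | tr '-' '_')
    modules_file=${2:-/proc/modules}
    grep -q "^${name} " "$modules_file"
}

# Example:
#   module_loaded vendor-reset && echo "vendor-reset is loaded"
```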

UPDATED: There is no need for the manual "workaround" below. Just copy the udev rules from vendor-reset to the udev rules directory (gnif/vendor-reset#46 (comment)).

@host $ sudo cp udev/99-vendor-reset.rules /etc/udev/rules.d/

Superseded by the udev rules above: the "reset_method" of the card needs to be manually changed to work around a bug before you start the VM, just after booting the host (see "vendor-reset stopped working with kernel 5.15 and is also present with 5.16 (affected are definitely Debian, Arch and Gentoo)", https://github.com/gnif/vendor-reset/issues/46).

Need real root access for this:

@host $ sudo su -

Original reset method setting:

@host # cat /sys/bus/pci/devices/0000:44:00.0/reset_method
bus

Change reset method:

@host # echo "device_specific" > /sys/bus/pci/devices/0000:44:00.0/reset_method

@host # cat /sys/bus/pci/devices/0000:44:00.0/reset_method
device_specific
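
Whether the method is set by the udev rules or by hand, it can be checked before starting the VM. The reset_method file holds a space-separated list of methods, so a sketch like the following tests for one entry (the function name `has_reset_method` is made up for illustration; the device address is the one used above):

```shell
# Hypothetical helper: return 0 if the given method is listed in a
# reset_method sysfs file (a space-separated list of reset methods).
has_reset_method() {
    file=$1
    method=$2
    # Word-split the file contents and compare each entry exactly.
    for m in $(cat "$file"); do
        [ "$m" = "$method" ] && return 0
    done
    return 1
}

# Example:
#   has_reset_method /sys/bus/pci/devices/0000:44:00.0/reset_method device_specific \
#       || echo "device_specific reset is not enabled for this card"
```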

Because the GPU has a large amount of RAM (32GB), the address space available for PCI BARs (Base Address Registers, https://wiki.osdev.org/PCI#Base_Address_Registers) needs to be increased in the VM (www.reddit.com/r/VFIO/comments/oxsku7/vfio_amd_vega20_gpu_passthrough_issues/). Otherwise you get errors in the VM about insufficient space for PCI BAR allocation.

The following goes in the libvirt domain XML:

<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
  ...
  <qemu:commandline>
    <qemu:arg value="-cpu"/>
    <qemu:arg value="host,host-phys-bits=on"/>
    <qemu:arg value="-fw_cfg"/>
    <qemu:arg value="opt/ovmf/X-PciMmio64Mb,string=65536"/>
  </qemu:commandline>
</domain>
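
The X-PciMmio64Mb value is given in MB, so 65536 requests a 64GB 64-bit MMIO window, twice the size of the card's 32GB BAR. For a card with a different amount of VRAM, the arithmetic could be sketched as below. The rule used here (double the VRAM size, then round up to a power of two) is only an assumption that happens to reproduce 65536 for this card; it is not taken from the OVMF documentation or from this thread, and the function name `mmio64_mb` is made up.

```shell
# Hypothetical helper: suggest an X-PciMmio64Mb value for a card with the
# given VRAM size in MB. Assumed rule of thumb: twice the VRAM, rounded up
# to the next power of two.
mmio64_mb() {
    want=$(( $1 * 2 ))
    size=1
    while [ "$size" -lt "$want" ]; do
        size=$(( size * 2 ))
    done
    echo "$size"
}

# Example: a 32GB card (32768 MB)
#   mmio64_mb 32768
```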

Start the VM and check that the GPU has initialised:

@vm $ sudo dmesg | grep -Ei "bios|gpu|amd|drm"
[    0.000000] SMBIOS 2.8 present.
[    0.000000] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[    0.020666] RAMDISK: [mem 0x669a1000-0x6ce02fff]
[   24.981985] ACPI: bus type drm_connector registered
[   25.005362] amdkcl: loading out-of-tree module taints kernel.
[   28.282155] [drm] amdgpu kernel modesetting enabled.
[   28.282161] [drm] amdgpu version: 6.7.0
[   28.282162] [drm] OS DRM version: 6.5.0
[   28.282320] amdgpu: Virtual CRAT table created for CPU
[   28.282342] amdgpu: Topology: Add CPU node
[   28.318933] amdgpu: PeerDirect support was initialized successfully
[   28.319860] [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x02).
[   28.319883] [drm] register mmio base: 0xC9200000
[   28.319885] [drm] register mmio size: 524288
[   28.320275] [drm] add ip block number 0 <soc15_common>
[   28.320279] [drm] add ip block number 1 <gmc_v9_0>
[   28.320281] [drm] add ip block number 2 <vega20_ih>
[   28.320282] [drm] add ip block number 3 <psp>
[   28.320284] [drm] add ip block number 4 <powerplay>
[   28.320285] [drm] add ip block number 5 <dm>
[   28.320287] [drm] add ip block number 6 <gfx_v9_0>
[   28.320288] [drm] add ip block number 7 <sdma_v4_0>
[   28.320289] [drm] add ip block number 8 <uvd_v7_0>
[   28.320291] [drm] add ip block number 9 <vce_v4_0>
[   28.354277] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from ROM BAR
[   28.354283] amdgpu: ATOM BIOS: 113-D1631700-111
[   28.360848] [drm] UVD(0) is enabled in VM mode
[   28.360851] [drm] UVD(1) is enabled in VM mode
[   28.360851] [drm] UVD(0) ENC is enabled in VM mode
[   28.360852] [drm] UVD(1) ENC is enabled in VM mode
[   28.360853] [drm] VCE enabled in VM mode
[   28.360866] amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[   28.360891] amdgpu 0000:05:00.0: amdgpu: PCIE atomic ops is not supported
[   28.360899] [drm] GPU posting now...
[   28.361352] amdgpu 0000:05:00.0: amdgpu: MEM ECC is active.
[   28.361354] amdgpu 0000:05:00.0: amdgpu: SRAM ECC is active.
[   28.361363] amdgpu 0000:05:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[67f7f] ras_mask[67f7f]
[   28.361374] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   28.361409] amdgpu 0000:05:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[   28.361413] amdgpu 0000:05:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   28.361424] [drm] Detected VRAM RAM=32752M, BAR=32768M
[   28.361426] [drm] RAM width 4096bits HBM
[   28.361919] [drm] amdgpu: 32752M of VRAM memory ready
[   28.361924] [drm] amdgpu: 3956M of GTT memory ready.
[   28.361974] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   28.362112] [drm] PCIE GART of 512M enabled.
[   28.362115] [drm] PTB located at 0x00000087FEF00000
[   28.385084] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu
[   28.390279] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
[   28.390338] [drm] PSP loading UVD firmware
[   28.394989] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[   28.395115] [drm] PSP loading VCE firmware
[   28.546950] amdgpu 0000:05:00.0: amdgpu: reserve 0x400000 from 0x87fe000000 for PSP TMR
[   28.630532] amdgpu 0000:05:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[   28.630535] amdgpu 0000:05:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[   28.630538] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   28.630540] amdgpu 0000:05:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[   28.634068] [drm] Display Core v3.2.269 initialized on DCE 12.1
[   28.636967] [drm] kiq ring mec 2 pipe 1 q 0
[   28.679297] [drm] UVD and UVD ENC initialized successfully.
[   28.878153] [drm] VCE initialized successfully.
[   29.305351] amdgpu: HMM registered 32752MB device memory
[   29.327912] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[   29.327958] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[   29.328594] amdgpu: Virtual CRAT table created for GPU
[   29.329216] amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
[   29.329226] kfd kfd: amdgpu: added device 1002:66a1
[   29.349181] amdgpu 0000:05:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 60
[   29.349199] amdgpu 0000:05:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[   29.349204] amdgpu 0000:05:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
[   29.349208] amdgpu 0000:05:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
[   29.349211] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
[   29.349216] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
[   29.349219] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
[   29.349222] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
[   29.349226] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
[   29.349229] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
[   29.349233] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
[   29.349236] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
[   29.349239] amdgpu 0000:05:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 13 on hub 0
[   29.349243] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
[   29.349246] amdgpu 0000:05:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 8
[   29.349249] amdgpu 0000:05:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 8
[   29.349253] amdgpu 0000:05:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 8
[   29.349257] amdgpu 0000:05:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 8
[   29.349260] amdgpu 0000:05:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 8
[   29.349264] amdgpu 0000:05:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 8
[   29.349267] amdgpu 0000:05:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 8
[   29.349271] amdgpu 0000:05:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 8
[   29.349274] amdgpu 0000:05:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 8
[   29.349277] amdgpu 0000:05:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 8
[   29.349281] amdgpu 0000:05:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 8
[   29.349284] amdgpu 0000:05:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 8
[   29.357418] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
[   29.357497] amdgpu: Detected AMDGPU 2 Perf Events.
[   29.376346] [drm] Initialized amdgpu 3.57.0 20150101 for 0000:05:00.0 on minor 0
[   29.886483] [drm] Fence fallback timer expired on ring uvd_0
[   30.398468] [drm] Fence fallback timer expired on ring uvd_0
[   30.910496] [drm] Fence fallback timer expired on ring uvd_enc_0.0
[   31.422472] [drm] Fence fallback timer expired on ring uvd_enc_0.1
[   31.934497] [drm] Fence fallback timer expired on ring uvd_1
[   32.446497] [drm] Fence fallback timer expired on ring uvd_1
[   32.958490] [drm] Fence fallback timer expired on ring uvd_enc_1.0
[   33.470504] [drm] Fence fallback timer expired on ring uvd_enc_1.1
[   34.316893] systemd[1]: Starting Load Kernel Module drm...
[   34.379273] systemd[1]: modprobe@drm.service: Deactivated successfully.
[   34.379841] systemd[1]: Finished Load Kernel Module drm.
[   35.803108] [drm] Device Version 0.0
[   35.803112] [drm] Compression level 0 log level 0
[   35.803114] [drm] 12286 io pages at offset 0x1000000
[   35.803117] [drm] 16777216 byte draw area at offset 0x0
[   35.803119] [drm] RAM header offset: 0x3ffe000
[   35.803315] [drm] qxl: 16M of VRAM memory size
[   35.803320] [drm] qxl: 63M of IO pages memory ready (VRAM domain)
[   35.803322] [drm] qxl: 64M of Surface memory size
[   35.824090] [drm] slot 0 (main): base 0xc4000000, size 0x03ffe000
[   35.824241] [drm] slot 1 (surfaces): base 0xc0000000, size 0x04000000
[   35.824944] [drm] Initialized qxl 0.1.0 20120117 for 0000:00:01.0 on minor 1
[   35.827064] fbcon: qxldrmfb (fb0) is primary device
[   35.877567] qxl 0000:00:01.0: [drm] fb0: qxldrmfb frame buffer device
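
Rather than reading the whole log, the outcome can be summarised by looking for the failure messages from the title of this issue and for the "Initialized amdgpu" line shown above. This is only a sketch: the function name `amdgpu_init_status` is made up, and other failure messages are not covered.

```shell
# Hypothetical helper: classify amdgpu initialisation from kernel log text
# on stdin. Prints "failed", "ok" or "unknown".
amdgpu_init_status() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -qiE 'atombios stuck in loop|Fatal error during GPU init'; then
        echo failed
    elif printf '%s\n' "$log" | grep -q 'Initialized amdgpu'; then
        echo ok
    else
        echo unknown
    fi
}

# Example:
#   sudo dmesg | amdgpu_init_status
```

A failure message takes precedence, so a log containing both a failed attempt and a later success is still reported as "failed" and deserves a closer look.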

Everything looks good. I've been able to install ROCm, or at least the installation appears to have worked; I have yet to use it for actual ML.

3. Suggestions.

This is not aimed at AMD staff, who are clearly doing their best; it's a problem with leadership's lack of vision and strategy.

After wasting so much time fixing this problem myself, I hope AMD can do better by users and the community.
It's very frustrating that AMD doesn't have the internal documentation, support systems or knowledgeable staff to help with a) dumping the VBIOS from their own GPUs or b) correctly initialising the card, which is sort of important...

3.1. FOSS-alise amdvbflash tool.

It's clear GNU/Linux is not a priority, as the amdvbflash tool needed for advanced BIOS tasks has not been updated since 20th March 2020. I also cannot find an official release from AMD.

Please push for amdvbflash to be made Free and Open Source Software, so the community could learn from it and update the tool for GNU/Linux users. We would then no longer be dependent on AMD for some support.

3.2. Fix reset problem.

It is crazy that AMD's later GPUs are unusable for VMs unless a 3rd party tool is used (vendor-reset module).

It is a major problem for us poor users that AMD doesn't document the problem, fix it, or even provide support for the ONE developer (@gnif) who created the workaround, especially given that the reset workaround (vendor-reset) needs another current workaround (gnif/vendor-reset#46) for it to work at all.

Please fix this problem in a more permanent and sustainable manner, or provide support to the SINGLE developer who is fixing AMD's problems and who is a single point of failure. The attempted xz/liblzma ssh backdoor shows how serious this risk is (oss-security - backdoor in upstream xz/liblzma leading to ssh server compromise https://www.openwall.com/lists/oss-security/2024/03/29/4).

@gnif

gnif commented May 8, 2024

> Especially given the reset work-around (vendor-reset) needs another current workaround (gnif/vendor-reset#46) for it to work at all.

That issue is old and should have been closed. It's not a "work around", the kernel was enhanced to allow exactly what vendor-reset does without as much jank as it had prior. I have updated the issue and posted the proper solution.

@JustGitting
Author

JustGitting commented May 8, 2024

Thanks @gnif! 👍

@ppanchad-amd

@JustGitting Please advise if we can go ahead and close this ticket. Thanks!

@JustGitting
Author

Hi @ppanchad-amd,

This issue has not been fixed as far as I'm aware. There has been no response from @kentrussell, @nartmada or anyone else from AMD regarding the points I made in the previous post.

Are there any plans from AMD to fix this issue? Was there a fix? Is there any updated documentation for a workaround?

@ppanchad-amd

@JustGitting We have an internal ticket to investigate this issue. I will follow up to get an update. Thanks!

@JustGitting
Author

@ppanchad-amd Any luck with an update?

@JustGitting
Author

Hi @ppanchad-amd, any news regarding this issue? Thanks for chasing this up.

@ppanchad-amd

@JustGitting I have followed up with the internal team and they are looking at how to resolve this issue. I will keep you posted. Thanks!
