DPU fw upgrade/reboot caused Host crash due to PCIe DPC events #180

Open
glimchb opened this issue Nov 28, 2022 · 6 comments

@glimchb
Member

glimchb commented Nov 28, 2022

@ballle98 @tedstreete @jainvipin Can you please add all the details, thoughts, and debug info that you have on this? We can then start bringing in more people, and we don't want them to have to read the entire Slack history to understand the issue.

@glimchb glimchb self-assigned this Nov 28, 2022
@tedstreete
Contributor

tedstreete commented Dec 5, 2022

@glimchb @ballle98 @jainvipin Here's an initial thought on DPU/HOST DPC behavior for Host-reset, DPU-reset and DPU OS install events.

Host OS Reset or Crash

  • During a Host OS reboot or crash, the Host writes to the Chipset Reset Control Register (i.e., I/O port 0xCF9) or to the reset register described in the ACPI FADT. The Platform (CPLD) firmware can monitor for these events and trigger the Host BMC/BIOS to coordinate with the DPU to take appropriate action. Potential actions include:
    • Ensuring that the DPU is gracefully shut down before a full Host reboot
    • Force-resetting the DPU after a timeout if it has failed to shut down gracefully
    • Allowing the DPU to continue executing if there is no Host/DPU dependency requirement
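
The three actions above can be read as one small decision flow. The following is only a minimal sketch of that flow, not an existing BMC/BIOS interface: the callbacks `request_dpu_shutdown`, `dpu_is_down` and `force_reset_dpu`, the default timeout, and the `dpu_depends_on_host` flag are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def coordinate_host_reset(
    request_dpu_shutdown: Callable[[], None],  # hypothetical: ask the DPU to shut down
    dpu_is_down: Callable[[], bool],           # hypothetical: poll DPU shutdown state
    force_reset_dpu: Callable[[], None],       # hypothetical: hard-reset the DPU
    timeout_s: float = 30.0,                   # assumed value, platform specific
    poll_s: float = 0.5,
    dpu_depends_on_host: bool = True,
) -> str:
    """Return the action taken: 'independent', 'graceful' or 'forced'."""
    # No Host/DPU dependency: let the DPU keep running across the Host reset.
    if not dpu_depends_on_host:
        return "independent"

    # Ask for a graceful shutdown and wait for it up to the timeout.
    request_dpu_shutdown()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if dpu_is_down():
            return "graceful"
        time.sleep(poll_s)

    # The DPU did not shut down in time: fall back to a forced reset.
    force_reset_dpu()
    return "forced"
```

Returning the action taken, rather than nothing, lets the caller log which path was used before the Host proceeds with its own reset.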

DPU OS Reset or Crash

  • The Host (BMC/BIOS) firmware and Host OS need to implement the DPC and hot-plug requirements defined in PCIe Firmware spec 3.3 for these scenarios to work.
  • Coordinated shutdown: the DPU can provide in-band or out-of-band signalling to the Host OS that it intends to restart. The Host OS can respond by either:
    • Ignoring the event if the Host OS intends to continue execution after the remove event, containing the subsequent surprise removal
    • Resetting the Host OS (see above), to coordinate reset of other DPUs on the Host bus
  • DPU OS crash or reset:
    • The DPU will trigger a DPC/CI surprise remove event. The Host OS and BMC/BIOS will need to handle these events gracefully so that the Host OS remains active for long enough for the DPU OS crash dump to complete.
    • If the Host OS is tightly coupled to DPU function, it can elect to shut down gracefully when the DPU crash dump has completed.
    • Alternatively, if the Host OS is loosely coupled or decoupled from the DPU (if the DPU is acting as a bump-in-the-wire service, for example), the Host can elect to contain the event, removing the DPU from the bus table. When the DPU restarts, it will generate a hot-plug event on the PCIe bus, allowing the Host BIOS/OS to reinsert the DPU.
  • Note: This behavior is a largely untested feature in Linux Host OS, and is not available on Windows.
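
To make the tightly coupled versus loosely coupled distinction concrete, here is a minimal sketch of the Host-side policy described in the crash/reset bullets. It only encodes the ordering of steps from the text above; the `Coupling` enum, the function name, and the step strings are illustrative assumptions, not an existing kernel or BMC interface.

```python
from enum import Enum
from typing import List


class Coupling(Enum):
    TIGHT = "tight"   # Host OS cannot usefully run without the DPU
    LOOSE = "loose"   # e.g. DPU acts as a bump-in-the-wire service


def host_steps_on_dpu_crash(coupling: Coupling) -> List[str]:
    """List the Host-side steps, in order, after a DPU surprise-remove event."""
    # Common to both cases: the Host must stay up while the DPU dumps state.
    steps = ["keep Host OS running until DPU crash dump completes"]
    if coupling is Coupling.TIGHT:
        steps.append("shut down Host OS gracefully")
    else:
        steps += [
            "contain the event and remove DPU from the bus table",
            "wait for hot-plug event when DPU restarts",
            "reinsert DPU",
        ]
    return steps
```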

DPU OS install mode

  • A DPU may go through multiple reboots during DPU OS installation. Consequently, the Host OS may choose to bring the DPU and Host PCIe link down prior to DPU OS install, and to hold the link down until the install process has completed.
  • The DPU can use the NC-SI connection to signal to the Host that DPU OS installation has completed, allowing the Host to enable the Host/DPU PCIe link.
  • Alternatively, if the Host/DPU has no NC-SI connection, then the Host can wait for a pre-defined timeout period before enabling the Host/DPU PCIe link.
  • (Added Dec 6) The installer executing on the Host may be able to take down and restore the Host/DPU link without explicit support for DPC in BMC/BIOS.
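
As a rough illustration of the last bullet, a host-side installer on Linux could approximate "link down, wait, link up" with the standard sysfs PCI `remove` and `rescan` files. This is a sketch under assumptions: the `install_done` callback (for example an NC-SI notification) and the device address passed as `bdf` are hypothetical, and a sysfs remove only detaches the device from the kernel's view. It does not physically disable the link, so it may not stop the upstream port from reporting link-down events.

```python
import time
from pathlib import Path
from typing import Callable


def hold_dpu_detached_during_install(
    bdf: str,                          # DPU PCI address, e.g. "0000:3b:00.0" (hypothetical)
    install_done: Callable[[], bool],  # hypothetical completion signal (e.g. via NC-SI)
    timeout_s: float,                  # fallback when no completion signal is available
    poll_s: float = 1.0,
    sysfs_pci: Path = Path("/sys/bus/pci"),
) -> bool:
    """Detach the DPU, wait for install completion or timeout, then rescan.

    Returns True if completion was signalled, False if the timeout expired.
    """
    # Writing "1" to <device>/remove asks the kernel to detach the device.
    (sysfs_pci / "devices" / bdf / "remove").write_text("1")

    signalled = False
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if install_done():
            signalled = True
            break
        time.sleep(poll_s)

    # Writing "1" to rescan asks the kernel to re-enumerate PCI buses.
    (sysfs_pci / "rescan").write_text("1")
    return signalled
```

The `sysfs_pci` parameter exists so the logic can be exercised against a scratch directory; on a real host the default path requires root privileges.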

@glimchb
Member Author

glimchb commented Dec 5, 2022

thanks @tedstreete
this info is useful for https://github.com/opiproject/opi-prov-life/blob/main/BOOTSEQ.md

I was hoping we can use this issue to understand and debug why a FW upgrade/reboot even causes the Host to crash completely.

DPU OS crash or reset.
DPU will trigger DPC/CI surprise remove event. Host OS and BMC/BIOS will need to handle these events gracefully so that Host OS remains active for long enough for the DPU OS crash dump to complete.

Why is DPC not working? Do we have kernel dumps to attach here that show what happens when the DPU reboots and causes the Host to crash?

@tedstreete
Contributor

@glimchb The primary issue is that none of the host OSes listed below properly manages PCIe surprise remove events. The historical expectation is that a failure of a PCIe device will always result in a Host OS crash. The introduction of independently functional devices, like DPUs, breaks that expectation.

  • Linux does have support for DPC, but the functionality is limited. The available functionality can be enabled through the kernel configuration at Linux Kernel Configuration ─> Device Drivers ─> PCI support ─> PCI Express Downstream Port Containment support
  • Windows has no effective support for DPC, other than to blue-screen the OS after writing crash logs
  • ESXi, like Windows, will write crash logs and then restart both host and DPU

OPI will need to determine what behaviors we want the host OS to offer in the event of a DPU crash/reset/graceful restart, and then make the necessary changes to the Linux kernel/PCIe subsystem and the host BIOS/BMC (iDRAC for Dell, iLO for HP, etc.).

@seroyer
Contributor

seroyer commented Dec 5, 2022

Just as a data point, Fedora, CentOS, and RHEL all enable the DPC support by default.

For example, from a RHEL 8.6 host:

$ grep CONFIG_PCIE_DPC /boot/config-4.18.0-372.26.1.el8_6.x86_64
CONFIG_PCIE_DPC=y

And from the tip of rawhide:

$ grep CONFIG_PCIE_DPC kernel-x86_64-*.config
kernel-x86_64-debug-fedora.config:CONFIG_PCIE_DPC=y
kernel-x86_64-debug-rhel.config:CONFIG_PCIE_DPC=y
kernel-x86_64-fedora.config:CONFIG_PCIE_DPC=y
kernel-x86_64-rhel.config:CONFIG_PCIE_DPC=y
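
The same check can be scripted when surveying several hosts. The following is a small sketch that parses kernel config text for the option; the function names are illustrative, and the `/boot/config-<release>` location is an assumption that matches the RHEL example above but is not guaranteed on every distribution. It only reports the build-time option, not whether DPC is actually in use on a given port at runtime.

```python
import platform
from pathlib import Path


def dpc_setting(config_text: str) -> str:
    """Return the CONFIG_PCIE_DPC value from kernel config text, or 'n' if unset."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_PCIE_DPC="):
            return line.split("=", 1)[1]
        if line == "# CONFIG_PCIE_DPC is not set":
            return "n"
    # Option not mentioned at all: treat as disabled.
    return "n"


def running_kernel_config_path() -> Path:
    """Assumed location of the running kernel's config (distribution dependent)."""
    return Path("/boot") / f"config-{platform.release()}"
```

For example, `dpc_setting(running_kernel_config_path().read_text())` would be expected to return `y` on the RHEL 8.6 host shown above.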

@glimchb
Member Author

glimchb commented Dec 5, 2022

Linux does have support for DPC, but the functionality is limited.

@tedstreete Can you please elaborate?

I know Intel is making a lot of improvements in this area in its next generation. Do we have data from AMD as well?

@tedstreete
Contributor

tedstreete commented Dec 5, 2022

@seroyer @glimchb The primary issue is that the default behavior when a surprise removal occurs is to crash the OS. OPI needs to determine what other behaviors we want the kernel to exhibit and then ensure that the kernel/PCIe subsystem/BIOS/BMC offer those options.

Additionally, while it's not mandatory if DPC events are managed gracefully, I'd argue that an ability to disable the Host/DPU PCIe link during DPU OS install/upgrade is a benefit we should explore.
