Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the bootloader hanging during uefi exit_boot_services #457

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

Wasabi375
Copy link
Contributor

Fixes #400

Switch from using the "-bios" flag to using "-pflash" instead.
Using "-bios" is not recomended see:
https://lists.katacontainers.io/pipermail/kata-dev/2021-January/001650.html

This also allows us to easily update the ovmf-prebuild binaries.

This fixes #400. I'm not sure whether the issue was an outdated ovmf
version or the bios-pflash change. I can't tell what version is
published with the ovmf-prebuild crate, so I can't check.

Ovmf binaries are no longer provided by the ovmf-prebuild crate.
Instead we now download them from the ovmf-prebuild github releases of
the same crate. Unlike the crate, those are actually updated.

Tests will now fail with a timeout error after 2 minutes. That should
bee more than enough to run any test.
This also switches to the new API of the uefi crate that was
introduced in version 0.31. The old api was deprecated in 0.32 and
will be removed in 0.33.
See https://github.com/rust-osdev/uefi-rs/blob/ac19526656953c32e8e0ef7bc703f7700b151ae4/docs/funcs_migration.md
and rust-osdev/uefi-rs#893
Switch from using the "-bios" flag to using "-pflash" instead.
Using "-bios" is not recomended see:
https://lists.katacontainers.io/pipermail/kata-dev/2021-January/001650.html

This also allows us to easily update the ovmf-prebuild binaries.

This fixes rust-osdev#400. I'm not sure whether the issue was an outdated ovmf
version or the bios-pflash change. I can't tell what version is
published with the ovmf-prebuild crate, so I can't check.

Ovmf binaries are no longer provided by the ovmf-prebuild crate.
Instead we now download them from the ovmf-prebuild github releases of
the same crate. Unlike the crate, those are actually updated.

Fixes rust-osdev#400
@Wasabi375
Copy link
Contributor Author

The test/runner/src/ovmf.rs file is based on https://github.com/rust-osdev/uefi-rs/blob/ac19526656953c32e8e0ef7bc703f7700b151ae4/xtask/src/qemu.rs#L65

I am 99% certain that this fixes the issue in #400. However I want to implement the same fix in my own kernel to make sure.

I also still need to update https://github.com/rust-osdev/bootloader/blob/main/docs/create-disk-image.md

@Wasabi375
Copy link
Contributor Author

Ok, I can confirm that using pflash works on my machine. I don't really understand why the CI tests fails. Maybe someone else can try this fix on their machine and give me feedback. The CI log is not really helpful.

@Wasabi375 Wasabi375 marked this pull request as ready for review September 17, 2024 06:51
@Wasabi375
Copy link
Contributor Author

I wonder if we should provide most of the code in test/runner/src/ovmf.rs as a library. Maybe as an updated version of ovmf-prebuild.
Otherwise this is a lot of code that needs to be copied into any project using the bootloader.

@@ -105,6 +108,8 @@ where
"-display",
"none",
"--no-reboot",
"-smp",
"8",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blog os at least doesn't support more than 1 core.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why that matters here. This is just for the tests in this repo and has no impact on any kernel build with this bootloader. Also correct me if I am wrong but Blog Os should work fine on a multicore system. It will use just 1, but the other cores don't do anything unless specifically started by the kernel.

.kill()
.expect("Qemu could not be killed after timeout");
panic!("Test timed out after 2 minutes");
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rust's default test runner never times out. At worst it shows a warning after 60s. A timeout means that if you suspend the testing for longer than 120s using ctrl-z, you get an unnnecessary test failure. Also depending on how complex the tests of your OS are you may genuinely need more than 120s on low-end hardware. On local systems you can always kill the test manually and on CI there should be a timeout from the CI provider anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind removing this. I don't think the crtl-z case is common, but maybe that just comes down to preference. I never use that myself.
I mostly added this so I could easily run the tests locally in the background and get notified about failures.
This bug is random in nature so I ran the tests locally in a loop so I needed some kind of failure state.

As for CI the timeout is 30minutes and that seems excessive. Especially when like here the CPU is maxed out by the hanging test.

Also depending on how complex the tests of your OS are you may genuinely need more than 120s on low-end hardware.

That might actually explain the failing CI tests. So maybe I should remove this.

sha2 = "0.10.8"
tar = "0.4.41"
lzma-rs = "0.3.0"
tempfile = "3.12.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this downloading logic should be added to the ovmf-prebuilt repo instead. There is already a stub crate for this in that repo anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Maybe that should be part of a separate PR though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


// Make a copy of the OVMF vars file so that it can be used
// read+write without modifying the original. Under AArch64, some
// versions of OVMF won't boot if the vars file isn't writeable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe reading the file and writing it to a new file would work to avoid an explicit permission change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean. I am creating a copy of the original file. Or are you referring to this?

// Necessary, as for example on NixOS, the files are read-only inside
// the Nix store.
#[cfg(target_os = "linux")]
fs_err::set_permissions(&ovmf_vars, Permissions::from_mode(0o666))?;

I copied that from the uefi crates tests. I did not really look into this too much. I assume they had some issues that meant they needed to add this.

Copy link
Contributor

@bjorn3 bjorn3 Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use fs::copy you make a copy of the file including pernissions and xattrs. It seems that here a copy of just the file contents ignoring file permissions would be a better option. That can be done by reading the file and writing it again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that would be any cleaner though. A simple call to set_permissions is cleaner in my opinion.
Also this way we get some possible advantages from COW file systems, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least make it #[cfg(unix)] then.

tests/runner/Cargo.toml Outdated Show resolved Hide resolved
use {std::fs::Permissions, std::os::unix::fs::PermissionsExt};

/// Name of the ovmf-prebuilt release tag.
const OVMF_PREBUILT_TAG: &str = "edk2-stable202311-r2";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty outdated version, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's about a year old. I chose this version as it is the one the uefi crate uses in their tests.
We can probably update this. It just wasn't a priority.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested the latest tarball version(edk2-stable202405-r1) released to https://github.com/rust-osdev/ovmf-prebuilt.

It breaks the uefi pxe tests. I don't really know enough about pxe to debug this, but it looks like it does not find the bootloader and tries to start a uefi terminal.

@Wasabi375
Copy link
Contributor Author

Do you know how I can get the actual logs from qemu in the CI ? Right now the tests time after 30 minutes but I can't reproduce this locally.


[[package]]
name = "windows-targets"
version = "0.52.6"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now pulling in an extra copy of windows-targets. Maybe update the async-process dependency of bootloader to 2.3.0, which uses windows-targets 0.52.6 too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Boot prcess gets sometimes stuck at "TRACE: exiting boot services" when using Uefi and qemu
4 participants