Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: do not panic if virtio device activation return Err(...) #4665

Merged

Conversation

roypat
Copy link
Contributor

@roypat roypat commented Jul 3, 2024

When the guest driver sets a virtio devices status to DRIVER_OK, we
proceed with calling VirtioDevice::activate. However, our MMIO
transport layer assumes that this activation cannot go wrong, and calls
.expect(...) on the result. For most devices, this is fine, as the
activate method doesn't do much besides writing to an event_fd (and I
can't think of a scenario in which this could go wrong). However, our
vhost-user-blk device has some non-trivial logic inside of its
activate method, which includes communication with the
vhost-user-backend via a unix socket. If this unix socket gets closed
early, this causes activate to return an error, and thus consequently
a panic in the MMIO code.

The virtio spec, in Section 2.2, has the following to say [1]:

The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
that a reset is needed. If DRIVER_OK is set, after it sets
DEVICE_NEEDS_RESET, the device MUST send a device configuration
change notification to the driver.

So the spec-conform way of handling an activation error is setting
the DEVICE_NEEDS_RESET flag in the device_status field (which is what
this commit does).

This will fix the panic, however it will most certainly still not result
in correct device operations (e.g. a vhost-user-backend dying will still
be unrecoverable). This is because Firecracker does not actually
implement device reset, see also #3074. Thus, the device will simply be
"dead" to the guest. But at least Firecracker won't crash anymore.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet
    contribution quality standards.

  • This functionality cannot be added in rust-vmm.

@roypat roypat changed the title No crash on activate failure fix: do not panic if virtio device activation return Err(...) Jul 3, 2024
Combine the `interrupt_evt` and `interrupt_status` methods into a single
method `interrupt_trigger` that returns a `IrqTrigger` reference (which
essentially combines the two objects originally returned by the status
and evt methods).

The advantage to this is that `IrqTrigger` exposes a `trigger_irq`
method, which I'd like to use in the next commit.

Signed-off-by: Patrick Roy <[email protected]>
Copy link

codecov bot commented Jul 3, 2024

Codecov Report

Attention: Patch coverage is 80.64516% with 12 lines in your changes missing coverage. Please review.

Project coverage is 82.11%. Comparing base (dc17a23) to head (8eb2229).

Files Patch % Lines
src/vmm/src/devices/virtio/vsock/device.rs 28.57% 5 Missing ⚠️
src/vmm/src/devices/virtio/block/virtio/device.rs 50.00% 2 Missing ⚠️
src/vmm/src/devices/virtio/net/device.rs 50.00% 2 Missing ⚠️
src/vmm/src/devices/virtio/balloon/device.rs 66.66% 1 Missing ⚠️
src/vmm/src/devices/virtio/block/device.rs 66.66% 1 Missing ⚠️
src/vmm/src/devices/virtio/rng/device.rs 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4665      +/-   ##
==========================================
+ Coverage   82.08%   82.11%   +0.02%     
==========================================
  Files         255      255              
  Lines       31267    31261       -6     
==========================================
+ Hits        25666    25670       +4     
+ Misses       5601     5591      -10     
Flag Coverage Δ
4.14-c5n.metal 79.61% <80.64%> (+0.03%) ⬆️
4.14-m5n.metal 79.59% <80.64%> (+0.03%) ⬆️
4.14-m6a.metal 78.81% <80.64%> (+0.02%) ⬆️
4.14-m6g.metal 76.63% <80.64%> (+0.03%) ⬆️
4.14-m6i.metal 79.59% <80.64%> (+0.03%) ⬆️
4.14-m7g.metal 76.63% <80.64%> (+0.03%) ⬆️
5.10-c5n.metal 82.12% <80.64%> (+0.03%) ⬆️
5.10-m5n.metal 82.11% <80.64%> (+0.03%) ⬆️
5.10-m6a.metal 81.41% <80.64%> (+0.02%) ⬆️
5.10-m6g.metal 79.41% <80.64%> (+0.03%) ⬆️
5.10-m6i.metal 82.10% <80.64%> (+0.03%) ⬆️
5.10-m7g.metal 79.41% <80.64%> (+0.03%) ⬆️
6.1-c5n.metal 82.12% <80.64%> (+0.03%) ⬆️
6.1-m5n.metal 82.11% <80.64%> (+0.02%) ⬆️
6.1-m6a.metal 81.41% <80.64%> (+0.03%) ⬆️
6.1-m6g.metal 79.40% <80.64%> (+0.03%) ⬆️
6.1-m6i.metal 82.10% <80.64%> (+0.02%) ⬆️
6.1-m7g.metal 79.41% <80.64%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Correctly mark the doc comment as an inner doc comment for the enclosing
macro, instead of an outer doc comment for the second import.

Signed-off-by: Patrick Roy <[email protected]>
When the guest driver sets a virtio devices status to `DRIVER_OK`, we
proceed with calling `VirtioDevice::activate`. However, our MMIO
transport layer assumes that this activation cannot go wrong, and calls
`.expect(...)` on the result. For most devices, this is fine, as the
activate method doesn't do much besides writing to an event_fd (and I
can't think of a scenario in which this could go wrong). However, our
vhost-user-blk device has some non-trivial logic inside of its
`activate` method, which includes communication with the
vhost-user-backend via a unix socket. If this unix socket gets closed
early, this causes `activate` to return an error, and thus consequently
a panic in the MMIO code.

The virtio spec, in Section 2.2, has the following to say [1]:

> The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
  that a reset is needed. If DRIVER_OK is set, after it sets
  DEVICE_NEEDS_RESET, the device MUST send a device configuration
  change notification to the driver.

So the spec-conform way of handling an activation error is setting
the `DEVICE_NEEDS_RESET` flag in the device_status field (which is what
this commit does).

This will fix the panic, however it will most certainly still not result
in correct device operations (e.g. a vhost-user-backend dying will still
be unrecoverable). This is because Firecracker does not actually
implement device reset, see also firecracker-microvm#3074. Thus, the device will simply be
"dead" to the guest: All operations where we currently simply abort and
do nothing if the device is in the FAILED state will do the same in the
DEVICE_NEEDS_RESET state.

[1]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.pdf

Signed-off-by: Patrick Roy <[email protected]>
Add a unittest to deal with the case where virtio device activation
fails. In this case, the device state needs to be put to
DEVICE_NEEDS_RESET, and an interrupt should have been generated.

Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat force-pushed the no-crash-on-activate-failure branch from 776b969 to 53338f7 Compare July 3, 2024 17:04
@roypat roypat marked this pull request as ready for review July 3, 2024 17:05
Copy link
Contributor

@kalyazin kalyazin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however it will most certainly still not result
in correct device operations (e.g. a vhost-user-backend dying will still
be unrecoverable)

When you speak about recovery, do you mean Firecracker will need to try to reconnect to the backend when the device is reset? This isn't going to solve the problem for test_vhost_user_block_disconnect where the backend never comes back though?

src/vmm/src/devices/virtio/mmio.rs Outdated Show resolved Hide resolved
src/vmm/src/devices/virtio/mmio.rs Outdated Show resolved Hide resolved
@roypat
Copy link
Contributor Author

roypat commented Jul 4, 2024

however it will most certainly still not result
in correct device operations (e.g. a vhost-user-backend dying will still
be unrecoverable)

When you speak about recovery, do you mean Firecracker will need to try to reconnect to the backend when the device is reset? This isn't going to solve the problem for test_vhost_user_block_disconnect where the backend never comes back though?

Yeah, it won't solve the problem where the backend never comes online, I would expect in this case the device will just remain in either DEVICE_NEEDS_RESET or FAILED indefinitely. It's hard to tell, because Linux actually also does not implement this part of the virtio spec, lol. I'd expect a well-behaving guest to not retry indefinitely, and just mark the device as FAILED after, say, 3 failed resets.

Log activation failures at the only call-site of `activate`, instead of
inside each individual `activate` function. For this, untangle some of
the `ActivationError` variants - `BadActivate` was almost exclusively
used in the case where writing to the activation eventfd failed, except
in the vsock device, where it was also used to indicate that the number
of queues the guest gave us was wrong (which this commit factors out
into its own error variant).

Signed-off-by: Patrick Roy <[email protected]>
Make sure that whenever the activation of a virtio device fails, we set
the `activate_fails` metric.

Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat force-pushed the no-crash-on-activate-failure branch from e97b8ef to 8eb2229 Compare July 4, 2024 12:56
@roypat roypat merged commit 5b74fef into firecracker-microvm:main Jul 4, 2024
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants