fix: do not panic if virtio device activation return Err(...) #4665

roypat · 2024-07-03T16:32:29Z

When the guest driver sets a virtio device's status to `DRIVER_OK`, we
proceed with calling `VirtioDevice::activate`. However, our MMIO
transport layer assumes that this activation cannot go wrong, and calls
`.expect(...)` on the result. For most devices, this is fine, as the
`activate` method doesn't do much besides writing to an event_fd (and I
can't think of a scenario in which this could go wrong). However, our
vhost-user-blk device has some non-trivial logic inside of its
`activate` method, which includes communication with the
vhost-user-backend via a unix socket. If this unix socket gets closed
early, `activate` returns an error, which in turn causes a panic in the
MMIO code.

The virtio spec, in Section 2.2, has the following to say [1]:

> The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> that a reset is needed. If DRIVER_OK is set, after it sets
> DEVICE_NEEDS_RESET, the device MUST send a device configuration
> change notification to the driver.

So the spec-conformant way of handling an activation error is to set
the `DEVICE_NEEDS_RESET` flag in the device_status field (which is what
this commit does).

This will fix the panic; however, it will most certainly still not result
in correct device operations (e.g. a vhost-user-backend dying will still
be unrecoverable). This is because Firecracker does not actually
implement device reset (see also #3074). Thus, the device will simply be
"dead" to the guest. But at least Firecracker won't crash anymore.
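
A minimal sketch of the handling described above, assuming a heavily simplified transport. `Transport`, `Device`, `FailingDevice` and `WorkingDevice` are hypothetical stand-ins and not Firecracker's actual types; only the bit values are taken from the virtio spec.

```rust
// Device status bits as defined by the virtio specification.
const ACKNOWLEDGE: u32 = 1;
const DRIVER: u32 = 2;
const DRIVER_OK: u32 = 4;
const FEATURES_OK: u32 = 8;
const DEVICE_NEEDS_RESET: u32 = 64;

// virtio-mmio InterruptStatus bit 1: configuration change notification.
const VIRTIO_MMIO_INT_CONFIG: u32 = 0x02;

/// Hypothetical stand-in for a device whose activation may fail.
trait Device {
    fn activate(&mut self) -> Result<(), String>;
}

/// Hypothetical, heavily simplified transport (an assumption for illustration).
struct Transport<D: Device> {
    device: D,
    device_status: u32,
    interrupt_status: u32,
}

impl<D: Device> Transport<D> {
    fn new(device: D) -> Self {
        Transport { device, device_status: 0, interrupt_status: 0 }
    }

    /// Called when the guest driver writes the status register.
    fn set_device_status(&mut self, status: u32) {
        const READY: u32 = ACKNOWLEDGE | DRIVER | FEATURES_OK;
        let activating = self.device_status == READY && status == (READY | DRIVER_OK);
        self.device_status = status;
        if activating {
            // Previously the equivalent of `.expect(...)`; now the error is handled.
            if let Err(err) = self.device.activate() {
                eprintln!("failed to activate virtio device: {err}");
                self.device_status |= DEVICE_NEEDS_RESET;
                // DRIVER_OK is set, so a config change notification is required.
                self.interrupt_status |= VIRTIO_MMIO_INT_CONFIG;
            }
        }
    }
}

/// Simulates a backend whose socket was closed early.
struct FailingDevice;
impl Device for FailingDevice {
    fn activate(&mut self) -> Result<(), String> {
        Err("backend socket closed".to_string())
    }
}

/// Simulates a device whose activation succeeds.
struct WorkingDevice;
impl Device for WorkingDevice {
    fn activate(&mut self) -> Result<(), String> {
        Ok(())
    }
}
```

With `FailingDevice`, writing `DRIVER_OK` leaves the transport with `DEVICE_NEEDS_RESET` set and a pending configuration interrupt instead of panicking; with `WorkingDevice` neither is set.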

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

If a specific issue led to this PR, this PR closes the issue.
The description of changes is clear and encompassing.
Any required documentation changes (code and docs) are included in this PR.
API changes follow the Runbook for Firecracker API changes.
User-facing changes are mentioned in CHANGELOG.md.
All added/changed functionality is tested.
New TODOs link to an issue.
Commits meet contribution quality standards.

This functionality cannot be added in rust-vmm.

Combine the `interrupt_evt` and `interrupt_status` methods into a single method `interrupt_trigger` that returns an `IrqTrigger` reference (which essentially combines the two objects originally returned by the status and evt methods). The advantage of this is that `IrqTrigger` exposes a `trigger_irq` method, which I'd like to use in the next commit.

Signed-off-by: Patrick Roy <[email protected]>
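
To illustrate the shape of this refactoring, here is a rough sketch. Only the names `interrupt_trigger`, `IrqTrigger` and `trigger_irq` come from the commit message; the fields, the `IrqType` enum, `DummyDevice`, and the counter used in place of an eventfd are assumptions made so the example has no dependencies.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Kinds of interrupt, mapped to virtio-mmio InterruptStatus bits (assumed enum).
#[derive(Clone, Copy, Debug)]
enum IrqType {
    /// Bit 1: configuration change notification.
    Config,
    /// Bit 0: used buffer notification.
    Vring,
}

/// Hypothetical stand-in bundling the interrupt status word and the event
/// used to signal the guest. A counter takes the place of the eventfd here.
struct IrqTrigger {
    irq_status: AtomicU32,
    irq_evt: AtomicU32,
}

impl IrqTrigger {
    fn new() -> Self {
        IrqTrigger { irq_status: AtomicU32::new(0), irq_evt: AtomicU32::new(0) }
    }

    /// Sets the matching status bit, then signals the event.
    fn trigger_irq(&self, irq_type: IrqType) {
        let bit = match irq_type {
            IrqType::Config => 0x02,
            IrqType::Vring => 0x01,
        };
        self.irq_status.fetch_or(bit, Ordering::SeqCst);
        self.irq_evt.fetch_add(1, Ordering::SeqCst);
    }

    /// Current interrupt status bits.
    fn status(&self) -> u32 {
        self.irq_status.load(Ordering::SeqCst)
    }

    /// Number of times the event was signalled.
    fn signalled(&self) -> u32 {
        self.irq_evt.load(Ordering::SeqCst)
    }
}

/// One accessor instead of separate status and event accessors.
trait VirtioDevice {
    fn interrupt_trigger(&self) -> &IrqTrigger;
}

/// Hypothetical device used only for this example.
struct DummyDevice {
    irq_trigger: IrqTrigger,
}

impl DummyDevice {
    fn new() -> Self {
        DummyDevice { irq_trigger: IrqTrigger::new() }
    }
}

impl VirtioDevice for DummyDevice {
    fn interrupt_trigger(&self) -> &IrqTrigger {
        &self.irq_trigger
    }
}
```

A caller that only holds a `VirtioDevice` can then raise an interrupt with `device.interrupt_trigger().trigger_irq(IrqType::Config)`, without touching the status word and the event separately.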

codecov · 2024-07-03T16:46:44Z

Codecov Report

Attention: Patch coverage is 80.64516% with 12 lines in your changes missing coverage. Please review.

Project coverage is 82.11%. Comparing base (dc17a23) to head (8eb2229).

| Files | Patch % | Lines |
|---|---|---|
| src/vmm/src/devices/virtio/vsock/device.rs | 28.57% | 5 Missing ⚠️ |
| src/vmm/src/devices/virtio/block/virtio/device.rs | 50.00% | 2 Missing ⚠️ |
| src/vmm/src/devices/virtio/net/device.rs | 50.00% | 2 Missing ⚠️ |
| src/vmm/src/devices/virtio/balloon/device.rs | 66.66% | 1 Missing ⚠️ |
| src/vmm/src/devices/virtio/block/device.rs | 66.66% | 1 Missing ⚠️ |
| src/vmm/src/devices/virtio/rng/device.rs | 75.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4665      +/-   ##
==========================================
+ Coverage   82.08%   82.11%   +0.02%     
==========================================
  Files         255      255              
  Lines       31267    31261       -6     
==========================================
+ Hits        25666    25670       +4     
+ Misses       5601     5591      -10

| Flag | Coverage Δ | |
|---|---|---|
| 4.14-c5n.metal | `79.61% <80.64%> (+0.03%)` | ⬆️ |
| 4.14-m5n.metal | `79.59% <80.64%> (+0.03%)` | ⬆️ |
| 4.14-m6a.metal | `78.81% <80.64%> (+0.02%)` | ⬆️ |
| 4.14-m6g.metal | `76.63% <80.64%> (+0.03%)` | ⬆️ |
| 4.14-m6i.metal | `79.59% <80.64%> (+0.03%)` | ⬆️ |
| 4.14-m7g.metal | `76.63% <80.64%> (+0.03%)` | ⬆️ |
| 5.10-c5n.metal | `82.12% <80.64%> (+0.03%)` | ⬆️ |
| 5.10-m5n.metal | `82.11% <80.64%> (+0.03%)` | ⬆️ |
| 5.10-m6a.metal | `81.41% <80.64%> (+0.02%)` | ⬆️ |
| 5.10-m6g.metal | `79.41% <80.64%> (+0.03%)` | ⬆️ |
| 5.10-m6i.metal | `82.10% <80.64%> (+0.03%)` | ⬆️ |
| 5.10-m7g.metal | `79.41% <80.64%> (+0.03%)` | ⬆️ |
| 6.1-c5n.metal | `82.12% <80.64%> (+0.03%)` | ⬆️ |
| 6.1-m5n.metal | `82.11% <80.64%> (+0.02%)` | ⬆️ |
| 6.1-m6a.metal | `81.41% <80.64%> (+0.03%)` | ⬆️ |
| 6.1-m6g.metal | `79.40% <80.64%> (+0.03%)` | ⬆️ |
| 6.1-m6i.metal | `82.10% <80.64%> (+0.02%)` | ⬆️ |
| 6.1-m7g.metal | `79.41% <80.64%> (+0.03%)` | ⬆️ |

Correctly mark the doc comment as an inner doc comment for the enclosing macro, instead of an outer doc comment for the second import.

Signed-off-by: Patrick Roy <[email protected]>

When the guest driver sets a virtio device's status to `DRIVER_OK`, we proceed with calling `VirtioDevice::activate`. However, our MMIO transport layer assumes that this activation cannot go wrong, and calls `.expect(...)` on the result. For most devices, this is fine, as the `activate` method doesn't do much besides writing to an event_fd (and I can't think of a scenario in which this could go wrong). However, our vhost-user-blk device has some non-trivial logic inside of its `activate` method, which includes communication with the vhost-user-backend via a unix socket. If this unix socket gets closed early, `activate` returns an error, which in turn causes a panic in the MMIO code.

The virtio spec, in Section 2.2, has the following to say [1]:

> The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state that a reset is needed. If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device MUST send a device configuration change notification to the driver.

So the spec-conformant way of handling an activation error is to set the `DEVICE_NEEDS_RESET` flag in the device_status field (which is what this commit does).

This will fix the panic; however, it will most certainly still not result in correct device operations (e.g. a vhost-user-backend dying will still be unrecoverable). This is because Firecracker does not actually implement device reset (see also firecracker-microvm#3074). Thus, the device will simply be "dead" to the guest: all operations where we currently simply abort and do nothing if the device is in the FAILED state will do the same in the DEVICE_NEEDS_RESET state.

[1]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.pdf

Signed-off-by: Patrick Roy <[email protected]>

Add a unit test for the case where virtio device activation fails. In this case, the device status needs to be set to DEVICE_NEEDS_RESET, and an interrupt should have been generated.

Signed-off-by: Patrick Roy <[email protected]>

kalyazin

> however it will most certainly still not result
> in correct device operations (e.g. a vhost-user-backend dying will still
> be unrecoverable)

When you speak about recovery, do you mean Firecracker will need to try to reconnect to the backend when the device is reset? This isn't going to solve the problem for test_vhost_user_block_disconnect where the backend never comes back though?

src/vmm/src/devices/virtio/mmio.rs

roypat · 2024-07-04T09:55:35Z

> > however it will most certainly still not result
> > in correct device operations (e.g. a vhost-user-backend dying will still
> > be unrecoverable)
>
> When you speak about recovery, do you mean Firecracker will need to try to reconnect to the backend when the device is reset? This isn't going to solve the problem for test_vhost_user_block_disconnect where the backend never comes back though?

Yeah, it won't solve the problem where the backend never comes online. I would expect that in this case the device will just remain in either DEVICE_NEEDS_RESET or FAILED indefinitely. It's hard to tell, because Linux actually also does not implement this part of the virtio spec, lol. I'd expect a well-behaved guest to not retry indefinitely, and to just mark the device as FAILED after, say, 3 failed resets.
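
The guest behaviour speculated about here could look roughly like the sketch below. It is purely hypothetical: the comment itself notes that Linux does not implement this part of the spec, and `recover`, `reset_and_init` and `MAX_RESET_ATTEMPTS` are names made up for illustration. Only the status bit values come from the virtio spec.

```rust
// Device status bits as defined by the virtio specification.
const DEVICE_NEEDS_RESET: u32 = 64;
const FAILED: u32 = 128;

/// Assumed retry budget, matching the "say, 3 failed resets" guess above.
const MAX_RESET_ATTEMPTS: u32 = 3;

/// Hypothetical guest-side recovery loop. `reset_and_init` stands for
/// "reset the device, redo initialization, and return the resulting status".
/// Returns the final status, or FAILED once the retry budget is exhausted.
fn recover(mut reset_and_init: impl FnMut() -> u32) -> u32 {
    for _ in 0..MAX_RESET_ATTEMPTS {
        let status = reset_and_init();
        if (status & DEVICE_NEEDS_RESET) == 0 {
            // The device came back up; keep using it.
            return status;
        }
    }
    // The driver gives up on the device.
    FAILED
}
```

With a backend that never comes back, every attempt ends in `DEVICE_NEEDS_RESET` again and the loop ends with the device marked `FAILED` after the third try.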

src/vmm/src/devices/virtio/mmio.rs

Log activation failures at the only call-site of `activate`, instead of inside each individual `activate` function. For this, untangle some of the `ActivationError` variants - `BadActivate` was almost exclusively used in the case where writing to the activation eventfd failed, except in the vsock device, where it was also used to indicate that the number of queues the guest gave us was wrong (which this commit factors out into its own error variant).

Signed-off-by: Patrick Roy <[email protected]>

Make sure that whenever the activation of a virtio device fails, we set the `activate_fails` metric.

Signed-off-by: Patrick Roy <[email protected]>
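
The two commits above can be pictured with the following sketch. It is an illustration under assumptions: the variant names `EventFd` and `QueueMismatch`, the helper functions, and the global counter standing in for the `activate_fails` metric are all hypothetical. Where exactly the real code bumps the metric is not stated here; the sketch does it next to the logging purely for brevity.

```rust
use std::fmt;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical error type with one variant per failure cause,
/// instead of a single catch-all variant.
#[derive(Debug)]
enum ActivateError {
    /// Writing to the activation eventfd failed.
    EventFd(std::io::Error),
    /// The guest configured an unexpected number of queues.
    QueueMismatch { expected: usize, got: usize },
}

impl From<std::io::Error> for ActivateError {
    fn from(err: std::io::Error) -> Self {
        ActivateError::EventFd(err)
    }
}

impl fmt::Display for ActivateError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ActivateError::EventFd(err) => {
                write!(f, "writing to the activation eventfd failed: {err}")
            }
            ActivateError::QueueMismatch { expected, got } => {
                write!(f, "expected {expected} queues, got {got}")
            }
        }
    }
}

/// Hypothetical stand-in for the `activate_fails` metric.
static ACTIVATE_FAILS: AtomicU64 = AtomicU64::new(0);

/// Reads the stand-in metric.
fn activate_fails() -> u64 {
    ACTIVATE_FAILS.load(Ordering::Relaxed)
}

/// Device-side check: returns an error without logging anything itself.
fn check_queue_count(expected: usize, got: usize) -> Result<(), ActivateError> {
    if expected == got {
        Ok(())
    } else {
        Err(ActivateError::QueueMismatch { expected, got })
    }
}

/// The single call site: failures are logged and counted here.
/// Returns whether activation succeeded.
fn activate_and_report(result: Result<(), ActivateError>) -> bool {
    match result {
        Ok(()) => true,
        Err(err) => {
            eprintln!("Failed to activate virtio device: {err}");
            ACTIVATE_FAILS.fetch_add(1, Ordering::Relaxed);
            false
        }
    }
}
```

Keeping the log statement at the one call site means each device only has to return a descriptive error, and the message format stays uniform across devices.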

roypat changed the title ~~No crash on activate failure~~ fix: do not panic if virtio device activation return Err(...) Jul 3, 2024

roypat added 3 commits July 3, 2024 17:50

doc: fix doc comment in vsock/device.rs

7c03f82

roypat force-pushed the no-crash-on-activate-failure branch from 776b969 to 53338f7 Compare July 3, 2024 17:04

roypat marked this pull request as ready for review July 3, 2024 17:05

zulinx86 requested review from kalyazin and ShadowCurse July 4, 2024 08:52

kalyazin reviewed Jul 4, 2024

ShadowCurse reviewed Jul 4, 2024

ShadowCurse reviewed Jul 4, 2024

roypat added 2 commits July 4, 2024 13:56

fix: insert missing activate_fails metrics increases

8eb2229

roypat force-pushed the no-crash-on-activate-failure branch from e97b8ef to 8eb2229 Compare July 4, 2024 12:56

ShadowCurse approved these changes Jul 4, 2024

kalyazin approved these changes Jul 4, 2024

roypat merged commit 5b74fef into firecracker-microvm:main Jul 4, 2024
6 of 7 checks passed
