Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: do not panic if virtio device activation return Err(...) #4665

Merged

Commits on Jul 3, 2024

  1. refactor: Simplify VirtioDevice trait by combining methods

    Combine the `interrupt_evt` and `interrupt_status` methods into a single
    method `interrupt_trigger` that returns a `IrqTrigger` reference (which
    essentially combines the two objects originally returned by the status
    and evt methods).
    
    The advantage to this is that `IrqTrigger` exposes a `trigger_irq`
    method, which I'd like to use in the next commit.
    
    Signed-off-by: Patrick Roy <[email protected]>
    roypat committed Jul 3, 2024
    Configuration menu
    Copy the full SHA
    15e0c73 View commit details
    Browse the repository at this point in the history
  2. doc: fix doc comment in vsock/device.rs

    Correctly mark the doc comment as an inner doc comment for the enclosing
    macro, instead of an outer doc comment for the second import.
    
    Signed-off-by: Patrick Roy <[email protected]>
    roypat committed Jul 3, 2024
    Configuration menu
    Copy the full SHA
    7c03f82 View commit details
    Browse the repository at this point in the history
  3. fix: do not panic if virtio device activation return Err(...)

    When the guest driver sets a virtio devices status to `DRIVER_OK`, we
    proceed with calling `VirtioDevice::activate`. However, our MMIO
    transport layer assumes that this activation cannot go wrong, and calls
    `.expect(...)` on the result. For most devices, this is fine, as the
    activate method doesn't do much besides writing to an event_fd (and I
    can't think of a scenario in which this could go wrong). However, our
    vhost-user-blk device has some non-trivial logic inside of its
    `activate` method, which includes communication with the
    vhost-user-backend via a unix socket. If this unix socket gets closed
    early, this causes `activate` to return an error, and thus consequently
    a panic in the MMIO code.
    
    The virtio spec, in Section 2.2, has the following to say [1]:
    
    > The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
      that a reset is needed. If DRIVER_OK is set, after it sets
      DEVICE_NEEDS_RESET, the device MUST send a device configuration
      change notification to the driver.
    
    So the spec-conform way of handling an activation error is setting
    the `DEVICE_NEEDS_RESET` flag in the device_status field (which is what
    this commit does).
    
    This will fix the panic, however it will most certainly still not result
    in correct device operations (e.g. a vhost-user-backend dying will still
    be unrecoverable). This is because Firecracker does not actually
    implement device reset, see also firecracker-microvm#3074. Thus, the device will simply be
    "dead" to the guest: All operations where we currently simply abort and
    do nothing if the device is in the FAILED state will do the same in the
    DEVICE_NEEDS_RESET state.
    
    [1]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.pdf
    
    Signed-off-by: Patrick Roy <[email protected]>
    roypat committed Jul 3, 2024
    Configuration menu
    Copy the full SHA
    297db95 View commit details
    Browse the repository at this point in the history
  4. test: ensure virtio device activation failure is handled correctly

    Add a unittest to deal with the case where virtio device activation
    fails. In this case, the device state needs to be put to
    DEVICE_NEEDS_RESET, and an interrupt should have been generated.
    
    Signed-off-by: Patrick Roy <[email protected]>
    roypat committed Jul 3, 2024
    Configuration menu
    Copy the full SHA
    53338f7 View commit details
    Browse the repository at this point in the history

Commits on Jul 4, 2024

  1. refactor: centralize activation failure logging

    Log activation failures at the only call-site of `activate`, instead of
    inside each individual `activate` function. For this, untangle some of
    the `ActivationError` variants - `BadActivate` was almost exclusively
    used in the case where writing to the activation eventfd failed, except
    in the vsock device, where it was also used to indicate that the number
    of queues the guest gave us was wrong (which this commit factors out
    into its own error variant).
    
    Signed-off-by: Patrick Roy <[email protected]>
    roypat committed Jul 4, 2024
    Configuration menu
    Copy the full SHA
    057d0d8 View commit details
    Browse the repository at this point in the history
  2. fix: insert missing activate_fails metrics increases

    Make sure that whenever the activation of a virtio device fails, we set
    the `activate_fails` metric.
    
    Signed-off-by: Patrick Roy <[email protected]>
    roypat committed Jul 4, 2024
    Configuration menu
    Copy the full SHA
    8eb2229 View commit details
    Browse the repository at this point in the history