
need a way to trigger cleanup and next steps for vanished instances #4872

Closed · davepacheco opened this issue Jan 23, 2024 · 7 comments
Labels: known issue (To include in customer documentation and training), virtualization (Propolis Integration & VM Management)

@davepacheco (Collaborator) commented Jan 23, 2024:

There are various cases where the control plane might discover that Instances that we thought were running are, in fact, not. Examples (from concrete to speculative):

  • when Nexus is notified by Sled Agent (in turn notified from Propolis) that a guest is no longer running (because it was stopped from inside the guest)
  • when a sled reboots, its instances are all gone and this process needs to be applied to each of them (this is #3633, "restart customer Instances after sled reboot")
  • when we implement support for (ungraceful) sled removal, we'll need to apply this to all of the instances that we thought were on the system that's been removed
  • we probably want some periodic task that asks Sled Agent what instances it's running and compares that to what we think it should be running
  • we may want a periodic task that monitors sleds and, if we find that they seem to be unexpectedly not responding, takes steps to make sure that they're offline (i.e., power cycles them via Ignition) and then applies this process for any instances on them

Relaying some notes from @gjcolombo (my apologies where I've got these details wrong):

  • a code path exists today that handles the first case above (where Nexus handles a message from Sled Agent saying the instance is no longer running)
  • this code path has some issues and is not well factored for use in other contexts like the ones we need here
  • all of this would be reworked by #3742 (Tracking: Instance Lifecycle Overhaul)
  • the "cleanup" referred to here includes: foreign key from instance to vmm table needs to be reaped; networking state needs to be cleaned up; provisioning counters and sled resources need to be cleaned up; etc.
  • the "next steps" referred to here include: looking at some per-instance "fault discipline" (currently the "boot_on_fault" boolean) to determine if we should start the instance running again somewhere else and, if so, doing that

We may view this issue as a dup of #3742 but I wanted to file this separate ticket to reflect more precisely what's needed for sled removal. The implementation may well just be #3742.
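
For concreteness, here is a rough Rust sketch of the "cleanup and next steps" flow being asked for. Every type and function name below is hypothetical and stands in for real database updates and sagas; none of it is actual Nexus code.

```rust
// Hypothetical sketch of the "cleanup and next steps" flow for a VMM that the
// control plane has decided is gone. None of these names are real Nexus code.

#[derive(Debug, Clone, Copy)]
enum FaultPolicy {
    /// Leave the instance stopped/failed for the user to deal with.
    DoNothing,
    /// Try to start the instance again on another sled.
    RestartElsewhere,
}

struct VanishedVmm {
    instance_id: u64,
    vmm_id: u64,
    policy: FaultPolicy,
}

fn handle_vanished_vmm(vmm: &VanishedVmm) {
    // "Cleanup": unlink the instance from the dead VMM record, tear down
    // networking state, and release provisioning counters / sled resources.
    unlink_instance_from_vmm(vmm.instance_id, vmm.vmm_id);
    clean_up_networking_state(vmm.instance_id);
    release_provisioning_and_sled_resources(vmm.instance_id);

    // "Next steps": consult the per-instance fault discipline (today the
    // `boot_on_fault` boolean) to decide whether to restart it elsewhere.
    match vmm.policy {
        FaultPolicy::DoNothing => {}
        FaultPolicy::RestartElsewhere => start_instance_on_another_sled(vmm.instance_id),
    }
}

// Stubs standing in for the real database updates and sagas.
fn unlink_instance_from_vmm(_instance: u64, _vmm: u64) {}
fn clean_up_networking_state(_instance: u64) {}
fn release_provisioning_and_sled_resources(_instance: u64) {}
fn start_instance_on_another_sled(_instance: u64) {}

fn main() {
    let vmm = VanishedVmm { instance_id: 1, vmm_id: 2, policy: FaultPolicy::RestartElsewhere };
    handle_vanished_vmm(&vmm);
}
```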

davepacheco added the virtualization (Propolis Integration & VM Management) label Jan 23, 2024
@gjcolombo (Contributor) commented:

#4226 is also tangentially related. Today Nexus uses the "Failed" instance state to mean, more or less, "something went wrong during some operation on this instance," and the only recovery action we support is deleting the instance (see #2825). #4226 proposes distinguishing between different values of "something" so that we can recover appropriately: transient error conditions might resolve on retry, but if a sled reports that a VMM is definitely gone then the error is permanent.

#4226 could possibly be resolved without handling cases like "the operator says this sled is gone, so all the living VMMs there should immediately be marked as gone," so these issues aren't duplicates. But I think they're all dancing around a single concept: Nexus should have an idea of a VMM's "health," which I think could start out along these lines:

  • this VMM was healthy the last time we checked on it
  • we weren't able to ascertain whether the VMM was healthy when we last checked on it (e.g. maybe it's running fine and its sled agent is borked?)
  • this VMM is definitely dead (sled agent returned an empty VMM zone list, the operator said the sled was pulled, etc.)

Then there should be some code path that periodically deals with "definitely-dead" VMMs that are still pointed to by their Instances, as discussed above, and the "rudely remove a sled" API would just need to mark the target VMMs as "definitely stopped, did not gracefully shut down."
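
A minimal Rust sketch of the three health states proposed above (the type and variant names are illustrative, not actual Nexus definitions):

```rust
/// Illustrative sketch only of the VMM "health" idea above; not an actual
/// Nexus type.
#[derive(Debug, Clone, PartialEq, Eq)]
enum VmmHealth {
    /// This VMM was healthy the last time we checked on it.
    Healthy,
    /// We couldn't tell whether the VMM was healthy on the last check
    /// (e.g. it may be running fine while its sled agent is borked).
    Unknown,
    /// This VMM is definitely dead: sled agent returned an empty VMM zone
    /// list, the operator said the sled was pulled, etc.
    DefinitelyDead { gracefully_shut_down: bool },
}

fn main() {
    // The "rudely remove a sled" API would mark the target VMMs as
    // "definitely stopped, did not gracefully shut down".
    let rudely_removed = VmmHealth::DefinitelyDead { gracefully_shut_down: false };
    println!("{rudely_removed:?}");
}
```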

askfongjojo added the known issue (To include in customer documentation and training) label Mar 9, 2024
askfongjojo added this to the 8 milestone Mar 9, 2024
askfongjojo modified the milestone: 8 → 9 Apr 24, 2024
morlandi7 modified the milestone: 9 → 10 Jul 1, 2024
morlandi7 modified the milestone: 10 → 11 Aug 13, 2024
gjcolombo added a commit that referenced this issue Aug 27, 2024
Change sled agent's instance lookup tables so that Propolis jobs are
indexed by Propolis/VMM IDs instead of instance IDs.
This is a prerequisite to revisiting how the Failed instance state
works. See RFD 486 section 6.1 for all the details of why this is
needed, but very broadly: when an instance's VMM is Destroyed, we'd like
sled agent to tell Nexus that *before* the agent deregisters the
instance from the sled, for reasons described in the RFD; but if we do
that with no other changes, there's a race where Nexus may try to
restart the instance on the same sled before sled agent can update its
instance table, causing instance start to fail.

To achieve this:

- In sled agent, change the `InstanceManagerRunner`'s instance map to a
  `BTreeMap<PropolisUuid, Instance>`, then clean up all the compilation
  errors.
- In Nexus:
  - Make callers of instance APIs furnish a Propolis ID instead of an
    instance ID. This is generally very straightforward because they already
    had to get a VMM record to figure out what sled to talk to.
  - Change `cpapi_instances_put` to take a Propolis ID instead of an
    instance ID. Regular sled agent still has both IDs, but with these
    changes, simulated sled agents only have a Propolis ID to work with, and
    plumbing an instance ID down to them requires significant code changes.
- Update test code:
  - Unify the Nexus helper routines that let integration tests get sled
    agent clients or sled IDs; now they get a single struct containing both
    of those and the instance's Propolis IDs.
  - Update users of the simulated agent's `poke` endpoints to use Propolis
    IDs.
  - Delete the "detach disks on instance stop" bits of simulated sled
    agent. These don't appear to be load-bearing, they don't correspond to
    any behavior in the actual sled agent (which doesn't manage disk
    attachment or detachment), and it was a pain to rework them to work with
    Propolis IDs.

Tests: cargo nextest.

Related: #4226 and #4872, among others.
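
To illustrate the re-keying described in this commit message, a minimal sketch (the real `InstanceManagerRunner` and omicron's typed UUIDs look quite different; the bare integer aliases below are stand-ins):

```rust
use std::collections::BTreeMap;

// Stand-ins for omicron's typed UUIDs; the real code uses `PropolisUuid` and
// `InstanceUuid` newtypes rather than bare integers.
type PropolisUuid = u64;
type InstanceUuid = u64;

// Before this change, Propolis jobs were keyed by instance ID; afterwards they
// are keyed by the Propolis/VMM ID, so a stale entry for an old VMM can't
// collide with a newly created VMM for the same instance.
struct Instance {
    instance_id: InstanceUuid,
}

struct InstanceManagerRunner {
    jobs: BTreeMap<PropolisUuid, Instance>,
}

impl InstanceManagerRunner {
    fn lookup(&self, propolis_id: PropolisUuid) -> Option<&Instance> {
        self.jobs.get(&propolis_id)
    }
}

fn main() {
    let mut runner = InstanceManagerRunner { jobs: BTreeMap::new() };
    runner.jobs.insert(100, Instance { instance_id: 1 });
    assert_eq!(runner.lookup(100).map(|i| i.instance_id), Some(1));
}
```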

hawkw self-assigned this Sep 2, 2024
@hawkw (Member) commented Sep 2, 2024:

PR #6455 will move instances to InstanceState::Failed when the sled-agent believes that the instance's active VMM is no longer present. This happens when a sled reboots (as well as if the instance's VMM crashes). That PR changes the semantics of the Failed state so that Failed instances are now eligible to be restarted or deleted, and ensures that the sled resources and virtual provisioning resources allocated to such instances are cleaned up by an instance-update saga before the instance is transitioned to Failed.

Transitions to Failed are triggered by both API requests that attempt to change the state of an instance that's no longer there, and the instance-watcher background task added in #5611 when a sled-agent definitively indicates that it no longer knows about the instance's VMM. So, that checks off

  • we probably want some periodic task that asks Sled Agent what instances it's running and compares that to what we think it should be running

We don't currently transition instances to Failed in the face of communication errors, as described in

  • we may want a periodic task that monitors sleds and, if we find that they seem to be unexpectedly not responding, takes steps to make sure that they're offline (i.e., power cycles them via Ignition) and then applies this process for any instances on them

as that kind of health checking is a bit more complex, but we could certainly start transitioning instances to Failed once we start doing sled-level health checks.

Once #6455 is merged, PR #6499 refactors instance.boot_on_fault into an enum to allow more expressive auto-restart policies, and #6503 adds a background task for automatically restarting Failed instances if the auto-restart policy allows it. By default, instances are not currently allowed to be automatically restarted; we'll probably want to wire up a way for users to specify this policy when creating instances.
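
A minimal sketch of the policy shape being described, assuming hypothetical variant names for the enum introduced by #6499 and a simplified version of the check performed by the #6503 background task:

```rust
/// Illustrative sketch only: #6499 turns `boot_on_fault` from a boolean into
/// an enum of auto-restart policies. The variant names here are assumptions,
/// not the actual Nexus definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AutoRestartPolicy {
    /// Never restart the instance automatically (the current default).
    Never,
    /// Restart the instance whenever its active VMM has failed.
    BestEffort,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Running,
    Failed,
    // ...other states elided
}

/// Roughly the decision the background task added in #6503 makes per instance.
fn should_auto_restart(state: InstanceState, policy: AutoRestartPolicy) -> bool {
    state == InstanceState::Failed && policy == AutoRestartPolicy::BestEffort
}

fn main() {
    assert!(!should_auto_restart(InstanceState::Failed, AutoRestartPolicy::Never));
    assert!(should_auto_restart(InstanceState::Failed, AutoRestartPolicy::BestEffort));
    assert!(!should_auto_restart(InstanceState::Running, AutoRestartPolicy::BestEffort));
}
```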

hawkw added a commit that referenced this issue Sep 2, 2024
[RFD 486] describes a principled design for how to use the Failed state
for instances and VMMs. In particular, it proposes that:

- Instances and VMMs should move to the `Failed` state independently.
- A VMM goes to `Failed` if either:
  + Nexus observes that a VMM it thought was resident on a sled is no
    longer resident there, or,
  + it observes through some other means (sled expungement, sled reboot)
    that all VMMs that were resident on that sled must now be gone.
- Instances go to `Failed` when an `instance-update` saga observes that
  an instance's active VMM has been marked `Failed`. An update saga
  transitioning an instance to failed cleans up the instance's resource
  allocations, similarly to how update sagas handle transitioning an
  instance to `Destroyed` when its active VMM has been destroyed.
- For purposes of the external API:
  + An instance that is itself in the `Failed` state appears externally
    to be `Failed`.
  + If an instance is in the "has-VMM" internal state and points to a
    `Failed` VMM, the instance is `Stopping` and then transitions to
    `Failed`.
  + Once the actual instance (not just its active VMM) is `Failed`, it
    can be started or destroyed.

This branch implements the behavior described above.

In particular, I've added code to the `instance-update` saga to handle
instances whose active or migration target VMMs have transitioned to
`Failed`, similarly to how it handles the VMMs transitioning to
`Destroyed`. I've changed the code that detects whether a sled-agent
error indicates that an instance has failed to only do so when the error
is a 404 with the specific error code that indicates that the sled-agent
has forgotten about a VMM that it previously knew about, and changed the
Nexus functions that ask sled-agents to update an instance's state, and
the `instance_watcher` background task, to mark instances as `Failed`
only when they encounter the appropriate sled-agent error. And, I've
changed the `mark_instance_failed` function in Nexus to instead mark the
VMM record as failed and trigger an instance-update saga (and it's now
named `mark_vmm_failed`), and made similar changes to the
`instance_watcher` background task. Finally, I've also made sure that
`Failed` instances can be destroyed and restarted, and that instances
with `Failed` active VMMs appear to be `Stopping` until the VMM is
cleaned up by an update saga.

In addition to this, it was necessary to (re)introduce a
`VmmState::Creating` variant. The purpose of this variant is to
differentiate between VMMs which have been created in the database but
do not yet exist on a sled, from those which are known to sled-agent and
are actually starting/migrating.

This is necessary because presently, the `instance_watcher` background
task will attempt to perform health checks for `Starting` and
`Migrating` VMMs, and --- if the sled-agent does not actually know about
the VMM yet --- will move it to `Failed`. But, when a VMM is newly
created, the sled-agent won't know about it yet until it has been
registered with the sled-agent, so if the `instance_watcher` task
activates while an `instance-start` or `instance-migrate` saga is in
flight, the `instance-watcher` task may incorrectly move its VMM to
`Failed` if the VMM record has been created but not yet registered. Now,
after adding the `Creating` variant, we can avoid performing health
checks for those VMMs until the sled-agent actually acknowledges a
request to register the VMM.

Besides the behavior implemented here, RFD 486 also proposes that the
`boot_on_fault` field in the instance record be used by Nexus to
determine whether to automatically attempt to restart a `Failed`
instance. This is described in issues #6491 and #4872. I'm planning
to implement this in a separate branch, PR #6503.

[RFD 486]: https://rfd.shared.oxide.computer/rfd/0486#_determinations
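
A minimal sketch of the external API mapping described in the commit message above, using simplified stand-in enums rather than the real omicron state types:

```rust
// Simplified stand-ins for the real omicron state enums; this only captures
// the mapping rules quoted above.
#[derive(Debug, Clone, Copy)]
enum VmmState {
    Starting,
    Running,
    Failed,
}

#[derive(Debug, Clone, Copy)]
enum InstanceRecordState {
    /// The instance record itself has been marked Failed.
    Failed,
    /// The instance is in the "has-VMM" internal state and points at an
    /// active VMM in the given state.
    HasVmm(VmmState),
    /// The instance has no active VMM.
    NoVmm,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExternalInstanceState {
    Stopped,
    Starting,
    Running,
    Stopping,
    Failed,
}

fn external_state(state: InstanceRecordState) -> ExternalInstanceState {
    match state {
        // An instance that is itself Failed appears externally as Failed
        // (and can then be started or destroyed).
        InstanceRecordState::Failed => ExternalInstanceState::Failed,
        // An instance pointing at a Failed VMM appears as Stopping until an
        // update saga cleans the VMM up and marks the instance Failed.
        InstanceRecordState::HasVmm(VmmState::Failed) => ExternalInstanceState::Stopping,
        InstanceRecordState::HasVmm(VmmState::Starting) => ExternalInstanceState::Starting,
        InstanceRecordState::HasVmm(VmmState::Running) => ExternalInstanceState::Running,
        InstanceRecordState::NoVmm => ExternalInstanceState::Stopped,
    }
}

fn main() {
    assert_eq!(
        external_state(InstanceRecordState::HasVmm(VmmState::Failed)),
        ExternalInstanceState::Stopping
    );
}
```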

@davepacheco (Collaborator, Author) commented:

@hawkw That all makes sense. So in terms of the main ask in this ticket: when Reconfigurator expunges a sled and wants to clean up all those instances, what do we do? Would it be enough to activate the instance-watcher task? Would that task correctly handle instances whose database state shows a vmm on a sled that is now expunged?

@hawkw (Member) commented Sep 3, 2024:

> @hawkw That all makes sense. So in terms of the main ask in this ticket: when Reconfigurator expunges a sled and wants to clean up all those instances, what do we do? Would it be enough to activate the instance-watcher task? Would that task correctly handle instances whose database state shows a vmm on a sled that is now expunged?

The instance-watcher task does not presently handle that case. At present, it only marks instances as Failed when the sled-agent for that instance responds with an error indicating that it definitively does not believe it's supposed to have that instance (which is what would happen if the sled rebooted, or if the Propolis went away).

Right now, the query that the instance-watcher task uses to find sleds to health-check selects only InService sleds:

`.sled_filter(SledFilter::InService)`

However, we could pretty easily remove that, and we could change the task to check the state of the sled before starting a health check; if the sled has been expunged, just go ahead and mark all its VMMs as Failed instead of actually sending an HTTP request to the sled-agent. I'd be happy to go do that!
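
A minimal sketch of that proposed change, with hypothetical names standing in for the real instance-watcher types and signatures:

```rust
// Hypothetical sketch of the proposed change; these are not the actual
// instance_watcher types or signatures in Nexus.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SledPolicy {
    InService,
    Expunged,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckPlan {
    /// The sled is gone: mark every VMM we thought was resident there as
    /// Failed without sending any HTTP requests to its sled-agent.
    MarkAllVmmsFailed,
    /// The sled is in service: do the normal health check against its
    /// sled-agent.
    HealthCheckSledAgent,
}

fn plan_check(policy: SledPolicy) -> CheckPlan {
    match policy {
        SledPolicy::Expunged => CheckPlan::MarkAllVmmsFailed,
        SledPolicy::InService => CheckPlan::HealthCheckSledAgent,
    }
}

fn main() {
    assert_eq!(plan_check(SledPolicy::Expunged), CheckPlan::MarkAllVmmsFailed);
    assert_eq!(plan_check(SledPolicy::InService), CheckPlan::HealthCheckSledAgent);
}
```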

@davepacheco (Collaborator, Author) commented:

Whether or not we consider this done, I think the pieces we were tracking for add/remove sled are done (#6519, #6503) so I'm removing this from the Reconfigurator board.

@hawkw (Member) commented Sep 24, 2024:

I think that this is done, modulo:

  • we may want a periodic task that monitors sleds and, if we find that they seem to be unexpectedly not responding, takes steps to make sure that they're offline (i.e., power cycles them via Ignition) and then applies this process for any instances on them

But, that seems like a substantial enough chunk of work that it might be better considered its own project (and perhaps deserves an RFD?). As far as explicit expungement goes, I'm for marking this one as done and splitting potential automated reboots of disappeared sleds into its own thing.

@hawkw (Member) commented Sep 26, 2024:

I'm gonna close this out; I think that detecting unresponsive sleds will be part of the ongoing FMA/health-checking work.

hawkw closed this as completed Sep 26, 2024