
need a way to trigger cleanup and next steps for vanished instances #4872

Closed · davepacheco opened this issue Jan 23, 2024 · 7 comments
Labels: known issue (To include in customer documentation and training), virtualization (Propolis Integration & VM Management)

@davepacheco (Collaborator) commented Jan 23, 2024:

There are various cases where the control plane might discover that Instances that we thought were running are, in fact, not. Examples (from concrete to speculative):

  • when Nexus is notified by Sled Agent (in turn notified from Propolis) that a guest is no longer running (because it was stopped from inside the guest)
  • when a sled reboots, its instances are all gone and this process needs to be applied to each of them (this is #3633, "restart customer Instances after sled reboot")
  • when we implement support for (ungraceful) sled removal, we'll need to apply this to all of the instances that we thought were on the system that's been removed
  • we probably want some periodic task that asks Sled Agent what instances it's running and compares that to what we think it should be running
  • we may want a periodic task that monitors sleds and, if we find that they seem to be unexpectedly not responding, takes steps to make sure that they're offline (i.e., power cycles them via Ignition) and then applies this process for any instances on them

Relaying some notes from @gjcolombo (my apologies where I've got these details wrong):

  • a code path exists today that handles the first case above (where Nexus handles a message from Sled Agent saying the instance is no longer running)
  • this code path has some issues and is not well factored for use in other contexts like the ones we need here
  • all of this would be reworked by #3742 (Tracking: Instance Lifecycle Overhaul)
  • the "cleanup" referred to here includes: foreign key from instance to vmm table needs to be reaped; networking state needs to be cleaned up; provisioning counters and sled resources need to be cleaned up; etc.
  • the "next steps" referred to here include: looking at some per-instance "fault discipline" (currently the "boot_on_fault" boolean) to determine if we should start the instance running again somewhere else and, if so, doing that

We may view this issue as a dup of #3742 but I wanted to file this separate ticket to reflect more precisely what's needed for sled removal. The implementation may well just be #3742.
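
For concreteness, here is a rough Rust sketch of the "cleanup and next steps" flow being asked for. Every type and function name below is hypothetical and stands in for real database updates and sagas; none of it is actual Nexus code.

```rust
// Hypothetical sketch of the "cleanup and next steps" flow for a VMM that the
// control plane has decided is gone. None of these names are real Nexus code.

#[derive(Debug, Clone, Copy)]
enum FaultPolicy {
    /// Leave the instance stopped/failed for the user to deal with.
    DoNothing,
    /// Try to start the instance again on another sled.
    RestartElsewhere,
}

struct VanishedVmm {
    instance_id: u64,
    vmm_id: u64,
    policy: FaultPolicy,
}

fn handle_vanished_vmm(vmm: &VanishedVmm) {
    // "Cleanup": unlink the instance from the dead VMM record, tear down
    // networking state, and release provisioning counters / sled resources.
    unlink_instance_from_vmm(vmm.instance_id, vmm.vmm_id);
    clean_up_networking_state(vmm.instance_id);
    release_provisioning_and_sled_resources(vmm.instance_id);

    // "Next steps": consult the per-instance fault discipline (today the
    // `boot_on_fault` boolean) to decide whether to restart it elsewhere.
    match vmm.policy {
        FaultPolicy::DoNothing => {}
        FaultPolicy::RestartElsewhere => start_instance_on_another_sled(vmm.instance_id),
    }
}

// Stubs standing in for the real database updates and sagas.
fn unlink_instance_from_vmm(_instance: u64, _vmm: u64) {}
fn clean_up_networking_state(_instance: u64) {}
fn release_provisioning_and_sled_resources(_instance: u64) {}
fn start_instance_on_another_sled(_instance: u64) {}

fn main() {
    let vmm = VanishedVmm { instance_id: 1, vmm_id: 2, policy: FaultPolicy::RestartElsewhere };
    handle_vanished_vmm(&vmm);
}
```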

davepacheco added the virtualization (Propolis Integration & VM Management) label Jan 23, 2024
@gjcolombo (Contributor) commented:

#4226 is also tangentially related. Today Nexus uses the "Failed" instance state to mean, more or less, "something went wrong during some operation on this instance," and the only recovery action we support is deleting the instance (see #2825). #4226 proposes distinguishing between different values of "something" so that we can recover appropriately: transient error conditions might resolve on retry, but if a sled reports that a VMM is definitely gone then the error is permanent.

#4226 could possibly be resolved without handling cases like "the operator says this sled is gone, so all the living VMMs there should immediately be marked as gone," so these issues aren't duplicates. But I think they're all dancing around a single concept: Nexus should have an idea of a VMM's "health," which I think could start out along these lines:

  • this VMM was healthy the last time we checked on it
  • we weren't able to ascertain whether the VMM was healthy when we last checked on it (e.g. maybe it's running fine and its sled agent is borked?)
  • this VMM is definitely dead (sled agent returned an empty VMM zone list, the operator said the sled was pulled, etc.)

Then there should be some code path that periodically deals with "definitely-dead" VMMs that are still pointed to by their Instances, as discussed above, and the "rudely remove a sled" API would just need to mark the target VMMs as "definitely stopped, did not gracefully shut down."
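
A minimal Rust sketch of the three health states proposed above (the type and variant names are illustrative, not actual Nexus definitions):

```rust
/// Illustrative sketch only of the VMM "health" idea above; not an actual
/// Nexus type.
#[derive(Debug, Clone, PartialEq, Eq)]
enum VmmHealth {
    /// This VMM was healthy the last time we checked on it.
    Healthy,
    /// We couldn't tell whether the VMM was healthy on the last check
    /// (e.g. it may be running fine while its sled agent is borked).
    Unknown,
    /// This VMM is definitely dead: sled agent returned an empty VMM zone
    /// list, the operator said the sled was pulled, etc.
    DefinitelyDead { gracefully_shut_down: bool },
}

fn main() {
    // The "rudely remove a sled" API would mark the target VMMs as
    // "definitely stopped, did not gracefully shut down".
    let rudely_removed = VmmHealth::DefinitelyDead { gracefully_shut_down: false };
    println!("{rudely_removed:?}");
}
```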

askfongjojo added the known issue (To include in customer documentation and training) label Mar 9, 2024
askfongjojo added this to the 8 milestone Mar 9, 2024
askfongjojo modified the milestone: 8 → 9 Apr 24, 2024
morlandi7 modified the milestone: 9 → 10 Jul 1, 2024
morlandi7 modified the milestone: 10 → 11 Aug 13, 2024
gjcolombo added a commit that referenced this issue Aug 27, 2024
Change sled agent's instance lookup tables so that Propolis jobs are
indexed by Propolis/VMM IDs instead of instance IDs.
This is a prerequisite to revisiting how the Failed instance state
works. See RFD 486 section 6.1 for all the details of why this is
needed, but very broadly: when an instance's VMM is Destroyed, we'd like
sled agent to tell Nexus that *before* the agent deregisters the
instance from the sled, for reasons described in the RFD; but if we do
that with no other changes, there's a race where Nexus may try to
restart the instance on the same sled before sled agent can update its
instance table, causing instance start to fail.

To achieve this:

- In sled agent, change the `InstanceManagerRunner`'s instance map to a
  `BTreeMap<PropolisUuid, Instance>`, then clean up all the compilation
  errors.
- In Nexus:
  - Make callers of instance APIs furnish a Propolis ID instead of an
    instance ID. This is generally very straightforward because they already
    had to get a VMM record to figure out what sled to talk to.
  - Change `cpapi_instances_put` to take a Propolis ID instead of an
    instance ID. Regular sled agent still has both IDs, but with these
    changes, simulated sled agents only have a Propolis ID to work with, and
    plumbing an instance ID down to them requires significant code changes.
- Update test code:
  - Unify the Nexus helper routines that let integration tests get sled
    agent clients or sled IDs; now they get a single struct containing both
    of those and the instance's Propolis IDs.
  - Update users of the simulated agent's `poke` endpoints to use Propolis
    IDs.
  - Delete the "detach disks on instance stop" bits of simulated sled
    agent. These don't appear to be load-bearing, they don't correspond to
    any behavior in the actual sled agent (which doesn't manage disk
    attachment or detachment), and it was a pain to rework them to work with
    Propolis IDs.

Tests: cargo nextest.

Related: #4226 and #4872, among others.
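
To illustrate the re-keying described in this commit message, a minimal sketch (the real `InstanceManagerRunner` and omicron's typed UUIDs look quite different; the bare integer aliases below are stand-ins):

```rust
use std::collections::BTreeMap;

// Stand-ins for omicron's typed UUIDs; the real code uses `PropolisUuid` and
// `InstanceUuid` newtypes rather than bare integers.
type PropolisUuid = u64;
type InstanceUuid = u64;

// Before this change, Propolis jobs were keyed by instance ID; afterwards they
// are keyed by the Propolis/VMM ID, so a stale entry for an old VMM can't
// collide with a newly created VMM for the same instance.
struct Instance {
    instance_id: InstanceUuid,
}

struct InstanceManagerRunner {
    jobs: BTreeMap<PropolisUuid, Instance>,
}

impl InstanceManagerRunner {
    fn lookup(&self, propolis_id: PropolisUuid) -> Option<&Instance> {
        self.jobs.get(&propolis_id)
    }
}

fn main() {
    let mut runner = InstanceManagerRunner { jobs: BTreeMap::new() };
    runner.jobs.insert(100, Instance { instance_id: 1 });
    assert_eq!(runner.lookup(100).map(|i| i.instance_id), Some(1));
}
```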

hawkw self-assigned this Sep 2, 2024
@hawkw (Member) commented Sep 2, 2024:

PR #6455 will move instances to InstanceState::Failed when the sled-agent believes that the instance's active VMM is no longer present. This happens when a sled reboots (as well as if the instance's VMM crashes). That PR changes the semantics of the Failed state so that Failed instances are now eligible to be restarted or deleted, and ensures that the sled resources and virtual provisioning resources allocated to such instances are cleaned up by an instance-update saga before the instance is transitioned to Failed.

Transitions to Failed are triggered by both API requests that attempt to change the state of an instance that's no longer there, and the instance-watcher background task added in #5611 when a sled-agent definitively indicates that it no longer knows about the instance's VMM. So, that checks off

  • we probably want some periodic task that asks Sled Agent what instances it's running and compares that to what we think it should be running

We don't currently transition instances to Failed in the face of communication errors, as described in

  • we may want a periodic task that monitors sleds and, if we find that they seem to be unexpectedly not responding, takes steps to make sure that they're offline (i.e., power cycles them via Ignition) and then applies this process for any instances on them

as that kind of health checking is a bit more complex, but we could certainly start transitioning instances to Failed once we start doing sled-level health checks.

Once #6455 is merged, PR #6499 refactors instance.boot_on_fault into an enum to allow more expressive auto-restart policies, and #6503 adds a background task for automatically restarting Failed instances if the auto-restart policy allows it. By default, instances are not currently allowed to be automatically restarted; we'll probably want to wire up a way for users to specify this policy when creating instances.
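
A minimal sketch of the policy shape being described, assuming hypothetical variant names for the enum introduced by #6499 and a simplified version of the check performed by the #6503 background task:

```rust
/// Illustrative sketch only: #6499 turns `boot_on_fault` from a boolean into
/// an enum of auto-restart policies. The variant names here are assumptions,
/// not the actual Nexus definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AutoRestartPolicy {
    /// Never restart the instance automatically (the current default).
    Never,
    /// Restart the instance whenever its active VMM has failed.
    BestEffort,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Running,
    Failed,
    // ...other states elided
}

/// Roughly the decision the background task added in #6503 makes per instance.
fn should_auto_restart(state: InstanceState, policy: AutoRestartPolicy) -> bool {
    state == InstanceState::Failed && policy == AutoRestartPolicy::BestEffort
}

fn main() {
    assert!(!should_auto_restart(InstanceState::Failed, AutoRestartPolicy::Never));
    assert!(should_auto_restart(InstanceState::Failed, AutoRestartPolicy::BestEffort));
    assert!(!should_auto_restart(InstanceState::Running, AutoRestartPolicy::BestEffort));
}
```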

hawkw added a commit that referenced this issue Sep 2, 2024
[RFD 486] describes a principled design for how to use the Failed state
for instances and VMMs. In particular, it proposes that:

- Instances and VMMs should move to the `Failed` state independently.
- A VMM goes to `Failed` if either:
  + Nexus observes that a VMM it thought was resident on a sled is no
    longer resident there, or,
  + it observes through some other means (sled expungement, sled reboot)
    that all VMMs that were resident on that sled must now be gone.
- Instances go to `Failed` when an `instance-update` saga observes that
  an instance's active VMM has been marked `Failed`. An update saga
  transitioning an instance to failed cleans up the instance's resource
  allocations, similarly to how update sagas handle transitioning an
  instance to `Destroyed` when its active VMM has been destroyed.
- For purposes of the external API:
  + An instance that is itself in the `Failed` state appears externally
    to be `Failed`.
  + If an instance is in the "has-VMM" internal state and points to a
    `Failed` VMM, the instance is `Stopping` and then transitions to
    `Failed`.
  + Once the actual instance (not just its active VMM) is `Failed`, it
    can be started or destroyed.

This branch implements the behavior described above.

In particular, I've added code to the `instance-update` saga to handle
instances whose active or migration target VMMs have transitioned to
`Failed`, similarly to how it handles the VMMs transitioning to
`Destroyed`. I've changed the code that detects whether a sled-agent
error indicates that an instance has failed to only do so when the error
is a 404 with the specific error code that indicates that the sled-agent
has forgotten about a VMM that it previously knew about, and changed the
Nexus functions that ask sled-agents to update an instance's state, and
the `instance_watcher` background task, to mark instances as `Failed`
only when they encounter the appropriate sled-agent error. And, I've
changed the `mark_instance_failed` function in Nexus to instead mark the
VMM record as failed and trigger an instance-update saga (and it's now
named `mark_vmm_failed`), and made similar changes to the
`instance_watcher` background task. Finally, I've also made sure that
`Failed` instances can be destroyed and restarted, and that instances
with `Failed` active VMMs appear to be `Stopping` until the VMM is
cleaned up by an update saga.

In addition to this, it was necessary to (re)introduce a
`VmmState::Creating` variant. The purpose of this variant is to
differentiate between VMMs which have been created in the database but
do not yet exist on a sled, from those which are known to sled-agent and
are actually starting/migrating.

This is necessary because presently, the `instance_watcher` background
task will attempt to perform health checks for `Starting` and
`Migrating` VMMs, and --- if the sled-agent does not actually know about
the VMM yet --- will move it to `Failed`. But, when a VMM is newly
created, the sled-agent won't know about it yet until it has been
registered with the sled-agent, so if the `instance_watcher` task
activates while an `instance-start` or `instance-migrate` saga is in
flight, the `instance-watcher` task may incorrectly move its VMM to
`Failed` if the VMM record has been created but not yet registered. Now,
after adding the `Creating` variant, we can avoid performing health
checks for those VMMs until the sled-agent actually acknowledges a
request to register the VMM.

Besides the behavior implemented here, RFD 486 also proposes that the
`boot_on_fault` field in the instance record be used by Nexus to
determine whether to automatically attempt to restart a `Failed`
instance. This is described in issues #6491 and #4872. I'm planning
to implement this in a separate branch, PR #6503.

[RFD 486]: https://rfd.shared.oxide.computer/rfd/0486#_determinations
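
A minimal sketch of the external API mapping described in the commit message above, using simplified stand-in enums rather than the real omicron state types:

```rust
// Simplified stand-ins for the real omicron state enums; this only captures
// the mapping rules quoted above.
#[derive(Debug, Clone, Copy)]
enum VmmState {
    Starting,
    Running,
    Failed,
}

#[derive(Debug, Clone, Copy)]
enum InstanceRecordState {
    /// The instance record itself has been marked Failed.
    Failed,
    /// The instance is in the "has-VMM" internal state and points at an
    /// active VMM in the given state.
    HasVmm(VmmState),
    /// The instance has no active VMM.
    NoVmm,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExternalInstanceState {
    Stopped,
    Starting,
    Running,
    Stopping,
    Failed,
}

fn external_state(state: InstanceRecordState) -> ExternalInstanceState {
    match state {
        // An instance that is itself Failed appears externally as Failed
        // (and can then be started or destroyed).
        InstanceRecordState::Failed => ExternalInstanceState::Failed,
        // An instance pointing at a Failed VMM appears as Stopping until an
        // update saga cleans the VMM up and marks the instance Failed.
        InstanceRecordState::HasVmm(VmmState::Failed) => ExternalInstanceState::Stopping,
        InstanceRecordState::HasVmm(VmmState::Starting) => ExternalInstanceState::Starting,
        InstanceRecordState::HasVmm(VmmState::Running) => ExternalInstanceState::Running,
        InstanceRecordState::NoVmm => ExternalInstanceState::Stopped,
    }
}

fn main() {
    assert_eq!(
        external_state(InstanceRecordState::HasVmm(VmmState::Failed)),
        ExternalInstanceState::Stopping
    );
}
```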

@davepacheco (Collaborator, Author) commented:

@hawkw That all makes sense. So in terms of the main ask in this ticket: when Reconfigurator expunges a sled and wants to clean up all those instances, what do we do? Would it be enough to activate the instance-watcher task? Would that task correctly handle instances whose database state shows a vmm on a sled that is now expunged?

@hawkw (Member) commented Sep 3, 2024:

> @hawkw That all makes sense. So in terms of the main ask in this ticket: when Reconfigurator expunges a sled and wants to clean up all those instances, what do we do? Would it be enough to activate the instance-watcher task? Would that task correctly handle instances whose database state shows a vmm on a sled that is now expunged?

The instance-watcher task does not presently handle that case. At present, it only marks instances as Failed when the sled-agent for that instance responds with an error indicating that it definitively does not believe it's supposed to have that instance (which is what would happen if the sled rebooted, or if the Propolis went away).

Right now, the query that the instance-watcher task uses to find sleds to health-check selects only InService sleds:

`.sled_filter(SledFilter::InService)`

However, we could pretty easily remove that, and we could change the task to check the state of the sled before starting a health check; if the sled has been expunged, just go ahead and mark all its VMMs as Failed instead of actually sending an HTTP request to the sled-agent. I'd be happy to go do that!
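
A minimal sketch of that proposed change, with hypothetical names standing in for the real instance-watcher types and signatures:

```rust
// Hypothetical sketch of the proposed change; these are not the actual
// instance_watcher types or signatures in Nexus.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SledPolicy {
    InService,
    Expunged,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckPlan {
    /// The sled is gone: mark every VMM we thought was resident there as
    /// Failed without sending any HTTP requests to its sled-agent.
    MarkAllVmmsFailed,
    /// The sled is in service: do the normal health check against its
    /// sled-agent.
    HealthCheckSledAgent,
}

fn plan_check(policy: SledPolicy) -> CheckPlan {
    match policy {
        SledPolicy::Expunged => CheckPlan::MarkAllVmmsFailed,
        SledPolicy::InService => CheckPlan::HealthCheckSledAgent,
    }
}

fn main() {
    assert_eq!(plan_check(SledPolicy::Expunged), CheckPlan::MarkAllVmmsFailed);
    assert_eq!(plan_check(SledPolicy::InService), CheckPlan::HealthCheckSledAgent);
}
```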

@davepacheco (Collaborator, Author) commented:

Whether or not we consider this done, I think the pieces we were tracking for add/remove sled are done (#6519, #6503) so I'm removing this from the Reconfigurator board.

@hawkw (Member) commented Sep 24, 2024:

I think that this is done, modulo:

  • we may want a periodic task that monitors sleds and, if we find that they seem to be unexpectedly not responding, takes steps to make sure that they're offline (i.e., power cycles them via Ignition) and then applies this process for any instances on them

But, that seems like a substantial enough chunk of work that it might be better considered its own project (and perhaps deserves an RFD?). As far as explicit expungement goes, I'm for marking this one as done and splitting potential automated reboots of disappeared sleds into its own thing.

@hawkw (Member) commented Sep 26, 2024:

I'm gonna close this out; I think that detecting unresponsive sleds will be part of the ongoing FMA/health-checking work.

hawkw closed this as completed Sep 26, 2024