need a way to trigger cleanup and next steps for vanished instances #4872
#4226 is also tangentially related. Today Nexus uses the "Failed" instance state to mean, more or less, "something went wrong during some operation on this instance," and the only recovery action we support is deleting the instance (see #2825). #4226 proposes distinguishing between different values of "something" so that we can recover appropriately: transient error conditions might resolve on retry, but if a sled reports that a VMM is definitely gone, then the error is permanent. #4226 could possibly be resolved without handling cases like "the operator says this sled is gone, so all the living VMMs there should immediately be marked as gone," so these issues aren't duplicates. But I think they're all dancing around a single concept: Nexus should have an idea of a VMM's "health," which I think could start out along these lines:
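As a purely hypothetical first cut (Nexus has no such type today; all names here are made up for illustration), the classification might look something like:

```rust
/// Hypothetical sketch only: Nexus has no such type today. A first
/// cut at classifying a VMM's "health," following the distinctions
/// drawn above between transient and permanent failures.
enum VmmHealth {
    /// The sled-agent is reachable and reports the VMM as present.
    Healthy,
    /// An operation failed for a possibly-transient reason (e.g. a
    /// timeout); retrying might succeed, so don't give up yet.
    MaybeUnhealthy,
    /// A sled reported that the VMM is definitely gone, or the
    /// operator said the sled is gone; the failure is permanent.
    DefinitelyDead,
}
```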
Then there should be some code path that periodically deals with "definitely-dead" VMMs that are still pointed to by their Instances, as discussed above, and the "rudely remove a sled" API would just need to mark the target VMMs as "definitely stopped, did not gracefully shut down."
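Continuing the hypothetical sketch above, that periodic pass might look roughly like this (again, `Vmm` and the helper are stand-ins, not real Nexus APIs):

```rust
// Continuing the hypothetical sketch above; `Vmm` and this helper are
// illustrative stand-ins, not real Nexus types.
struct Vmm {
    id: uuid::Uuid,
    health: VmmHealth,
}

/// Periodic pass: among VMMs still pointed to by their instances,
/// find the definitely-dead ones that need cleanup. A real
/// implementation would release their resource allocations and unlink
/// them so the instances can be restarted or deleted.
fn vmms_needing_cleanup(referenced_vmms: &[Vmm]) -> Vec<uuid::Uuid> {
    referenced_vmms
        .iter()
        .filter(|vmm| matches!(vmm.health, VmmHealth::DefinitelyDead))
        .map(|vmm| vmm.id)
        .collect()
}
```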
Change sled agent's instance lookup tables so that Propolis jobs are indexed by Propolis/VMM IDs instead of instance IDs. This is a prerequisite to revisiting how the Failed instance state works. See RFD 486 section 6.1 for all the details of why this is needed, but very broadly: when an instance's VMM is Destroyed, we'd like sled agent to tell Nexus that *before* the agent deregisters the instance from the sled, for reasons described in the RFD; but if we do that with no other changes, there's a race where Nexus may try to restart the instance on the same sled before sled agent can update its instance table, causing instance start to fail. To achieve this:

- In sled agent, change the `InstanceManagerRunner`'s instance map to a `BTreeMap<PropolisUuid, Instance>`, then clean up all the compilation errors (see the sketch below).
- In Nexus:
  + Make callers of instance APIs furnish a Propolis ID instead of an instance ID. This is generally very straightforward because they already had to get a VMM record to figure out what sled to talk to.
  + Change `cpapi_instances_put` to take a Propolis ID instead of an instance ID. Regular sled agent still has both IDs, but with these changes, simulated sled agents only have a Propolis ID to work with, and plumbing an instance ID down to them requires significant code changes.
- Update test code:
  + Unify the Nexus helper routines that let integration tests get sled agent clients or sled IDs; now they get a single struct containing both of those and the instance's Propolis IDs.
  + Update users of the simulated agent's `poke` endpoints to use Propolis IDs.
  + Delete the "detach disks on instance stop" bits of the simulated sled agent. These don't appear to be load-bearing, they don't correspond to any behavior in the actual sled agent (which doesn't manage disk attachment or detachment), and it was a pain to rework them to work with Propolis IDs.

Tests: cargo nextest.

Related: #4226 and #4872, among others.
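In simplified form, the map change looks roughly like this (`Instance` is a placeholder here, and in Omicron `PropolisUuid` is a typed UUID newtype rather than a bare alias):

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for the real sled-agent types.
type PropolisUuid = uuid::Uuid;
struct Instance; // placeholder for sled-agent's per-VMM state machine

struct InstanceManagerRunner {
    // Keyed by Propolis/VMM ID rather than instance ID: if Nexus
    // restarts an instance on the same sled (new VMM, same instance
    // ID) before the old VMM is deregistered, the two jobs no longer
    // collide in this map.
    jobs: BTreeMap<PropolisUuid, Instance>,
}
```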
PR #6455 will move instances to `Failed` when a sled-agent reports that a VMM it previously knew about is gone; transitions to `Failed` then let the instance be cleaned up or restarted. We don't currently transition instances to `Failed` when a sled stops responding entirely, as that kind of health checking is a bit more complex, but we could certainly start transitioning instances to `Failed` in that case as well. Once #6455 is merged, PR #6499 refactors how the `Failed` state is handled, per RFD 486.
[RFD 486] describes a principled design for how to use the Failed state for instances and VMMs. In particular, it proposes that:

- Instances and VMMs should move to the `Failed` state independently.
- A VMM goes to `Failed` if either:
  + Nexus observes that a VMM it thought was resident on a sled is no longer resident there, or
  + it observes through some other means (sled expungement, sled reboot) that all VMMs that were resident on that sled must now be gone.
- Instances go to `Failed` when an `instance-update` saga observes that an instance's active VMM has been marked `Failed`. An update saga transitioning an instance to `Failed` cleans up the instance's resource allocations, similarly to how update sagas handle transitioning an instance to `Destroyed` when its active VMM has been destroyed.
- For purposes of the external API:
  + An instance that is itself in the `Failed` state appears externally to be `Failed`.
  + If an instance is in the "has-VMM" internal state and points to a `Failed` VMM, the instance appears to be `Stopping` and then transitions to `Failed`.
  + Once the actual instance (not just its active VMM) is `Failed`, it can be started or destroyed.

This branch implements the behavior described above. In particular, I've added code to the `instance-update` saga to handle instances whose active or migration target VMMs have transitioned to `Failed`, similarly to how it handles VMMs transitioning to `Destroyed`. I've changed the code that detects whether a sled-agent error indicates that an instance has failed so that it only does so when the error is a 404 with the specific error code indicating that the sled-agent has forgotten about a VMM it previously knew about, and I've changed the Nexus functions that ask sled-agents to update an instance's state, as well as the `instance_watcher` background task, to mark instances as `Failed` only when they encounter that error. I've also changed the `mark_instance_failed` function in Nexus to instead mark the VMM record as failed and trigger an instance-update saga (it's now named `mark_vmm_failed`), and made similar changes to the `instance_watcher` background task. Finally, I've made sure that `Failed` instances can be destroyed and restarted, and that instances with `Failed` active VMMs appear to be `Stopping` until the VMM is cleaned up by an update saga.

In addition, it was necessary to (re)introduce a `VmmState::Creating` variant. The purpose of this variant is to differentiate VMMs that have been created in the database but do not yet exist on a sled from those that are known to sled-agent and are actually starting or migrating. This matters because, presently, the `instance_watcher` background task attempts to perform health checks for `Starting` and `Migrating` VMMs, and, if the sled-agent does not actually know about the VMM, will move it to `Failed`. A newly created VMM isn't known to the sled-agent until it has been registered, so if the `instance_watcher` task activates while an `instance-start` or `instance-migrate` saga is in flight, it could incorrectly move a VMM to `Failed` that has been created in the database but not yet registered. With the `Creating` variant, we can skip health checks for those VMMs until the sled-agent actually acknowledges a request to register the VMM.
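An abridged sketch of that state distinction (the real `VmmState` enum has additional variants, such as `Running` and `Stopping`, omitted here for brevity):

```rust
// Abridged sketch of the distinction described above; the real
// VmmState enum has more variants (Running, Stopping, and so on).
enum VmmState {
    /// A database record exists, but no sled-agent has been asked to
    /// register this VMM yet, so a health check would spuriously
    /// conclude that it has vanished.
    Creating,
    Starting,
    Migrating,
    Failed,
    Destroyed,
}

/// Within this abridged variant set, only Starting and Migrating VMMs
/// are live on a sled-agent and eligible for health checks; Creating
/// VMMs are not yet registered anywhere.
fn eligible_for_health_check(state: &VmmState) -> bool {
    matches!(state, VmmState::Starting | VmmState::Migrating)
}
```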
Besides the behavior implemented here, RFD 486 also proposes that the `boot_on_fault` field in the instance record be used by Nexus to determine whether to automatically attempt to restart a `Failed` instance. This is described in issues #6491 and #4872. I'm planning to implement this in a separate branch, PR #6503.

[RFD 486]: https://rfd.shared.oxide.computer/rfd/0486#_determinations
@hawkw That all makes sense. So in terms of the main ask in this ticket: when Reconfigurator expunges a sled and wants to clean up all those instances, what do we do? Would it be enough to activate the instance-watcher task? Would that task correctly handle instances whose database state shows a VMM on a sled that is now expunged?
The instance-watcher task does not presently handle that case. At present, it only marks instances as `Failed` when a sled-agent it can reach reports that a VMM it previously knew about is gone. Right now, the query that the task uses to find VMMs to health-check skips sleds that have been expunged, so VMMs on an expunged sled are never examined at all. However, we could pretty easily remove that exclusion, and we could change the task to check the state of the sled before starting a health check: if the sled has been expunged, just go ahead and mark all its VMMs as `Failed`.
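In sketch form, that proposed change might look like the following (`SledPolicy` and `CheckPlan` are placeholders here, not the actual Nexus types):

```rust
// Hypothetical sketch of the proposed change; SledPolicy and
// CheckPlan are placeholders, not the actual Nexus types.
enum SledPolicy {
    InService,
    Expunged,
}

enum CheckPlan {
    /// Sled is still in service: probe its sled-agent as usual.
    ProbeSledAgent,
    /// Sled was expunged: skip the probe and mark every VMM that was
    /// resident there as Failed, letting update sagas clean up.
    MarkAllVmmsFailed,
}

/// Decide what the instance_watcher task should do for one sled,
/// checking the sled's state before starting any health checks.
fn plan_for_sled(policy: &SledPolicy) -> CheckPlan {
    match policy {
        SledPolicy::InService => CheckPlan::ProbeSledAgent,
        SledPolicy::Expunged => CheckPlan::MarkAllVmmsFailed,
    }
}
```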
I think that this is done, modulo automated detection and handling of sleds that have disappeared without being explicitly expunged.
But that seems like a substantial enough chunk of work that it might be better considered its own project (and perhaps deserves an RFD?). As far as explicit expungement goes, I'm for marking this one as done and moving potential automated reboots of instances from disappeared sleds into its own thing.
I'm gonna close this out; detecting unresponsive sleds seems like it'll be part of the ongoing FMA/health-checking work.
There are various cases where the control plane might discover that Instances we thought were running are, in fact, not. Examples (from concrete to speculative):
Relaying some notes from @gjcolombo (my apologies where I've got these details wrong):
We may view this issue as a dup of #3742, but I wanted to file this separate ticket to reflect more precisely what's needed for sled removal. The implementation may well just be #3742.