-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revisit moving instances to Failed in handle_instance_put_result
#4226
Labels
nexus
Related to nexus
Sled Agent
Related to the Per-Sled Configuration and Management
virtualization
Propolis Integration & VM Management
Comments
gjcolombo
added
Sled Agent
Related to the Per-Sled Configuration and Management
nexus
Related to nexus
virtualization
Propolis Integration & VM Management
labels
Oct 7, 2023
This was referenced Dec 13, 2023
gjcolombo
added a commit
that referenced
this issue
Dec 18, 2023
…ype from calls to stop/reboot (#4711) Restore `instance_reboot` and `instance_stop` to their prior behavior: if these routines try to contact sled agent and get back a server error, mark the instance as unhealthy and move it to the Failed state. Also use `#[source]` instead of message interpolation in `InstanceStateChangeError::SledAgent`. This restores the status quo ante from #4682 in anticipation of reaching a better overall mechanism for dealing with failures to communicate about instances with sled agents. See #3206, #3238, and #4226 for more discussion. Tests: new integration test; stood up a dev cluster, started an instance, killed the zone with `zoneadm halt`, and verified that calls to reboot/stop the instance eventually marked it as Failed (due to a timeout attempting to contact the Propolis zone). Fixes #4709.
hawkw
added a commit
that referenced
this issue
Aug 9, 2024
A number of bugs relating to guest instance lifecycle management have been observed. These include: - Instances getting "stuck" in a transient state, such as `Starting` or `Stopping`, with no way to forcibly terminate them (#4004) - Race conditions between instances starting and receiving state updates, which cause provisioning counters to underflow (#5042) - Instances entering and exiting the `Failed` state when nothing is actually wrong with them, potentially leaking virtual resources (#4226) These typically require support intervention to resolve. Broadly , these issues exist because the control plane's current mechanisms for understanding and managing an instance's lifecycle state machine are "kind of a mess". In particular: - **(Conceptual) ownership of the CRDB `instance` record is currently split between Nexus and sled-agent(s).** Although Nexus is the only entity that actually reads or writes to the database, the instance's runtime state is also modified by the sled-agents that manage its active Propolis (and, if it's migrating, it's target Propolis), and written to CRDB on their behalf by Nexus. This means that there are multiple copies of the instance's state in different places at the same time, which can potentially get out of sync. When an instance is migrating, its state is updated by two different sled-agents, and they may potentially generate state updates that conflict with each other. And, splitting the responsibility between Nexus and sled-agent makes the code more complex and harder to understand: there is no one place where all instance state machine transitions are performed. - **Nexus doesn't ensure that instance state updates are processed reliably.** Instance state transitions triggered by user actions, such as `instance-start` and `instance-delete`, are performed by distributed sagas, ensuring that they run to completion even if the Nexus instance executing them comes to an untimely end. This is *not* the case for operations that result from instance state transitions reported by sled-agents, which just happen in the HTTP APIs for reporting instance states. If the Nexus processing such a transition crashes, gets network partition'd, or encountering a transient error, the instance is left in an incomplete state and the remainder of the operation will not be performed. This branch rewrites much of the control plane's instance state management subsystem to resolve these issues. At a high level, it makes the following high-level changes: - **Nexus is now the sole owner of the `instance` record.** Sled-agents no longer have their own copies of an instance's `InstanceRuntimeState`, and do not generate changes to that state when reporting instance observations to Nexus. Instead, the sled-agent only publishes updates to the `vmm` and `migration` records (which are never modified by Nexus directly) and Nexus is the only entity responsible for determining how an instance's state should change in response to a VMM or migration state update. - **When an instance has an active VMM, its effective external state is determined primarily by the active `vmm` record**, so that fewer state transitions *require* changes to the `instance` record. PR #5854 laid the ground work for this change, but it's relevant here as well. - **All updates to an `instance` record (and resources conceptually owned by that instance) are performed by a distributed saga.** I've introduced a new `instance-update` saga, which is responsible for performing all changes to the `instance` record, virtual provisioning resources, and instance network config that are performed as part of a state transition. Moving this to a saga helps us to ensure that these operations are always run to completion, even in the event of a sudden Nexus death. - **Consistency of instance state changes is ensured by distributed locking.** State changes may be published by multiple sled-agents to different Nexus replicas. If one Nexus replica is processing a state change received from a sled-agent, and then the instance's state changes again, and the sled-agent publishes that state change to a *different* Nexus...lots of bad things can happen, since the second state change may be performed from the previous initial state, when it *should* have a "happens-after" relationship with the other state transition. And, some operations may contradict each other when performed concurrently. To prevent these race conditions, this PR has the dubious honor of using the first _distributed lock_ in the Oxide control plane, the "instance updater lock". I introduced the locking primitives in PR #5831 --- see that branch for more discussion of locking. - **Background tasks are added to prevent missed updates**. To ensure we cannot accidentally miss an instance update even if a Nexus dies, hits a network partition, or just chooses to eat the state update accidentally, we add a new `instance-updater` background task, which queries the database for instances that are in states that require an update saga without such a saga running, and starts the requisite sagas. Currently, the instance update saga runs in the following cases: - An instance's active VMM transitions to `Destroyed`, in which case the instance's virtual resources are cleaned up and the active VMM is unlinked. - Either side of an instance's live migration reports that the migration has completed successfully. - Either side of an instance's live migration reports that the migration has failed. The inner workings of the instance-update saga itself is fairly complex, and has some kind of interesting idiosyncrasies relative to the existing sagas. I've written up a [lengthy comment] that provides an overview of the theory behind the design of the saga and its principles of operation, so I won't reproduce that in this commit message. [lengthy comment]: https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254
gjcolombo
added a commit
that referenced
this issue
Aug 27, 2024
Change sled agent's instance lookup tables so that Propolis jobs are indexed by Propolis/VMM IDs instead of instance IDs. This is a prerequisite to revisiting how the Failed instance state works. See RFD 486 section 6.1 for all the details of why this is needed, but very broadly: when an instance's VMM is Destroyed, we'd like sled agent to tell Nexus that *before* the agent deregisters the instance from the sled, for reasons described in the RFD; but if we do that with no other changes, there's a race where Nexus may try to restart the instance on the same sled before sled agent can update its instance table, causing instance start to fail. To achieve this: - In sled agent, change the `InstanceManagerRunner`'s instance map to a `BTreeMap<PropolisUuid, Instance>`, then clean up all the compilation errors. - In Nexus: - Make callers of instance APIs furnish a Propolis ID instead of an instance ID. This is generally very straightforward because they already had to get a VMM record to figure out what sled to talk to. - Change `cpapi_instances_put` to take a Propolis ID instead of an instance ID. Regular sled agent still has both IDs, but with these changes, simulated sled agents only have a Propolis ID to work with, and plumbing an instance ID down to them requires significant code changes. - Update test code: - Unify the Nexus helper routines that let integration tests get sled agent clients or sled IDs; now they get a single struct containing both of those and the instance's Propolis IDs. - Update users of the simulated agent's `poke` endpoints to use Propolis IDs. - Delete the "detach disks on instance stop" bits of simulated sled agent. These don't appear to be load-bearing, they don't correspond to any behavior in the actual sled agent (which doesn't manage disk attachment or detachment), and it was a pain to rework them to work with Propolis IDs. Tests: cargo nextest. Related: #4226 and #4872, among others.
I think #6455, which implemented RFD 486's more principled rules for what sled-agent errors should move an instance to |
Yeah, I think we've taken care of this one. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
nexus
Related to nexus
Sled Agent
Related to the Per-Sled Configuration and Management
virtualization
Propolis Integration & VM Management
Follow-up from the refactoring in #4194.
If Nexus calls sled agent to do something that registers an instance or changes its state, sled agent will try to return a
Result<Option<SledInstanceState>, Error>
. Nexus will then get back one of the following values:Ok(None)
if the call succeeded and this didn't change what sled agent thinks the instance/VMM state isOk(Some(new_state))
if the call succeeded and sled agent wants Nexus to try to write some new state back to CRDBErr(e)
if the call failed, wheree: progenitor_client::Error<sled_agent_client::Error>
The first two cases are handled by simply trying to write a state update to CRDB and aren't important here. The error case, however, walks a tightrope:
omicron/nexus/src/app/instance.rs
Lines 891 to 943 in 2bb0fdc
Nexus will take the Progenitor error it got back, feed it through a
From
impl to convert it to an Omicron external API error, and then see if that error is a Bad Request error or something else. In the former case, the update is dropped; in the latter case, Nexus assumes some catastrophe has happened and that the instance has to be marked as Failed.The relevant From impl is here:
omicron/common/src/api/external/error.rs
Lines 355 to 418 in c030293
Notice the following:
This creates a great many ways for an instance to be marked as Failed, even if the VM is actually perfectly healthy. The Failed state is also a pain to deal with because a Failed instance can only be deleted and not recovered in any other way (see #2825 and especially #2825 (comment)).
Apart from all of this, an instance can be marked as Failed because of a Propolis failure: for example, if Propolis hits an error while setting up a VM component, it will publish the Failed state and that will bubble back up to Nexus via sled agent.
We should consider a subtler approach to all this. One possible starting place is to distinguish Propolis failures (which really do mean "this VM is dead") from errors accessing the instance from the control plane. We could then, e.g., indicate the latter faults without necessarily changing the instance state, allow further attempts to try to change the instance's state from its last-known state, and/or try to recover connectivity to the instance from a background task.
The text was updated successfully, but these errors were encountered: