Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state #4004

Open
gjcolombo opened this issue Aug 31, 2023 · 7 comments · May be fixed by #6795
Labels
known issue (to include in customer documentation and training), nexus (related to nexus), Sled Agent (related to the per-sled configuration and management)

@gjcolombo
Contributor

Today, requests to stop a running instance must by necessity involve the instance's active Propolis (sled agent sends the stop request to Propolis; the instance is stopped when Propolis says so, at which point sled agent cleans up the Propolis zone and all related objects). If an instance's Propolis is not responding, or there is no active Propolis, there is no obvious way to clean up the instance and recover.

A short-term workaround is to grant some form of API access to sled agent's "unregister instance" API, which forcibly executes the termination path (tearing down the Propolis zone and removing the instance from the sled's instance table) and can force an instance into a stopped state.

In the long run instance lifecycle management needs to be made more robust to Propolis failure and/or non-responsiveness.

@gjcolombo gjcolombo added the "Sled Agent" and "nexus" labels Aug 31, 2023
@jordanhendricks
Contributor

Related: #3209

@gjcolombo gjcolombo changed the title from "Instance stop doesn't work if Propolis panics or is unreachable" to "Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state" Oct 6, 2023
@gjcolombo
Contributor Author

In #4194, sled agent's "instance unregister" API assumes that it can produce the correct posterior VMM and instance states by emulating a "Propolis destroyed" state transition. (That is, sled agent's "rudely terminate this instance" function computes the next state by pretending that it immediately got a message from the current VMM that says "I am destroyed and my current migration has failed.")

This is fine today because the unregister API is only used when unwinding start and migrate sagas, where the VMMs that are subject to unregistration have by definition not gotten to do anything interesting yet. It's less fine if Propolis has already begun to run, and especially not fine if we're force-quitting an instance that's a migration target:

  1. Source and target VMMs agree that migration is done
  2. Just before they tell their respective sled agents, Nexus shows up and rudely terminates the source
  3. The emulated "VMM destroyed, migration failed" transition retires the source and the migration, which makes it possible to start the instance again...
  4. ...but the target might actually begin to run while this is happening!

This seems dangerous. We probably want to adjust the synchronization here to be something more like the following:

  1. Nexus "disowns" a particular VMM, e.g. by replacing its ID in an instance record with a zeroed ID (or some other set of sentinel values that prevents the instance from restarting)
  2. Nexus terminates all of the disowned VMMs
  3. Nexus clears out the sentinels to allow the instance to start again

This would need to be done in a saga to ensure the whole process runs to completion. More design work's needed here. We just need to do that work before we hook up any external APIs to the existing instance_ensure_unregistered sled agent interface.
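The disown, terminate, and clear sequence proposed above could be sketched roughly as follows. This is a minimal in-memory illustration under stated assumptions, not Nexus code: `VmmSlot`, `InstanceRecord`, `disown`, `clear_sentinels`, `force_terminate`, and the `terminate` callback are all hypothetical names, and a real implementation would presumably run each step as a saga node against the database rather than mutating a struct.

```rust
/// Hypothetical stand-in for a VMM pointer stored in an instance record.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmmSlot {
    /// No VMM is linked; the instance may be started.
    Empty,
    /// A live VMM, identified by a (hypothetical) numeric ID.
    Active(u64),
    /// Sentinel: the VMM was disowned and is being torn down, so the
    /// instance must not be started yet.
    Disowned,
}

/// Hypothetical, heavily simplified instance record.
#[derive(Debug)]
struct InstanceRecord {
    active: VmmSlot,
    migration_target: VmmSlot,
}

impl InstanceRecord {
    /// The instance may only start when no VMM is linked or being torn down.
    fn can_start(&self) -> bool {
        self.active == VmmSlot::Empty && self.migration_target == VmmSlot::Empty
    }
}

/// Step 1: replace every live VMM ID with the sentinel and return the IDs
/// that now need to be terminated.
fn disown(rec: &mut InstanceRecord) -> Vec<u64> {
    let mut disowned = Vec::new();
    for slot in [&mut rec.active, &mut rec.migration_target] {
        if let VmmSlot::Active(id) = *slot {
            disowned.push(id);
            *slot = VmmSlot::Disowned;
        }
    }
    disowned
}

/// Step 3: clear the sentinels so the instance can be started again.
fn clear_sentinels(rec: &mut InstanceRecord) {
    for slot in [&mut rec.active, &mut rec.migration_target] {
        if *slot == VmmSlot::Disowned {
            *slot = VmmSlot::Empty;
        }
    }
}

/// Runs all three steps in order. `terminate` stands in for step 2, the
/// call that asks a sled agent to tear down one VMM.
fn force_terminate(rec: &mut InstanceRecord, mut terminate: impl FnMut(u64)) {
    let ids = disown(rec);
    for id in ids {
        terminate(id);
    }
    clear_sentinels(rec);
}
```

The point of the sentinel is that `can_start` stays false for the whole window between disowning and clearing, so a migration target that begins to run during teardown cannot coexist with a freshly started instance.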

@davepacheco
Collaborator

Also related: #4872

@askfongjojo askfongjojo added the "known issue" label Mar 9, 2024
@askfongjojo askfongjojo added this to the 8 milestone Mar 9, 2024
@askfongjojo askfongjojo modified the milestones: 8, 9 Apr 24, 2024
@morlandi7 morlandi7 modified the milestones: 9, 10 Jul 1, 2024
hawkw added a commit that referenced this issue Aug 9, 2024
A number of bugs relating to guest instance lifecycle management have
been observed. These include:

- Instances getting "stuck" in a transient state, such as `Starting` or
`Stopping`, with no way to forcibly terminate them (#4004)
- Race conditions between instances starting and receiving state
updates, which cause provisioning counters to underflow (#5042)
- Instances entering and exiting the `Failed` state when nothing is
actually wrong with them, potentially leaking virtual resources (#4226)

These typically require support intervention to resolve.

Broadly, these issues exist because the control plane's current
mechanisms for understanding and managing an instance's lifecycle state
machine are "kind of a mess". In particular:

- **(Conceptual) ownership of the CRDB `instance` record is currently
split between Nexus and sled-agent(s).** Although Nexus is the only
entity that actually reads or writes to the database, the instance's
runtime state is also modified by the sled-agents that manage its active
Propolis (and, if it's migrating, its target Propolis), and written to
CRDB on their behalf by Nexus. This means that there are multiple copies
of the instance's state in different places at the same time, which can
potentially get out of sync. When an instance is migrating, its state is
updated by two different sled-agents, and they may potentially generate
state updates that conflict with each other. And, splitting the
responsibility between Nexus and sled-agent makes the code more complex
and harder to understand: there is no one place where all instance state
machine transitions are performed.
- **Nexus doesn't ensure that instance state updates are processed
reliably.** Instance state transitions triggered by user actions, such
as `instance-start` and `instance-delete`, are performed by distributed
sagas, ensuring that they run to completion even if the Nexus instance
executing them comes to an untimely end. This is *not* the case for
operations that result from instance state transitions reported by
sled-agents, which just happen in the HTTP APIs for reporting instance
states. If the Nexus processing such a transition crashes, gets network
partition'd, or encounters a transient error, the instance is left in
an incomplete state and the remainder of the operation will not be
performed.

This branch rewrites much of the control plane's instance state
management subsystem to resolve these issues. At a high level, it makes
the following changes:

- **Nexus is now the sole owner of the `instance` record.** Sled-agents
no longer have their own copies of an instance's `InstanceRuntimeState`,
and do not generate changes to that state when reporting instance
observations to Nexus. Instead, the sled-agent only publishes updates to
the `vmm` and `migration` records (which are never modified by Nexus
directly) and Nexus is the only entity responsible for determining how
an instance's state should change in response to a VMM or migration
state update.
- **When an instance has an active VMM, its effective external state is
determined primarily by the active `vmm` record**, so that fewer state
transitions *require* changes to the `instance` record. PR #5854 laid
the groundwork for this change, but it's relevant here as well.
- **All updates to an `instance` record (and resources conceptually
owned by that instance) are performed by a distributed saga.** I've
introduced a new `instance-update` saga, which is responsible for
performing all changes to the `instance` record, virtual provisioning
resources, and instance network config that are performed as part of a
state transition. Moving this to a saga helps us to ensure that these
operations are always run to completion, even in the event of a sudden
Nexus death.
- **Consistency of instance state changes is ensured by distributed
locking.** State changes may be published by multiple sled-agents to
different Nexus replicas. If one Nexus replica is processing a state
change received from a sled-agent, and then the instance's state changes
again, and the sled-agent publishes that state change to a *different*
Nexus...lots of bad things can happen, since the second state change may
be performed from the previous initial state, when it *should* have a
"happens-after" relationship with the other state transition. And, some
operations may contradict each other when performed concurrently.

To prevent these race conditions, this PR has the dubious honor of using
the first _distributed lock_ in the Oxide control plane, the "instance
updater lock". I introduced the locking primitives in PR #5831 --- see
that branch for more discussion of locking.
- **Background tasks are added to prevent missed updates**. To ensure we
cannot accidentally miss an instance update even if a Nexus dies, hits a
network partition, or just chooses to eat the state update accidentally,
we add a new `instance-updater` background task, which queries the
database for instances that are in states that require an update saga
without such a saga running, and starts the requisite sagas.
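
The "happens-after" requirement in the locking bullet above can be illustrated with a generic generation-checked lock. This is only a sketch: how the actual instance updater lock from PR #5831 works is not described in this message, so `LockTable`, `LockRow`, `Held`, `try_lock`, and `unlock` are hypothetical names, and the in-memory `HashMap` stands in for what would have to be an atomic conditional update in the database.

```rust
use std::collections::HashMap;

/// Hypothetical lock state for one instance.
#[derive(Debug, Default)]
struct LockRow {
    /// ID of the updater currently holding the lock, if any.
    holder: Option<u64>,
    /// Incremented on every successful lock and unlock.
    generation: u64,
}

/// Proof of lock ownership handed back to the caller.
#[derive(Debug, PartialEq, Eq)]
struct Held {
    instance: u64,
    holder: u64,
    generation: u64,
}

/// In-memory stand-in for the table that stores lock state.
#[derive(Default)]
struct LockTable {
    rows: HashMap<u64, LockRow>,
}

impl LockTable {
    /// Acquires the lock only if nobody holds it.
    fn try_lock(&mut self, instance: u64, holder: u64) -> Option<Held> {
        let row = self.rows.entry(instance).or_default();
        if row.holder.is_some() {
            return None;
        }
        row.holder = Some(holder);
        row.generation += 1;
        Some(Held { instance, holder, generation: row.generation })
    }

    /// Releases the lock only if `held` still matches the stored holder
    /// and generation, so a stale holder cannot release a newer lock.
    fn unlock(&mut self, held: &Held) -> bool {
        match self.rows.get_mut(&held.instance) {
            Some(row)
                if row.holder == Some(held.holder)
                    && row.generation == held.generation =>
            {
                row.holder = None;
                row.generation += 1;
                true
            }
            _ => false,
        }
    }
}
```

With a lock of this shape, a second Nexus that receives a later state change cannot start its update until the first has released the lock, and it then works from the state the first update left behind.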

Currently, the instance update saga runs in the following cases:

- An instance's active VMM transitions to `Destroyed`, in which case the
instance's virtual resources are cleaned up and the active VMM is
unlinked.
- Either side of an instance's live migration reports that the migration
has completed successfully.
- Either side of an instance's live migration reports that the migration
has failed.
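
As a rough illustration of these trigger conditions, the check could look something like the sketch below. It is not the actual saga or background task code: `VmmState`, `MigrationState`, `Observation`, and `needs_update_saga` are hypothetical names, and the real state enums have more variants than are shown here.

```rust
/// Hypothetical, reduced set of VMM states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmmState {
    Running,
    Destroyed,
}

/// Hypothetical, reduced set of per-side migration states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MigrationState {
    InProgress,
    Completed,
    Failed,
}

/// What Nexus has most recently observed for one instance.
struct Observation {
    /// State of the active VMM, if the instance has one.
    active_vmm: Option<VmmState>,
    /// (source-side state, target-side state) of an ongoing migration.
    migration: Option<(MigrationState, MigrationState)>,
}

/// Returns true when any of the three listed cases applies.
fn needs_update_saga(obs: &Observation) -> bool {
    let vmm_destroyed = obs.active_vmm == Some(VmmState::Destroyed);
    let migration_terminal = match obs.migration {
        // "Either side" reporting completion or failure is enough.
        Some((source, target)) => [source, target].iter().any(|side| {
            matches!(side, MigrationState::Completed | MigrationState::Failed)
        }),
        None => false,
    };
    vmm_destroyed || migration_terminal
}
```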

The inner workings of the instance-update saga itself are fairly complex,
and have some kind of interesting idiosyncrasies relative to the existing
sagas. I've written up a [lengthy comment] that provides an overview of
the theory behind the design of the saga and its principles of
operation, so I won't reproduce that in this commit message.

[lengthy comment]:
https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254
@morlandi7 morlandi7 modified the milestones: 10, 11 Aug 14, 2024
@morlandi7 morlandi7 modified the milestones: 11, 12 Sep 26, 2024
@hawkw
Member

hawkw commented Oct 3, 2024

Do we expect the interface into the forcibly-terminate operation to be exposed in the external API, and if so, would we want to make it a more privileged operation than the normal instance-stop and instance-delete APIs?

@gjcolombo
Contributor Author

> Do we expect the interface into the forcibly-terminate operation to be exposed in the external API, and if so, would we want to make it a more privileged operation than the normal instance-stop and instance-delete APIs?

I think the answers are "yes" and "no." The idea of the API is to give users a crowbar that they can use to unstick an instance in the unlikely event that it gets stuck in a transitional Propolis state like Starting or Stopping. I would give the API an appropriately forceful name ("force-quit"? "force-terminate"?) to try to emphasize that this just blows away the entire VM process and doesn't give anything in it the chance to run any cleanup logic, and I'm not sure I'd add a console option for it (right away, anyway), but I do think it should be available to regular users.

@hawkw hawkw self-assigned this Oct 4, 2024
@hawkw
Member

hawkw commented Oct 4, 2024

Yup, that makes sense.

@hawkw
Member

hawkw commented Oct 9, 2024

We recently discussed this, and came to the conclusion that it seems unfortunate to present the user with two different ways to stop an instance, one of which has a big warning label on it that says "only do this in case of emergencies" but has no difference in observable effects from the guest's perspective.[1] This forces the user to decide which way of essentially just pulling the virtual power cord out of their VM to use.

Instead, we might consider just making the system resilient to Propolis getting stuck or misbehaving whilst shutting down --- a normal instance-stop request could cause sled-agent to set a timeout, after which the Propolis zone is forcefully deleted if Propolis doesn't report in to say it's exited normally.
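
A minimal sketch of that timeout idea, assuming sled-agent learns about a normal Propolis exit through some notification channel. The names `StopOutcome`, `stop_with_timeout`, `exited`, `grace`, and `force_teardown` are hypothetical and do not correspond to existing sled-agent code; a standard-library channel stands in for however Propolis would actually report in.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// How a stop request ended.
#[derive(Debug, PartialEq, Eq)]
enum StopOutcome {
    /// Propolis reported a normal exit within the grace period.
    ExitedNormally,
    /// Propolis did not report in, so the zone was torn down forcibly.
    ForciblyTerminated,
}

/// Waits up to `grace` for an exit notification on `exited`, and falls
/// back to `force_teardown` (a stand-in for deleting the Propolis zone)
/// if none arrives.
fn stop_with_timeout(
    exited: &Receiver<()>,
    grace: Duration,
    force_teardown: impl FnOnce(),
) -> StopOutcome {
    match exited.recv_timeout(grace) {
        Ok(()) => StopOutcome::ExitedNormally,
        // A disconnected channel means nothing is left that could ever
        // report a normal exit, so it is treated like a timeout.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
            force_teardown();
            StopOutcome::ForciblyTerminated
        }
    }
}
```

With this shape the user only ever issues a normal instance-stop, and the forceful path is an internal fallback rather than a second API.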

Furthermore, we would like to eventually make Propolis attempt to shut guests down more gracefully, but that's out of scope for this issue. See oxidecomputer/propolis#784

Footnotes

  1. Propolis does not currently attempt to gracefully shut down the guest, so the difference between "stop" and "force-terminate" is just whether Propolis itself is given the opportunity to put its affairs in order before the zone is torn down.
