Revisit moving instances to Failed in handle_instance_put_result #4226

Closed
gjcolombo opened this issue Oct 7, 2023 · 2 comments
Labels: nexus (Related to nexus), Sled Agent (Related to the Per-Sled Configuration and Management), virtualization (Propolis Integration & VM Management)

Comments

@gjcolombo (Contributor)

Follow-up from the refactoring in #4194.

If Nexus calls sled agent to do something that registers an instance or changes its state, sled agent will try to return a Result<Option<SledInstanceState>, Error>. Nexus will then get back one of the following values (a rough sketch of this dispatch appears just after the list):

  • Ok(None) if the call succeeded and this didn't change what sled agent thinks the instance/VMM state is
  • Ok(Some(new_state)) if the call succeeded and sled agent wants Nexus to try to write some new state back to CRDB
  • Err(e) if the call failed, where e: progenitor_client::Error<sled_agent_client::Error>
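A minimal sketch of that three-way dispatch, with hypothetical names (the function shape and `write_back_state` are stand-ins for the real Nexus plumbing; only the error arm corresponds to the code quoted below):

// Sketch only: `handle_put_result_sketch` and `write_back_state` are
// placeholders, not the actual Nexus functions.
async fn handle_put_result_sketch(
    nexus: &Nexus,
    db_instance: &db::model::Instance,
    result: Result<
        Option<SledInstanceState>,
        progenitor_client::Error<sled_agent_client::Error>,
    >,
) -> Result<(), Error> {
    match result {
        // Sled agent's view of the instance/VMM didn't change; nothing to
        // write back to CRDB.
        Ok(None) => Ok(()),
        // Sled agent reported new state; try to persist it to CRDB.
        Ok(Some(new_state)) => {
            nexus.write_back_state(db_instance, &new_state).await
        }
        // The call failed: Nexus has to guess whether the instance is
        // actually broken or whether the request was merely rejected. This
        // is the "tightrope" walked by the code quoted below.
        Err(e) => Err(e.into()),
    }
}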

The first two cases are handled by simply trying to write a state update to CRDB and aren't important here. The error case, however, walks a tightrope:

Err(e) => {
    // The sled-agent has told us that it can't do what we
    // requested, but does that mean a failure? One example would be
    // if we try to "reboot" a stopped instance. That shouldn't
    // transition the instance to failed. But if the sled-agent
    // *can't* boot a stopped instance, that should transition
    // to failed.
    //
    // Without a richer error type, let the sled-agent tell Nexus
    // what to do with status codes.
    error!(self.log, "saw {} from instance_put!", e);

    // Convert to the Omicron API error type.
    //
    // N.B. The match below assumes that this conversion will turn
    // any 400-level error status from sled agent into an
    // `Error::InvalidRequest`.
    let e = e.into();

    match &e {
        // Bad request shouldn't change the instance state.
        Error::InvalidRequest { .. } => Err(e),
        // Internal server error (or anything else) should change
        // the instance state to failed, we don't know what state
        // the instance is in.
        _ => {
            let new_runtime = db::model::InstanceRuntimeState {
                state: db::model::InstanceState::new(
                    InstanceState::Failed,
                ),
                gen: db_instance.runtime_state.gen.next().into(),
                ..db_instance.runtime_state.clone()
            };

            // XXX what if this fails?
            let result = self
                .db_datastore
                .instance_update_runtime(
                    &db_instance.id(),
                    &new_runtime,
                )
                .await;

            error!(
                self.log,
                "saw {:?} from setting InstanceState::Failed after bad instance_put",
                result,
            );

            Err(e)
        }
    }
}

Nexus will take the Progenitor error it got back, feed it through a From impl to convert it to an Omicron external API error, and then see if that error is a Bad Request error or something else. In the former case, the update is dropped; in the latter case, Nexus assumes some catastrophe has happened and that the instance has to be marked as Failed.

The relevant From impl is here:

// TODO-security this `From` may give us a shortcut in situations where we
// want more robust consideration of errors. For example, while some errors
// from other services may directly result in errors that percolate to the
// external client, others may require, for example, retries with an alternate
// service instance or additional interpretation to sanitize the output error.
// This should be removed to avoid leaking data.
impl<T: ClientError> From<progenitor::progenitor_client::Error<T>> for Error {
    fn from(e: progenitor::progenitor_client::Error<T>) -> Self {
        match e {
            // This error indicates that the inputs were not valid for this API
            // call. It's reflective of a client-side programming error.
            progenitor::progenitor_client::Error::InvalidRequest(msg) => {
                Error::internal_error(&format!("InvalidRequest: {}", msg))
            }
            // This error indicates a problem with the request to the remote
            // service that did not result in an HTTP response code, but rather
            // pertained to local (i.e. client-side) encoding or network
            // communication.
            progenitor::progenitor_client::Error::CommunicationError(ee) => {
                Error::internal_error(&format!("CommunicationError: {}", ee))
            }
            // This error represents an expected error from the remote service.
            progenitor::progenitor_client::Error::ErrorResponse(rv) => {
                let message = rv.message();
                match rv.status() {
                    http::StatusCode::SERVICE_UNAVAILABLE => {
                        Error::unavail(&message)
                    }
                    status if status.is_client_error() => {
                        Error::invalid_request(&message)
                    }
                    _ => Error::internal_error(&message),
                }
            }
            // This error indicates that the body returned by the client didn't
            // match what was documented in the OpenAPI description for the
            // service. This could only happen for us in the case of a severe
            // logic/encoding bug in the remote service or due to a failure of
            // our version constraints (i.e. that the call was to a newer
            // service with an incompatible response).
            progenitor::progenitor_client::Error::InvalidResponsePayload(
                ee,
            ) => Error::internal_error(&format!(
                "InvalidResponsePayload: {}",
                ee,
            )),
            // This error indicates that the client generated a response code
            // that was not described in the OpenAPI description for the
            // service; this could be a success or failure response, but either
            // way it indicates a logic or version error as above.
            progenitor::progenitor_client::Error::UnexpectedResponse(r) => {
                Error::internal_error(&format!(
                    "UnexpectedResponse: status code {}",
                    r.status(),
                ))
            }
        }
    }
}

Notice the following:

  • Progenitor errors that didn't produce a response from sled agent--e.g. disconnections due to timeouts or network weather--all get turned into internal errors. These will cause an instance to move to Failed even if the problem was completely transient.
  • Sled agent responses get flattened into either 400 Bad Request (for all 4XX errors) or 500 Internal Server Error (everything else), unless they happen to be 503 Service Unavailable, which is preserved. (See #3238, "sled agent and Nexus frequently flatten errors into 500 Internal Server Error".)

This creates a great many ways for an instance to be marked as Failed, even if the VM is actually perfectly healthy. The Failed state is also a pain to deal with because a Failed instance can only be deleted and not recovered in any other way (see #2825 and especially #2825 (comment)).

Apart from all of this, an instance can be marked as Failed because of a Propolis failure: for example, if Propolis hits an error while setting up a VM component, it will publish the Failed state and that will bubble back up to Nexus via sled agent.

We should consider a subtler approach to all this. One possible starting place is to distinguish Propolis failures (which really do mean "this VM is dead") from errors accessing the instance from the control plane. We could then, e.g., indicate the latter faults without necessarily changing the instance state, allow further attempts to try to change the instance's state from its last-known state, and/or try to recover connectivity to the instance from a background task.
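As one hedged illustration of that distinction, a sketch using entirely hypothetical names (nothing here exists in Omicron; it just encodes the classification proposed above):

// Hypothetical classification of instance-put failures; the type and variant
// names are placeholders, not real Omicron types.
enum InstancePutFailure {
    /// Propolis itself reported that the VM is dead (e.g. it published the
    /// Failed state while setting up a VM component). Moving the instance to
    /// Failed is justified.
    VmmFailed(String),
    /// The control plane couldn't reach sled agent (timeout, connection
    /// reset, "network weather"). The VM may be perfectly healthy; leave the
    /// instance in its last-known state and let a background task recheck.
    SledAgentUnreachable(String),
    /// Sled agent rejected the request as invalid (e.g. a "reboot" of a
    /// stopped instance). Report the error; don't touch instance state.
    InvalidRequest(String),
}

fn should_mark_failed(failure: &InstancePutFailure) -> bool {
    // Only a genuine VMM failure justifies the one-way trip to Failed.
    matches!(failure, InstancePutFailure::VmmFailed(_))
}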

gjcolombo added the Sled Agent (Related to the Per-Sled Configuration and Management), nexus (Related to nexus), and virtualization (Propolis Integration & VM Management) labels on Oct 7, 2023
gjcolombo added a commit that referenced this issue Dec 18, 2023
…ype from calls to stop/reboot (#4711)

Restore `instance_reboot` and `instance_stop` to their prior behavior:
if these routines try to contact sled agent and get back a server error,
mark the instance as unhealthy and move it to the Failed state.

Also use `#[source]` instead of message interpolation in
`InstanceStateChangeError::SledAgent`.

This restores the status quo ante from #4682 in anticipation of reaching
a better overall mechanism for dealing with failures to communicate
about instances with sled agents. See #3206, #3238, and #4226 for more
discussion.

Tests: new integration test; stood up a dev cluster, started an
instance, killed the zone with `zoneadm halt`, and verified that calls
to reboot/stop the instance eventually marked it as Failed (due to a
timeout attempting to contact the Propolis zone).

Fixes #4709.
hawkw added a commit that referenced this issue Aug 9, 2024
A number of bugs relating to guest instance lifecycle management have
been observed. These include:

- Instances getting "stuck" in a transient state, such as `Starting` or
`Stopping`, with no way to forcibly terminate them (#4004)
- Race conditions between instances starting and receiving state
updates, which cause provisioning counters to underflow (#5042)
- Instances entering and exiting the `Failed` state when nothing is
actually wrong with them, potentially leaking virtual resources (#4226)

These typically require support intervention to resolve.

Broadly, these issues exist because the control plane's current
mechanisms for understanding and managing an instance's lifecycle state
machine are "kind of a mess". In particular:

- **(Conceptual) ownership of the CRDB `instance` record is currently
split between Nexus and sled-agent(s).** Although Nexus is the only
entity that actually reads or writes to the database, the instance's
runtime state is also modified by the sled-agents that manage its active
Propolis (and, if it's migrating, its target Propolis), and written to
CRDB on their behalf by Nexus. This means that there are multiple copies
of the instance's state in different places at the same time, which can
potentially get out of sync. When an instance is migrating, its state is
updated by two different sled-agents, and they may potentially generate
state updates that conflict with each other. And, splitting the
responsibility between Nexus and sled-agent makes the code more complex
and harder to understand: there is no one place where all instance state
machine transitions are performed.
- **Nexus doesn't ensure that instance state updates are processed
reliably.** Instance state transitions triggered by user actions, such
as `instance-start` and `instance-delete`, are performed by distributed
sagas, ensuring that they run to completion even if the Nexus instance
executing them comes to an untimely end. This is *not* the case for
operations that result from instance state transitions reported by
sled-agents, which just happen in the HTTP APIs for reporting instance
states. If the Nexus processing such a transition crashes, gets network
partitioned, or encounters a transient error, the instance is left in
an incomplete state and the remainder of the operation will not be
performed.

This branch rewrites much of the control plane's instance state
management subsystem to resolve these issues. At a high level, it makes
the following changes:

- **Nexus is now the sole owner of the `instance` record.** Sled-agents
no longer have their own copies of an instance's `InstanceRuntimeState`,
and do not generate changes to that state when reporting instance
observations to Nexus. Instead, the sled-agent only publishes updates to
the `vmm` and `migration` records (which are never modified by Nexus
directly) and Nexus is the only entity responsible for determining how
an instance's state should change in response to a VMM or migration
state update.
- **When an instance has an active VMM, its effective external state is
determined primarily by the active `vmm` record**, so that fewer state
transitions *require* changes to the `instance` record. PR #5854 laid
the groundwork for this change, but it's relevant here as well.
- **All updates to an `instance` record (and resources conceptually
owned by that instance) are performed by a distributed saga.** I've
introduced a new `instance-update` saga, which is responsible for
performing all changes to the `instance` record, virtual provisioning
resources, and instance network config that are performed as part of a
state transition. Moving this to a saga helps us to ensure that these
operations are always run to completion, even in the event of a sudden
Nexus death.
- **Consistency of instance state changes is ensured by distributed
locking.** State changes may be published by multiple sled-agents to
different Nexus replicas. If one Nexus replica is processing a state
change received from a sled-agent, and then the instance's state changes
again, and the sled-agent publishes that state change to a *different*
Nexus...lots of bad things can happen, since the second state change may
be performed from the previous initial state, when it *should* have a
"happens-after" relationship with the other state transition. And, some
operations may contradict each other when performed concurrently.

To prevent these race conditions, this PR has the dubious honor of using
the first _distributed lock_ in the Oxide control plane, the "instance
updater lock". I introduced the locking primitives in PR #5831 --- see
that branch for more discussion of locking.
- **Background tasks are added to prevent missed updates**. To ensure we
cannot accidentally miss an instance update even if a Nexus dies, hits a
network partition, or just chooses to eat the state update accidentally,
we add a new `instance-updater` background task, which queries the
database for instances that are in states that require an update saga
without such a saga running, and starts the requisite sagas.

Currently, the instance update saga runs in the following cases:

- An instance's active VMM transitions to `Destroyed`, in which case the
instance's virtual resources are cleaned up and the active VMM is
unlinked.
- Either side of an instance's live migration reports that the migration
has completed successfully.
- Either side of an instance's live migration reports that the migration
has failed.

The inner workings of the instance-update saga are fairly complex,
and it has some interesting idiosyncrasies relative to the existing
sagas. I've written up a [lengthy comment] that provides an overview of
the theory behind the design of the saga and its principles of
operation, so I won't reproduce that in this commit message.

[lengthy comment]:
https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254
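To make the `instance-updater` background task described in the commit above concrete, here is a rough, synchronous sketch under placeholder names (none of these functions or types are Omicron's real APIs; the actual task is async and uses the Nexus datastore and saga machinery):

// Every name below is a placeholder used only for illustration.
struct UpdaterDeps {
    /// Returns IDs of instances whose VMM/migration state calls for an
    /// update saga (active VMM destroyed, migration resolved) when no such
    /// saga is currently running.
    list_stale_instances: fn() -> Vec<String>,
    /// Kicks off an instance-update saga for one instance.
    start_update_saga: fn(&str) -> Result<(), String>,
}

fn instance_updater_activate(deps: &UpdaterDeps) {
    for instance_id in (deps.list_stale_instances)() {
        // The saga re-checks state under the instance-updater lock, so racing
        // with another Nexus replica starting the same saga is tolerable.
        if let Err(e) = (deps.start_update_saga)(&instance_id) {
            // Log and move on; the next periodic activation will retry.
            eprintln!("failed to start update saga for {instance_id}: {e}");
        }
    }
}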
gjcolombo added a commit that referenced this issue Aug 27, 2024
Change sled agent's instance lookup tables so that Propolis jobs are
indexed by Propolis/VMM IDs instead of instance IDs.
This is a prerequisite to revisiting how the Failed instance state
works. See RFD 486 section 6.1 for all the details of why this is
needed, but very broadly: when an instance's VMM is Destroyed, we'd like
sled agent to tell Nexus that *before* the agent deregisters the
instance from the sled, for reasons described in the RFD; but if we do
that with no other changes, there's a race where Nexus may try to
restart the instance on the same sled before sled agent can update its
instance table, causing instance start to fail.

To achieve this:

- In sled agent, change the `InstanceManagerRunner`'s instance map to a
  `BTreeMap<PropolisUuid, Instance>`, then clean up all the compilation
  errors.
- In Nexus:
  - Make callers of instance APIs furnish a Propolis ID instead of an
    instance ID. This is generally very straightforward because they already
    had to get a VMM record to figure out what sled to talk to.
  - Change `cpapi_instances_put` to take a Propolis ID instead of an
    instance ID. Regular sled agent still has both IDs, but with these
    changes, simulated sled agents only have a Propolis ID to work with, and
    plumbing an instance ID down to them requires significant code changes.
- Update test code:
  - Unify the Nexus helper routines that let integration tests get sled
    agent clients or sled IDs; now they get a single struct containing both
    of those and the instance's Propolis IDs.
  - Update users of the simulated agent's `poke` endpoints to use Propolis
    IDs.
  - Delete the "detach disks on instance stop" bits of simulated sled
    agent. These don't appear to be load-bearing, they don't correspond to
    any behavior in the actual sled agent (which doesn't manage disk
    attachment or detachment), and it was a pain to rework them to work with
    Propolis IDs.

Tests: cargo nextest.

Related: #4226 and #4872, among others.
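A small sketch of the re-keying described in the commit above, with placeholder types (the real `Instance` and `PropolisUuid` live in sled agent and Omicron's ID crates; these stubs are only illustrative):

use std::collections::BTreeMap;
use uuid::Uuid;

// Placeholder stand-ins for sled agent's real types.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct PropolisUuid(Uuid);
struct Instance; // Propolis job state, running task handles, etc.

// Keying the map by Propolis/VMM ID rather than instance ID means a new VMM
// for the same instance can be registered while the record for the old,
// Destroyed VMM is still being torn down, closing the restart race described
// above.
type InstanceMap = BTreeMap<PropolisUuid, Instance>;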
@hawkw (Member) commented Sep 24, 2024

I think #6455, which implemented RFD 486's more principled rules for what sled-agent errors should move an instance to Failed, has closed this. @gjcolombo do you agree?

@gjcolombo (Contributor, Author)

Yeah, I think we've taken care of this one.
