[nexus] add POST /v1/instances/{instance}/force-terminate #6795

Draft · wants to merge 13 commits into base: main

Conversation

@hawkw (Member) commented Oct 7, 2024

Fixes #4004

Today, requests to stop a running instance must by necessity involve the
instance's active Propolis (sled agent sends the stop request to
Propolis; the instance is stopped when Propolis says so, at which point
sled agent cleans up the Propolis zone and all related objects). If an
instance's Propolis is not responding, or there is no active Propolis,
there is no obvious way to clean up the instance and recover.

In order to allow resolving otherwise stuck instances, this commit
introduces a new API endpoint, POST /v1/instances/{instance}/force-terminate, which calls directly into the
sled-agent's instance-ensure-unregistered API, telling it to rudely
terminate the VMM and destroy the Propolis zone, without waiting for it
to shut down politely.

The one complex-ish bit is the issue I fixed in commit 34c3058 around
what this API does in the case where the sled-agent has forgotten an
instance due to an unexpected restart. The sled-agent
instance-ensure-unregistered API is idempotent, which is right and
proper...but it introduces an issue in the case of a forgotten VMM after
a sled-agent restart. The instance-ensure-unregistered call will
return None because sled-agent doesn't know about that instance, but
if this is the first time we have discovered that sled-agent doesn't
know about the instance, we will need to move it to Failed. This is
okay to do, because the VMM generation number guards against the case
where we have raced with another instance-force-terminate call. We
will only move the VMM to Failed in that case if no one else has moved
it to Destroyed as the result of a successful
instance-ensure-unregistered in the interim.
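The generation-number guard described above can be sketched in isolation. This is a hypothetical, simplified model — the `Vmm` struct and `try_mark_failed` function below are illustrative stand-ins, not Omicron's actual types — showing how the `Failed` transition only applies if the record's generation still matches what the caller observed:

```rust
// Hypothetical sketch of a generation-guarded state transition; the names
// `Vmm`, `VmmState`, and `try_mark_failed` are illustrative, not Omicron APIs.

#[derive(Debug, Clone, Copy, PartialEq)]
enum VmmState {
    Running,
    Destroyed,
    Failed,
}

#[derive(Debug)]
struct Vmm {
    state: VmmState,
    generation: u64,
}

/// Conditionally move the VMM to `Failed`, but only if its generation still
/// matches what the caller observed when it read the record. Returns `true`
/// if the update was applied, mirroring the "Ok(false) when the record has
/// moved on" convention discussed later in this thread.
fn try_mark_failed(vmm: &mut Vmm, observed_generation: u64) -> bool {
    if vmm.generation != observed_generation {
        // Someone else (e.g. a successful instance-ensure-unregistered)
        // already advanced the record; leave their state in place.
        return false;
    }
    vmm.state = VmmState::Failed;
    vmm.generation += 1;
    true
}

fn main() {
    let mut vmm = Vmm { state: VmmState::Running, generation: 1 };
    let observed = vmm.generation;

    // A racing caller destroys the VMM first, bumping the generation.
    vmm.state = VmmState::Destroyed;
    vmm.generation += 1;

    // Our stale attempt to mark it Failed is now a no-op.
    assert!(!try_mark_failed(&mut vmm, observed));
    assert_eq!(vmm.state, VmmState::Destroyed);
    println!("state after race: {:?}", vmm.state);
}
```

In the real system the conditional update happens in a CockroachDB query rather than in memory, but the compare-against-observed-generation shape is the same.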

Resolved review threads: nexus/external-api/src/lib.rs, nexus/src/app/instance.rs (two threads)
Comment on lines +814 to +815:

```rust
let unregister_result =
    self.instance_ensure_unregistered(&propolis_id, &sled_id).await;
```
Contributor:

Does the start saga clean up properly if this happens between its ensure_registered and ensure_running steps? I think it works out: sis_ensure_running will fail; sis_ensure_registered_undo will also fail and try to move the VMM to Failed, but this doesn't block the saga from continuing to unwind; then I think we'll end up in SagaUnwound and end up with a VM that can be started again. Does that sound about right? If so, it might be worthwhile to add a comment mentioning this case.

@hawkw (Member Author):

I think so, but I would like to figure out whether it's possible to test this...

@hawkw (Member Author):

Okay, upon further inspection, I believe your assessment is correct and this is fine.

At first glance, it looks like `sis_ensure_registered_undo` failing to move the VMM to `Failed` would get the saga stuck:

```rust
match e {
    InstanceStateChangeError::SledAgent(inner) if inner.vmm_gone() => {
        error!(osagactx.log(),
            "start saga: failing instance after unregister failure";
            "instance_id" => %instance_id,
            "start_reason" => ?params.reason,
            "error" => ?inner);
        if let Err(set_failed_error) = osagactx
            .nexus()
            .mark_vmm_failed(&opctx, authz_instance, &db_vmm, &inner)
            .await
        {
            error!(osagactx.log(),
                "start saga: failed to mark instance as failed";
                "instance_id" => %instance_id,
                "start_reason" => ?params.reason,
                "error" => ?set_failed_error);
            Err(set_failed_error.into())
            // ...
```

However, `mark_vmm_failed` won't actually fail here if the VMM's state generation is stale; it will just scream about it:

```rust
// XXX: It's not clear what to do with this error; should it be
// bubbled back up to the caller?
Err(e) => error!(self.log,
    "failed to write Failed instance state to DB";
    "instance_id" => %instance_id,
    "vmm_id" => %vmm_id,
    "error" => ?e),
```

So, I think we're in the clear here. But, I feel a bit uncomfortable about this, because it seems like a change to the mark_vmm_failed error path to actually return an error could introduce a bug here, and I'm not immediately sure if there's an obvious way to write a regression test that would fail on such a change. Any ideas?

Contributor:

Can we do something with error types instead? That is: have mark_vmm_failed return its own error enum with "DB query failed" and "update too old" variants, and have sis_ensure_registered_undo match on specific error codes from this callee, such that a new failure mode will break the match. WDYT? It'd be nice to have an integration test, too, but I'm similarly having trouble figuring out how to inject a failure into this specific call, since we don't have a CRDB test double that I know of.
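The suggestion above could look roughly like the following sketch. The `MarkFailedError` enum and the handler are hypothetical illustrations, not Omicron's actual API: because the caller matches without a catch-all `_` arm, adding a new variant to the enum becomes a compile error at the call site, forcing the undo action to decide how to handle the new failure mode.

```rust
// Hypothetical sketch of the error-enum approach; variant and function
// names are illustrative assumptions, not real Omicron identifiers.

#[derive(Debug)]
enum MarkFailedError {
    /// The database query itself failed.
    QueryFailed(String),
    /// The VMM record's generation had already advanced past what we read.
    UpdateTooOld,
}

/// The undo action matches exhaustively (no `_` arm), so a new
/// `MarkFailedError` variant breaks this match at compile time.
fn handle_mark_failed(result: Result<(), MarkFailedError>) -> &'static str {
    match result {
        Ok(()) => "vmm marked failed",
        Err(MarkFailedError::UpdateTooOld) => "already moved on; keep unwinding",
        Err(MarkFailedError::QueryFailed(_)) => "db error; stop unwinding",
    }
}

fn main() {
    println!("{}", handle_mark_failed(Err(MarkFailedError::UpdateTooOld)));
}
```

This trades a little ceremony for the property that a future change to `mark_vmm_failed`'s failure modes cannot silently slip past its callers.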

@hawkw (Member Author):

Hm, actually, the write to CRDB in mark_vmm_failed is done by a call to vmm_update_runtime, which returns Ok(false) if the VMM exists but wasn't updated (e.g. if the generation has advanced). So, we don't even hit the error path (which gets ignored anyway) in mark_vmm_failed in that case...

@hawkw (Member Author):

Which, now that I look at it, indicates a tiny possible inefficiency in mark_vmm_failed --- currently, we try to run an update saga even if the VMM did not move to failed (because it was already failed/destroyed), which means we will probably start spurious update sagas there. I'll clean that up.

@hawkw (Member Author):

I think I'm probably fine with having mark_vmm_failed stick with the convention of returning Ok(false) if the generation has changed, instead of an error variant, since it seems like we generally follow that convention for similar code. On one hand, I do have a sort of ideological preference for an Ok(_) to always mean "yes, we actually wrote the desired update to CRDB", but on the other hand, a lot of saga undo actions probably just ? these functions, and would probably prefer to get an Ok in any case that doesn't mean they should unwind. I dunno.

@hawkw (Member Author):

`mark_vmm_failed` actually has the type signature `async fn(...) -> Result<(), Error>`, but there are currently no conditions under which it will ever return an error. That's cool.

I might just change the type signature to not return `Result`.

@hawkw (Member Author):

Upon further perusal of sis_ensure_registered_undo, I noticed that it was actually handling sled-agent unregister errors in a pretty outdated way: it was treating vmm_gone errors as a probable failure, and just ignoring any other errors communicating with the sled-agent, as described in this comment:

```rust
// If the failure came from talking to sled agent, and the error code
// indicates the instance or sled might be unhealthy, manual
// intervention is likely to be needed, so try to mark the instance as
// Failed and then bail on unwinding.
//
// If sled agent is in good shape but just doesn't know about the
// instance, this saga still owns the instance's state, so allow
// unwinding to continue.
//
// If some other Nexus error occurred, this saga is in bad shape, so
// return an error indicating that intervention is needed without trying
// to modify the instance further.
//
// TODO(#3238): `instance_unhealthy` does not take an especially nuanced
// view of the meanings of the error codes sled agent could return, so
// assuming that an error that isn't `instance_unhealthy` means
// that everything is hunky-dory and it's OK to continue unwinding may
// be a bit of a stretch. See the definition of `instance_unhealthy` for
// more details.
```

This predates the changes to vmm_gone/instance_unhealthy semantics from RFD 486/#6455, and isn't quite correct. I've changed it to make the "other error from sled agent" be the case that fails the saga, although we should probably retry communication errors eventually.

The changes to mark_vmm_failed and sis_ensure_registered_undo are in a923f2a. Arguably, it's not really in scope for this PR at that point; I'd be happy to cherry-pick it out to a separate branch if you think that's worth doing?

@hawkw (Member Author) commented Oct 8, 2024

@gjcolombo I've changed this so that we now also tear down migration target VMMs (see d6a2bd4).

In the process, I also changed the code for terminating a VMM to also synchronously attempt to run an update saga to completion (e841bd0). This came up because I noticed that, when force-terminating a Migrating instance, the returned instance will still appear to be Migrating even though both VMMs have been destroyed, because it still has a migration ID set until an update saga removes it. We had decided to do that to stop migrating VMMs from briefly appearing to be Stopped when a migration completes and an active VMM is destroyed, but the instance hasn't yet been updated to reflect that it is now running on the target. I think this is still correct behavior, but it seemed kinda sad to "force terminate" an instance and see that the force-terminate succeeded but it's still Migrating, so I made the force-terminate method run the whole saga.

@hawkw hawkw requested a review from gjcolombo October 8, 2024 21:58
Resolved review threads: nexus/src/app/instance.rs (two threads)

@hawkw hawkw requested a review from gjcolombo October 9, 2024 16:41
Comment on lines +6066 to +6073:

```rust
// Okay, what if the instance is already gone?
let instance = dbg!(
    instance_post(&client, &already_gone_name, InstanceOp::ForceTerminate)
        .await
);
// This time, the instance will go to `Failed` rather than `Stopped` since
// sled-agent is no longer aware of it.
assert_eq!(instance.runtime.run_state, InstanceState::Failed);
```
Member:

fwiw this initially seemed concerning ("if you force-stop an instance in the right state, now you have a failed instance?"), but i think i've figured out the preconditions by reading more carefully: this is specifically if you force-stop an instance where Nexus believes a sled-agent is responsible for the instance, but that sled-agent doesn't know it exists, and that already is a sign that something has gone wrong. so it's an instance that probably should be failed anyway, we just hadn't figured that out yet?

if it helps on wording, i think "already gone" is what primed me to think this might be applicable in some kind of race around stopping.

that said, i think there's a race where if you force-stop an instance alongside a call to instance_stop you can end up with an instance in Failed where auto-restart would just turn it back on again? specifically if instance_stop has reached out to sled-agent, sled-agent responds that it has stopped the instance, but we haven't actually recorded that state update yet, then force_stop could show up and see Nexus thinks that sled-agent has the VMM while that sled-agent thinks it's gone.

i haven't looked at the instance_stop code much yet so maybe i'm misunderstanding and there's not actually something there..? but if i'm following these bits right, it seems kind of unfortunate that trying too enthusiastically to stop an instance could end up making it auto-restart.

@hawkw (Member Author):

Multiple racing stop attempts shouldn't transition an instance to failed, since the query in mark_vmm_failed that transitions the VMM to failed is conditional on the sequence number of the VMM record remaining the same as when the VMM was read. If the VMM was destroyed between when we read the record initially and when we asked the sled-agent to destroy it, the sequence number will have advanced, and we won't move it to failed. It only moves to failed if it was already nonexistent.

I see your point about auto-restarts, though: perhaps we should just always move a force-stopped instance to Stopped when encountering a sled-agent error that would otherwise move it to Failed, since you're trying to stop it anyway, which presumably means you have other plans for the instance? I dunno. @gjcolombo, what do you think?

@hawkw (Member Author):

It occurs to me that we should probably also bail out early if we see the VMM record is already Destroyed, though, rather than asking the sled-agent to destroy it again...

Member:

ok! i see your point about the non-raciness now - if there's still a Propolis ID for an instance, the force-terminate call might cause Nexus to get to `instance_ensure_unregistered` and in fact ask sled-agent to ensure the Propolis has been terminated, but in that case sled-agent's `ensure_unregistered` would return `VmmUnregisterResponse { updated_runtime: None }` rather than an error that the Propolis that should be unregistered is already gone. to the point of the sled-agent docs, `ensure_unregistered` is idempotent.

so, among other things, the new `Err(InstanceStateChangeError::SledAgent(e)) if e.vmm_gone()` arm in instance.rs shouldn't ever be reached, and that case of marking an instance failed doesn't matter so much.

fully convinced it's not racy now, thanks for bearing with me.
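The idempotency discussed above can be modeled with a minimal sketch. The in-memory `SledAgent` and the simplified `VmmUnregisterResponse` below are stand-ins for the real sled-agent, used only to illustrate the behavior: unregistering a VMM the agent no longer knows about succeeds with `updated_runtime: None` instead of returning an error.

```rust
// Hypothetical, simplified model of sled-agent's idempotent unregister
// behavior; all names here are illustrative, not the real Omicron types.

use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct VmmRuntimeState {
    generation: u64,
}

#[derive(Debug, PartialEq)]
struct VmmUnregisterResponse {
    updated_runtime: Option<VmmRuntimeState>,
}

struct SledAgent {
    vmms: HashMap<u64, VmmRuntimeState>,
}

impl SledAgent {
    /// Idempotent: if the VMM is known, tear it down and report its final
    /// runtime state; if it's already gone, succeed with `None` rather
    /// than failing.
    fn ensure_unregistered(&mut self, propolis_id: u64) -> VmmUnregisterResponse {
        match self.vmms.remove(&propolis_id) {
            Some(mut state) => {
                state.generation += 1;
                VmmUnregisterResponse { updated_runtime: Some(state) }
            }
            None => VmmUnregisterResponse { updated_runtime: None },
        }
    }
}

fn main() {
    let mut agent = SledAgent { vmms: HashMap::new() };
    agent.vmms.insert(7, VmmRuntimeState { generation: 1 });

    let first = agent.ensure_unregistered(7);
    assert!(first.updated_runtime.is_some());

    // Second call: the VMM is already gone, but the call still succeeds.
    let second = agent.ensure_unregistered(7);
    assert_eq!(second, VmmUnregisterResponse { updated_runtime: None });
    println!("second unregister: {:?}", second.updated_runtime);
}
```

The `None` response is what lets Nexus distinguish "sled-agent did the teardown just now" from "sled-agent had already forgotten this VMM", which is exactly the case that drives the move to `Failed` discussed earlier.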

Comment on lines +1173 to +1175:

```rust
/// "stopped" state without transitioning through the "stopping" state.
/// This operation can be used to recover an instance that is not
/// responding to requests to stop issued through the instance stop API.
```
Member:

putting my mildly-uninformed-user hat on, is there something important i could be missing out on by not transitioning through "stopping"? resources that could be leaked (seems unlikely), or internal instance state that could get weird (seems more likely)? from what's here it doesn't seem unreasonable for a user to always /force-terminate on the assumption that it's more like yanking power, and i dunno how much anyone would be disturbed by that. i recognize this is also kind of the ambiguity @gjcolombo was trying to address, sorry :)

putting my Oxide engineer hat on, it feels like any reason to use /force-terminate is a result of a user trying to unwedge themselves from an Oxide bug. so maybe that's the kind of warding this documentation deserves? though i'm still not sure how load-bearing stopping is..

@hawkw (Member Author):

The instance will still transition through the Stopping state if you are querying it in the meantime, which will be visible in e.g. the console. It's just that the force-terminate API endpoint does not return until the instance has advanced to Stopped.

@hawkw hawkw marked this pull request as draft October 9, 2024 20:17
@hawkw (Member Author) commented Oct 9, 2024

Turning this into a draft, as we've recently come to the conclusion that we might be better off not having this kind of API at all: #4004 (comment)

Successfully merging this pull request may close these issues:

- Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state