[nexus] add POST /v1/instances/{instance}/force-terminate #6795

Draft · wants to merge 13 commits into base: main

Conversation

@hawkw (Member) commented Oct 7, 2024

Fixes #4004

Today, requests to stop a running instance must by necessity involve the
instance's active Propolis (sled agent sends the stop request to
Propolis; the instance is stopped when Propolis says so, at which point
sled agent cleans up the Propolis zone and all related objects). If an
instance's Propolis is not responding, or there is no active Propolis,
there is no obvious way to clean up the instance and recover.

In order to allow resolving otherwise stuck instances, this commit
introduces a new API endpoint, POST /v1/instances/{instance}/force-terminate, which calls directly into the
sled-agent's instance-ensure-unregistered API, telling it to rudely
terminate the VMM and destroy the Propolis zone, without waiting for it
to shut down politely.

The one complex-ish bit is the issue I fixed in commit 34c3058 around
what this API does in the case where the sled-agent has forgotten an
instance due to an unexpected restart. The sled-agent
instance-ensure-unregistered API is idempotent, which is right and
proper...but it introduces an issue in the case of a forgotten VMM after
a sled-agent restart. The instance-ensure-unregistered call will
return None because sled-agent doesn't know about that instance, but
if this is the first time we have discovered that sled-agent doesn't
know about the instance, we will need to move it to Failed. This is
okay to do, because the VMM generation number guards against the case
where we have raced with another instance-force-terminate call. We
will only move the VMM to Failed in that case if no one else has moved
it to Destroyed as the result of a successful
instance-ensure-unregistered in the interim.
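The generation-number guard described above can be sketched in isolation. This is a hypothetical, simplified model — the `Vmm` struct and `try_mark_failed` function below are illustrative stand-ins, not Omicron's actual types — showing how the `Failed` transition only applies if the record's generation still matches what the caller observed:

```rust
// Hypothetical sketch of a generation-guarded state transition; the names
// `Vmm`, `VmmState`, and `try_mark_failed` are illustrative, not Omicron APIs.

#[derive(Debug, Clone, Copy, PartialEq)]
enum VmmState {
    Running,
    Destroyed,
    Failed,
}

#[derive(Debug)]
struct Vmm {
    state: VmmState,
    generation: u64,
}

/// Conditionally move the VMM to `Failed`, but only if its generation still
/// matches what the caller observed when it read the record. Returns `true`
/// if the update was applied, mirroring the "Ok(false) when the record has
/// moved on" convention discussed later in this thread.
fn try_mark_failed(vmm: &mut Vmm, observed_generation: u64) -> bool {
    if vmm.generation != observed_generation {
        // Someone else (e.g. a successful instance-ensure-unregistered)
        // already advanced the record; leave their state in place.
        return false;
    }
    vmm.state = VmmState::Failed;
    vmm.generation += 1;
    true
}

fn main() {
    let mut vmm = Vmm { state: VmmState::Running, generation: 1 };
    let observed = vmm.generation;

    // A racing caller destroys the VMM first, bumping the generation.
    vmm.state = VmmState::Destroyed;
    vmm.generation += 1;

    // Our stale attempt to mark it Failed is now a no-op.
    assert!(!try_mark_failed(&mut vmm, observed));
    assert_eq!(vmm.state, VmmState::Destroyed);
    println!("state after race: {:?}", vmm.state);
}
```

In the real system the conditional update happens in a CockroachDB query rather than in memory, but the compare-against-observed-generation shape is the same.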

Resolved review threads: nexus/external-api/src/lib.rs, nexus/src/app/instance.rs (two threads)
Comment on lines +814 to +815:

```rust
let unregister_result =
    self.instance_ensure_unregistered(&propolis_id, &sled_id).await;
```
Contributor:

Does the start saga clean up properly if this happens between its ensure_registered and ensure_running steps? I think it works out: sis_ensure_running will fail; sis_ensure_registered_undo will also fail and try to move the VMM to Failed, but this doesn't block the saga from continuing to unwind; then I think we'll end up in SagaUnwound and end up with a VM that can be started again. Does that sound about right? If so, it might be worthwhile to add a comment mentioning this case.

@hawkw (Member Author):

I think so, but I would like to figure out whether it's possible to test this...

@hawkw (Member Author):

Okay, upon further inspection, I believe your assessment is correct and this is fine.

At first glance, it looks like `sis_ensure_registered_undo` failing to move the VMM to `Failed` would get the saga stuck:

```rust
match e {
    InstanceStateChangeError::SledAgent(inner) if inner.vmm_gone() => {
        error!(osagactx.log(),
            "start saga: failing instance after unregister failure";
            "instance_id" => %instance_id,
            "start_reason" => ?params.reason,
            "error" => ?inner);
        if let Err(set_failed_error) = osagactx
            .nexus()
            .mark_vmm_failed(&opctx, authz_instance, &db_vmm, &inner)
            .await
        {
            error!(osagactx.log(),
                "start saga: failed to mark instance as failed";
                "instance_id" => %instance_id,
                "start_reason" => ?params.reason,
                "error" => ?set_failed_error);
            Err(set_failed_error.into())
            // ...
```

However, `mark_vmm_failed` won't actually fail here if the VMM's state generation is stale; it will just scream about it:

```rust
// XXX: It's not clear what to do with this error; should it be
// bubbled back up to the caller?
Err(e) => error!(self.log,
    "failed to write Failed instance state to DB";
    "instance_id" => %instance_id,
    "vmm_id" => %vmm_id,
    "error" => ?e),
```

So, I think we're in the clear here. But, I feel a bit uncomfortable about this, because it seems like a change to the mark_vmm_failed error path to actually return an error could introduce a bug here, and I'm not immediately sure if there's an obvious way to write a regression test that would fail on such a change. Any ideas?

Contributor:

Can we do something with error types instead? That is: have mark_vmm_failed return its own error enum with "DB query failed" and "update too old" variants, and have sis_ensure_registered_undo match on specific error codes from this callee, such that a new failure mode will break the match. WDYT? It'd be nice to have an integration test, too, but I'm similarly having trouble figuring out how to inject a failure into this specific call, since we don't have a CRDB test double that I know of.
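The suggestion above could look roughly like the following sketch. The `MarkFailedError` enum and the handler are hypothetical illustrations, not Omicron's actual API: because the caller matches without a catch-all `_` arm, adding a new variant to the enum becomes a compile error at the call site, forcing the undo action to decide how to handle the new failure mode.

```rust
// Hypothetical sketch of the error-enum approach; variant and function
// names are illustrative assumptions, not real Omicron identifiers.

#[derive(Debug)]
enum MarkFailedError {
    /// The database query itself failed.
    QueryFailed(String),
    /// The VMM record's generation had already advanced past what we read.
    UpdateTooOld,
}

/// The undo action matches exhaustively (no `_` arm), so a new
/// `MarkFailedError` variant breaks this match at compile time.
fn handle_mark_failed(result: Result<(), MarkFailedError>) -> &'static str {
    match result {
        Ok(()) => "vmm marked failed",
        Err(MarkFailedError::UpdateTooOld) => "already moved on; keep unwinding",
        Err(MarkFailedError::QueryFailed(_)) => "db error; stop unwinding",
    }
}

fn main() {
    println!("{}", handle_mark_failed(Err(MarkFailedError::UpdateTooOld)));
}
```

This trades a little ceremony for the property that a future change to `mark_vmm_failed`'s failure modes cannot silently slip past its callers.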

@hawkw (Member Author):

Hm, actually, the write to CRDB in mark_vmm_failed is done by a call to vmm_update_runtime, which returns Ok(false) if the VMM exists but wasn't updated (e.g. if the generation has advanced). So, we don't even hit the error path (which gets ignored anyway) in mark_vmm_failed in that case...

@hawkw (Member Author):

Which, now that I look at it, indicates a tiny possible inefficiency in mark_vmm_failed --- currently, we try to run an update saga even if the VMM did not move to failed (because it was already failed/destroyed), which means we will probably start spurious update sagas there. I'll clean that up.

@hawkw (Member Author):

I think I'm probably fine with having mark_vmm_failed stick with the convention of returning Ok(false) if the generation has changed, instead of an error variant, since it seems like we generally follow that convention for similar code. On one hand, I do have a sort of ideological preference for an Ok(_) to always mean "yes, we actually wrote the desired update to CRDB", but on the other hand, a lot of saga undo actions probably just ? these functions, and would probably prefer to get an Ok in any case that doesn't mean they should unwind. I dunno.

@hawkw (Member Author):

`mark_vmm_failed` actually has the type signature `async fn(...) -> Result<(), Error>`, but there are currently no conditions under which it will ever return an error. That's cool.

I might just change the type signature to not return `Result`.

@hawkw (Member Author):

Upon further perusal of sis_ensure_registered_undo, I noticed that it was actually handling sled-agent unregister errors in a pretty outdated way: it was treating vmm_gone errors as a probable failure, and just ignoring any other errors communicating with the sled-agent, as described in this comment:

```rust
// If the failure came from talking to sled agent, and the error code
// indicates the instance or sled might be unhealthy, manual
// intervention is likely to be needed, so try to mark the instance as
// Failed and then bail on unwinding.
//
// If sled agent is in good shape but just doesn't know about the
// instance, this saga still owns the instance's state, so allow
// unwinding to continue.
//
// If some other Nexus error occurred, this saga is in bad shape, so
// return an error indicating that intervention is needed without trying
// to modify the instance further.
//
// TODO(#3238): `instance_unhealthy` does not take an especially nuanced
// view of the meanings of the error codes sled agent could return, so
// assuming that an error that isn't `instance_unhealthy` means
// that everything is hunky-dory and it's OK to continue unwinding may
// be a bit of a stretch. See the definition of `instance_unhealthy` for
// more details.
```

This predates the changes to vmm_gone/instance_unhealthy semantics from RFD 486/#6455, and isn't quite correct. I've changed it to make the "other error from sled agent" be the case that fails the saga, although we should probably retry communication errors eventually.

The changes to mark_vmm_failed and sis_ensure_registered_undo are in a923f2a. Arguably, it's not really in scope for this PR at that point; I'd be happy to cherry-pick it out to a separate branch if you think that's worth doing?

@hawkw (Member Author) commented Oct 8, 2024

@gjcolombo I've changed this so that we now also tear down migration target VMMs (see d6a2bd4).

In the process, I also changed the code for terminating a VMM to also synchronously attempt to run an update saga to completion (e841bd0). This came up because I noticed that, when force-terminating a Migrating instance, the returned instance will still appear to be Migrating even though both VMMs have been destroyed, because it still has a migration ID set until an update saga removes it. We had decided to do that to stop migrating VMMs from briefly appearing to be Stopped when a migration completes and an active VMM is destroyed, but the instance hasn't yet been updated to reflect that it is now running on the target. I think this is still correct behavior, but it seemed kinda sad to "force terminate" an instance and see that the force-terminate succeeded but it's still Migrating, so I made the force-terminate method run the whole saga.

@hawkw hawkw requested a review from gjcolombo October 8, 2024 21:58
Resolved review threads: nexus/src/app/instance.rs (two threads)

@hawkw hawkw requested a review from gjcolombo October 9, 2024 16:41
Comment on lines +6066 to +6073:

```rust
// Okay, what if the instance is already gone?
let instance = dbg!(
    instance_post(&client, &already_gone_name, InstanceOp::ForceTerminate)
        .await
);
// This time, the instance will go to `Failed` rather than `Stopped` since
// sled-agent is no longer aware of it.
assert_eq!(instance.runtime.run_state, InstanceState::Failed);
```
Member:

fwiw this initially seemed concerning ("if you force-stop an instance in the right state, now you have a failed instance?"), but i think i've figured out the preconditions by reading more carefully: this is specifically if you force-stop an instance where Nexus believes a sled-agent is responsible for the instance, but that sled-agent doesn't know it exists, and that already is a sign that something has gone wrong. so it's an instance that probably should be failed anyway, we just hadn't figured that out yet?

if it helps on wording, i think "already gone" is what primed me to think this might be applicable in some kind of race around stopping.

that said, i think there's a race where if you force-stop an instance alongside a call to instance_stop you can end up with an instance in Failed where auto-restart would just turn it back on again? specifically if instance_stop has reached out to sled-agent, sled-agent responds that it has stopped the instance, but we haven't actually recorded that state update yet, then force_stop could show up and see Nexus thinks that sled-agent has the VMM while that sled-agent thinks it's gone.

i haven't looked at the instance_stop code much yet so maybe i'm misunderstanding and there's not actually something there..? but if i'm following these bits right, it seems kind of unfortunate that trying too enthusiastically to stop an instance could end up making it auto-restart.

@hawkw (Member Author):

Multiple racing stop attempts shouldn't transition an instance to failed, since the query in mark_vmm_failed that transitions the VMM to failed is conditional on the sequence number of the VMM record remaining the same as when the VMM was read. If the VMM was destroyed between when we read the record initially and when we asked the sled-agent to destroy it, the sequence number will have advanced, and we won't move it to failed. It only moves to failed if it was already nonexistent.

I see your point about auto-restarts, though: perhaps we should just always move a force-stopped instance to Stopped when encountering a sled-agent error that would otherwise move it to Failed, since you're trying to stop it anyway, which presumably means you have other plans for the instance? I dunno. @gjcolombo, what do you think?

@hawkw (Member Author):

It occurs to me that we should probably also bail out early if we see the VMM record is already Destroyed, though, rather than asking the sled-agent to destroy it again...

Member:

ok! i see your point about the non-raciness now - if there's still a Propolis ID for an instance, the force-terminate call might cause Nexus to get to `instance_ensure_unregistered` and in fact ask sled-agent to ensure the Propolis has been terminated, but in that case sled-agent's `ensure_unregistered` would return `VmmUnregisterResponse { updated_runtime: None }` rather than an error that the Propolis that should be unregistered is already gone. to the point of the sled-agent docs, `ensure_unregistered` is idempotent.

so, among other things, the new `Err(InstanceStateChangeError::SledAgent(e)) if e.vmm_gone()` arm in instance.rs shouldn't ever be reached, and that case of marking an instance failed doesn't matter so much.

fully convinced it's not racy now, thanks for bearing with me.
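The idempotency discussed above can be modeled with a minimal sketch. The in-memory `SledAgent` and the simplified `VmmUnregisterResponse` below are stand-ins for the real sled-agent, used only to illustrate the behavior: unregistering a VMM the agent no longer knows about succeeds with `updated_runtime: None` instead of returning an error.

```rust
// Hypothetical, simplified model of sled-agent's idempotent unregister
// behavior; all names here are illustrative, not the real Omicron types.

use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct VmmRuntimeState {
    generation: u64,
}

#[derive(Debug, PartialEq)]
struct VmmUnregisterResponse {
    updated_runtime: Option<VmmRuntimeState>,
}

struct SledAgent {
    vmms: HashMap<u64, VmmRuntimeState>,
}

impl SledAgent {
    /// Idempotent: if the VMM is known, tear it down and report its final
    /// runtime state; if it's already gone, succeed with `None` rather
    /// than failing.
    fn ensure_unregistered(&mut self, propolis_id: u64) -> VmmUnregisterResponse {
        match self.vmms.remove(&propolis_id) {
            Some(mut state) => {
                state.generation += 1;
                VmmUnregisterResponse { updated_runtime: Some(state) }
            }
            None => VmmUnregisterResponse { updated_runtime: None },
        }
    }
}

fn main() {
    let mut agent = SledAgent { vmms: HashMap::new() };
    agent.vmms.insert(7, VmmRuntimeState { generation: 1 });

    let first = agent.ensure_unregistered(7);
    assert!(first.updated_runtime.is_some());

    // Second call: the VMM is already gone, but the call still succeeds.
    let second = agent.ensure_unregistered(7);
    assert_eq!(second, VmmUnregisterResponse { updated_runtime: None });
    println!("second unregister: {:?}", second.updated_runtime);
}
```

The `None` response is what lets Nexus distinguish "sled-agent did the teardown just now" from "sled-agent had already forgotten this VMM", which is exactly the case that drives the move to `Failed` discussed earlier.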

Comment on lines +1173 to +1175:

```rust
/// "stopped" state without transitioning through the "stopping" state.
/// This operation can be used to recover an instance that is not
/// responding to requests to stop issued through the instance stop API.
```
Member:

putting my mildly-uninformed-user hat on, is there something important i could be missing out on by not transitioning through "stopping"? resources that could be leaked (seems unlikely), or internal instance state that could get weird (seems more likely)? from what's here it doesn't seem unreasonable for a user to always /force-terminate on the assumption that it's more like yanking power, and i dunno how much anyone would be disturbed by that. i recognize this is also kind of the ambiguity @gjcolombo was trying to address, sorry :)

putting my Oxide engineer hat on, it feels like any reason to use /force-terminate is a result of a user trying to unwedge themselves from an Oxide bug. so maybe that's the kind of warding this documentation deserves? though i'm still not sure how load-bearing stopping is..

@hawkw (Member Author):

The instance will still transition through the Stopping state if you are querying it in the meantime, which will be visible in e.g. the console. It's just that the force-terminate API endpoint does not return until the instance has advanced to Stopped.

@hawkw hawkw marked this pull request as draft October 9, 2024 20:17
@hawkw (Member Author) commented Oct 9, 2024

Turning this into a draft, as we've recently come to the conclusion that we might be better off not having this kind of API at all: #4004 (comment)

Successfully merging this pull request may close these issues:

- Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state