[nexus] add POST /v1/instances/{instance}/force-terminate
#6795
base: main
Conversation
The sled-agent instance-ensure-unregistered API is idempotent, which is right and proper...but it introduces an issue in the case of a forgotten VMM after a sled-agent restart. The instance-ensure-unregistered call will return `None` because sled-agent doesn't know about that instance, but if this is the first time we have discovered that sled-agent doesn't know about the instance, we will need to move it to `Failed`. This commit fixes that.
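A minimal, self-contained sketch of that decision; the types and names below are illustrative stand-ins, not the actual omicron code:

```rust
/// Illustrative stand-in for the final runtime state a sled-agent reports
/// when it unregisters a VMM it actually knew about.
#[derive(Debug, PartialEq)]
struct ReportedVmmState {
    destroyed: bool,
}

/// What Nexus should do with the VMM record after the unregister call.
#[derive(Debug, PartialEq)]
enum VmmDisposition {
    /// Sled-agent knew about the VMM and reported its final state; record it.
    RecordFinalState(ReportedVmmState),
    /// Sled-agent had never heard of the VMM (e.g. it restarted and forgot
    /// it). Nothing else will ever update this VMM, so move it to `Failed`.
    MarkFailed,
}

/// The idempotent instance-ensure-unregistered call yields `Some(state)` if
/// the sled-agent tore the VMM down, or `None` if it had no record of it.
fn disposition_after_unregister(
    unregister_result: Option<ReportedVmmState>,
) -> VmmDisposition {
    match unregister_result {
        Some(state) => VmmDisposition::RecordFinalState(state),
        None => VmmDisposition::MarkFailed,
    }
}

fn main() {
    // A sled-agent that restarted and forgot the VMM returns `None`, so the
    // VMM is marked Failed rather than left in limbo forever.
    assert_eq!(disposition_after_unregister(None), VmmDisposition::MarkFailed);
    assert_eq!(
        disposition_after_unregister(Some(ReportedVmmState { destroyed: true })),
        VmmDisposition::RecordFinalState(ReportedVmmState { destroyed: true }),
    );
}
```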
```rust
let unregister_result =
    self.instance_ensure_unregistered(&propolis_id, &sled_id).await;
```
Does the start saga clean up properly if this happens between its `ensure_registered` and `ensure_running` steps? I think it works out: `sis_ensure_running` will fail; `sis_ensure_registered_undo` will also fail and try to move the VMM to `Failed`, but this doesn't block the saga from continuing to unwind; then I think we'll end up in `SagaUnwound` and end up with a VM that can be started again. Does that sound about right? If so, it might be worthwhile to add a comment mentioning this case.
I think so, but I would like to figure out whether it's possible to test this...
Okay, upon further inspection, I believe your assessment is correct and this is fine.

I think that `sis_ensure_registered_undo` failing to move the VMM to `Failed` will get the saga stuck:
omicron/nexus/src/app/sagas/instance_start.rs
Lines 692 to 711 in 2dcf896
```rust
match e {
    InstanceStateChangeError::SledAgent(inner) if inner.vmm_gone() => {
        error!(osagactx.log(),
            "start saga: failing instance after unregister failure";
            "instance_id" => %instance_id,
            "start_reason" => ?params.reason,
            "error" => ?inner);
        if let Err(set_failed_error) = osagactx
            .nexus()
            .mark_vmm_failed(&opctx, authz_instance, &db_vmm, &inner)
            .await
        {
            error!(osagactx.log(),
                "start saga: failed to mark instance as failed";
                "instance_id" => %instance_id,
                "start_reason" => ?params.reason,
                "error" => ?set_failed_error);
            Err(set_failed_error.into())
```
However, `mark_vmm_failed` won't actually fail here if the VMM's state generation is stale, it will just scream about it:
omicron/nexus/src/app/instance.rs
Lines 1465 to 1471 in 2dcf896
```rust
// XXX: It's not clear what to do with this error; should it be
// bubbled back up to the caller?
Err(e) => error!(self.log,
    "failed to write Failed instance state to DB";
    "instance_id" => %instance_id,
    "vmm_id" => %vmm_id,
    "error" => ?e),
```
So, I think we're in the clear here. But, I feel a bit uncomfortable about this, because it seems like a change to the `mark_vmm_failed` error path to actually return an error could introduce a bug here, and I'm not immediately sure if there's an obvious way to write a regression test that would fail on such a change. Any ideas?
Can we do something with error types instead? That is: have `mark_vmm_failed` return its own error enum with "DB query failed" and "update too old" variants, and have `sis_ensure_registered_undo` match on specific error codes from this callee, such that a new failure mode will break the match. WDYT? It'd be nice to have an integration test, too, but I'm similarly having trouble figuring out how to inject a failure into this specific call, since we don't have a CRDB test double that I know of.
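A rough sketch of what that could look like; `MarkVmmFailedError` and the handler below are made-up names for illustration, not existing omicron APIs:

```rust
/// Hypothetical error type `mark_vmm_failed` could return instead of a
/// generic `Error`, so callers have to acknowledge each failure mode.
#[derive(Debug)]
enum MarkVmmFailedError {
    /// The database query itself failed.
    DbQueryFailed(String),
    /// The VMM record's generation had already advanced, so the conditional
    /// update wrote nothing.
    UpdateTooOld,
}

/// Sketch of how the undo step could match exhaustively: adding a new
/// variant to the enum later would break this match at compile time.
fn handle_mark_failed_result(
    result: Result<(), MarkVmmFailedError>,
) -> Result<(), String> {
    match result {
        Ok(()) => Ok(()),
        // The record was already newer; safe to keep unwinding.
        Err(MarkVmmFailedError::UpdateTooOld) => Ok(()),
        // A real DB failure means the saga should stop and ask for help.
        Err(MarkVmmFailedError::DbQueryFailed(e)) => Err(e),
    }
}

fn main() {
    assert!(handle_mark_failed_result(Ok(())).is_ok());
    assert!(handle_mark_failed_result(Err(MarkVmmFailedError::UpdateTooOld)).is_ok());
    assert!(handle_mark_failed_result(Err(MarkVmmFailedError::DbQueryFailed(
        "connection reset".into()
    )))
    .is_err());
}
```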
Hm, actually, the write to CRDB in `mark_vmm_failed` is done by a call to `vmm_update_runtime`, which returns `Ok(false)` if the VMM exists but wasn't updated (e.g. if the generation has advanced). So, we don't even hit the error path (which gets ignored anyway) in `mark_vmm_failed` in that case...
Which, now that I look at it, indicates a tiny possible inefficiency in `mark_vmm_failed`: currently, we try to run an update saga even if the VMM did not move to failed (because it was already failed/destroyed), which means we will probably start spurious update sagas there. I'll clean that up.
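Something like this sketch, with stand-in functions rather than the real `vmm_update_runtime` and update-saga machinery:

```rust
/// Stand-in for `vmm_update_runtime`: `Ok(true)` means the row was actually
/// rewritten, `Ok(false)` means the generation had already advanced and the
/// conditional update wrote nothing.
fn vmm_update_runtime_stub(current_gen: u64, seen_gen: u64) -> Result<bool, String> {
    Ok(current_gen == seen_gen)
}

/// Sketch of the cleanup: only kick off an update saga when we actually
/// moved the VMM to Failed. If the record was already newer (e.g. already
/// Destroyed), some other actor owns the follow-up work.
fn mark_failed_and_maybe_update(current_gen: u64, seen_gen: u64) -> Result<bool, String> {
    let updated = vmm_update_runtime_stub(current_gen, seen_gen)?;
    if updated {
        // (the real code would start an instance-update saga here)
    }
    Ok(updated)
}

fn main() {
    // Generation unchanged: we won the conditional update, run the saga.
    assert_eq!(mark_failed_and_maybe_update(4, 4), Ok(true));
    // Generation advanced underneath us: no write, no spurious update saga.
    assert_eq!(mark_failed_and_maybe_update(5, 4), Ok(false));
}
```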
I think I'm probably fine with having `mark_vmm_failed` stick with the convention of returning `Ok(false)` if the generation has changed, instead of an error variant, since it seems like we generally follow that convention for similar code. On one hand, I do have a sort of ideological preference for an `Ok(_)` to always mean "yes, we actually wrote the desired update to CRDB", but on the other hand, a lot of saga undo actions probably just `?` these functions, and would probably prefer to get an `Ok` in any case that doesn't mean they should unwind. I dunno.
`mark_vmm_failed` actually has the type signature `async fn(...) -> Result<(), Error>`, but there are no conditions under which it will currently return an error. That's cool. I might just change the type signature to not return `Result`.
Upon further perusal of `sis_ensure_registered_undo`, I noticed that it was actually handling sled-agent unregister errors in a pretty outdated way: it was treating `vmm_gone` errors as a probable failure, and just ignoring any other errors communicating with the sled-agent, as described in this comment:
omicron/nexus/src/app/sagas/instance_start.rs
Lines 673 to 691 in 0640bb2
```rust
// If the failure came from talking to sled agent, and the error code
// indicates the instance or sled might be unhealthy, manual
// intervention is likely to be needed, so try to mark the instance as
// Failed and then bail on unwinding.
//
// If sled agent is in good shape but just doesn't know about the
// instance, this saga still owns the instance's state, so allow
// unwinding to continue.
//
// If some other Nexus error occurred, this saga is in bad shape, so
// return an error indicating that intervention is needed without trying
// to modify the instance further.
//
// TODO(#3238): `instance_unhealthy` does not take an especially nuanced
// view of the meanings of the error codes sled agent could return, so
// assuming that an error that isn't `instance_unhealthy` means
// that everything is hunky-dory and it's OK to continue unwinding may
// be a bit of a stretch. See the definition of `instance_unhealthy` for
// more details.
```
This predates the changes to `vmm_gone`/`instance_unhealthy` semantics from RFD 486/#6455, and isn't quite correct. I've changed it to make the "other error from sled agent" be the case that fails the saga, although we should probably retry communication errors eventually.
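Roughly, the intent is something like the following sketch; the types and names here are hypothetical stand-ins, not the actual saga code:

```rust
/// Hypothetical stand-ins for the error shapes involved; the real omicron
/// types differ, this only illustrates the revised decision table.
#[derive(Debug)]
enum UnregisterError {
    /// Sled-agent answered, but has no record of the VMM.
    VmmGone,
    /// Any other sled-agent failure (communication error, unhealthy sled...).
    Other(String),
}

/// What the undo step does after attempting to unregister the VMM.
#[derive(Debug, PartialEq)]
enum UndoOutcome {
    /// Keep unwinding; the saga still owns the instance's state.
    ContinueUnwinding,
    /// Mark the VMM Failed, then keep unwinding.
    MarkFailedAndContinue,
    /// Stop the saga: manual intervention (or, eventually, a retry) is needed.
    FailSaga,
}

fn undo_after_unregister(result: Result<(), UnregisterError>) -> UndoOutcome {
    match result {
        Ok(()) => UndoOutcome::ContinueUnwinding,
        // The VMM is gone from the sled's perspective; it will never be
        // updated again, so record that and keep unwinding.
        Err(UnregisterError::VmmGone) => UndoOutcome::MarkFailedAndContinue,
        // Previously ignored; after this change it fails the saga.
        Err(UnregisterError::Other(_)) => UndoOutcome::FailSaga,
    }
}

fn main() {
    assert_eq!(undo_after_unregister(Ok(())), UndoOutcome::ContinueUnwinding);
    assert_eq!(
        undo_after_unregister(Err(UnregisterError::VmmGone)),
        UndoOutcome::MarkFailedAndContinue
    );
    assert_eq!(
        undo_after_unregister(Err(UnregisterError::Other("timeout".into()))),
        UndoOutcome::FailSaga
    );
}
```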
The changes to `mark_vmm_failed` and `sis_ensure_registered_undo` are in a923f2a. Arguably, it's not really in scope for this PR at that point; I'd be happy to cherry-pick it out to a separate branch if you think that's worth doing?
@gjcolombo I've changed this so that we now also tear down migration target VMMs (see d6a2bd4). In the process, I also changed the code for terminating a VMM to synchronously attempt to run an update saga to completion (e841bd0). This came up because I noticed that, when force-terminating a …
```rust
// Okay, what if the instance is already gone?
let instance = dbg!(
    instance_post(&client, &already_gone_name, InstanceOp::ForceTerminate)
        .await
);
// This time, the instance will go to `Failed` rather than `Stopped` since
// sled-agent is no longer aware of it.
assert_eq!(instance.runtime.run_state, InstanceState::Failed);
```
fwiw this initially seemed concerning ("if you force-stop an instance in the right state, now you have a failed instance?"), but i think i've figured out the preconditions by reading more carefully: this is specifically if you force-stop an instance where Nexus believes a sled-agent is responsible for the instance, but that sled-agent doesn't know it exists, and that already is a sign that something has gone wrong. so it's an instance that probably should be failed anyway, we just hadn't figured that out yet?
if it helps on wording, i think "already gone" is what primed me to think this might be applicable in some kind of race around stopping.
that said, i think there's a race where if you force-stop an instance alongside a call to `instance_stop` you can end up with an instance in Failed where auto-restart would just turn it back on again? specifically if `instance_stop` has reached out to sled-agent, sled-agent responds that it has stopped the instance, but we haven't actually recorded that state update yet, then `force_stop` could show up and see Nexus thinks that sled-agent has the VMM while that sled-agent thinks it's gone.

i haven't looked at the `instance_stop` code much yet so maybe i'm misunderstanding and there's not actually something there..? but if i'm following these bits right, it seems kind of unfortunate that trying too enthusiastically to stop an instance could end up making it auto-restart.
Multiple racing stop attempts shouldn't transition an instance to failed, since the query in `mark_vmm_failed` that transitions the VMM to failed is conditional on the sequence number of the VMM record remaining the same as when the VMM was read. If the VMM was destroyed between when we read the record initially and when we asked the sled-agent to destroy it, the sequence number will have advanced, and we won't move it to failed. It only moves to failed if it was already nonexistent.
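As a toy model of that guard (the real thing is a conditional update against the VMM record in CRDB, not in-memory state like this):

```rust
/// Toy VMM record: the real one lives in CockroachDB and the guard is a
/// conditional UPDATE, but the effect is the same.
#[derive(Debug, Clone, PartialEq)]
struct VmmRow {
    state: &'static str,
    generation: u64,
}

/// Move the VMM to Failed only if its generation is still the one the
/// caller originally read. Returns `true` if the row was actually updated.
fn mark_failed_if_unchanged(row: &mut VmmRow, seen_generation: u64) -> bool {
    if row.generation != seen_generation {
        // Someone else (e.g. a racing stop that heard back from sled-agent)
        // already advanced the record; leave it alone.
        return false;
    }
    row.state = "failed";
    row.generation += 1;
    true
}

fn main() {
    let mut row = VmmRow { state: "running", generation: 7 };

    // Both a normal stop and a force-terminate read the row at generation 7.
    let seen_by_force_terminate = row.generation;

    // The ordinary stop wins the race and records Destroyed first.
    row.state = "destroyed";
    row.generation += 1;

    // The force-terminate path then tries to mark the VMM Failed, but its
    // view of the generation is stale, so the conditional update is a no-op.
    assert!(!mark_failed_if_unchanged(&mut row, seen_by_force_terminate));
    assert_eq!(row.state, "destroyed");
}
```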
I see your point about auto-restarts, though: perhaps we should just always move a force-stopped instance to `Stopped` when encountering a sled-agent error that would otherwise move it to `Failed`, since you're trying to stop it anyway, which presumably means you have other plans for the instance? I dunno. @gjcolombo, what do you think?
It occurs to me that we should probably also bail out early if we see the VMM record is already `Destroyed`, though, rather than asking the sled-agent to destroy it again...
ok! i see your point about the non-raciness now - if there's still a Propolis ID for an instance, the `force-terminate` call might cause Nexus to get to `instance_ensure_unregistered` and in fact ask sled-agent to ensure the Propolis has been terminated, but in that case sled-agent's `ensure_unregistered` would return `VmmUnregisterResponse { updated_runtime: None }` rather than an error that the Propolis that should be unregistered is already gone. to the point of the sled-agent docs, `ensure_unregistered` is idempotent.

so, among other things, the new `Err(InstanceStateChangeError::SledAgent(e)) if e.vmm_gone()` arm in instance.rs shouldn't ever be reached, and that case of marking an instance failed doesn't matter so much.
fully convinced it's not racy now, thanks for bearing with me.
/// "stopped" state without transitioning through the "stopping" state. | ||
/// This operation can be used to recover an instance that is not | ||
/// responding to requests to stop issued through the instance stop API. |
putting my mildly-uninformed-user hat on, is there something important i could be missing out on by not transitioning through "stopping"? resources that could be leaked (seems unlikely), or internal instance state that could get weird (seems more likely)? from what's here it doesn't seem unreasonable for a user to always `/force-terminate` on the assumption that it's more like yanking power, and i dunno how much anyone would be disturbed by that. i recognize this is also kind of the ambiguity @gjcolombo was trying to address, sorry :)

putting my Oxide engineer hat on, it feels like any reason to use `/force-terminate` is a result of a user trying to unwedge themselves from an Oxide bug. so maybe that's the kind of warding this documentation deserves? though i'm still not sure how load-bearing `stopping` is..
The instance will still transition through the `Stopping` state if you are querying it in the meantime, which will be visible in e.g. the console. It's just that the `force-terminate` API endpoint does not return until the instance has advanced to `Stopped`.
Turning this into a draft, as we've recently come to the conclusion that we might be better off not having this kind of API at all: #4004 (comment)
Fixes #4004
Today, requests to stop a running instance must by necessity involve the instance's active Propolis (sled agent sends the stop request to Propolis; the instance is stopped when Propolis says so, at which point sled agent cleans up the Propolis zone and all related objects). If an instance's Propolis is not responding, or there is no active Propolis, there is no obvious way to clean up the instance and recover.

In order to allow resolving otherwise stuck instances, this commit introduces a new API endpoint, `POST /v1/instances/{instance}/force-terminate`, which calls directly into the sled-agent's `instance-ensure-unregistered` API, telling it to rudely terminate the VMM and destroy the Propolis zone, without waiting for it to shut down politely.

The one complex-ish bit is the issue I fixed in commit 34c3058 around what this API does in the case where the sled-agent has forgotten an instance due to an unexpected restart. The sled-agent `instance-ensure-unregistered` API is idempotent, which is right and proper...but it introduces an issue in the case of a forgotten VMM after a sled-agent restart. The `instance-ensure-unregistered` call will return `None` because sled-agent doesn't know about that instance, but if this is the first time we have discovered that sled-agent doesn't know about the instance, we will need to move it to `Failed`. This is okay to do, because the VMM generation number guards against the case where we have raced with another instance-force-terminate call. We will only move the VMM to `Failed` in that case if no one else has moved it to `Destroyed` as the result of a successful `instance-ensure-unregistered` in the interim.