-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
should instance-stop
*really* move instances to failed when they are discovered to already be gone?
#6809
Comments
I don't think the instance should ever turn on if i told it to be stopped. If "failed" is useful information for the user in this case, perhaps a "failed but don't auto restart" would be a useful backend state to create. But I'm not sure if it is. What would I do if I told my machine to stop, and it said "failed"? I might think "maybe it failed to stop. is it still running? well I can't get to it on the network, so at least services don't seem to be running, or maybe it tore down networking. ???" Or I might think "well it's not running. but did I break something in the backend I need to worry about?" |
I agree with pretty much all of the above. The piece of information that gets lost by transitioning to Stopped (and not Failed) when a VMM is gone is the unexpectedness of the VMM's absence. The tradeoff is that (by default) a Failed instance is eligible to be restarted automatically, and I agree that this doesn't make very much sense if we know for certain that the user intends for the VM not to be running, which is the case in the context of a Stop API call. It seems way more sensible to drop the "oops it didn't reach the end of the ride" bit than to force a user to do more work to keep an instance from restarting.1 In terms of what would need to change in Nexus to do this: I think some care is required because an instance-stop API call can race with the instance watcher to determine the final state of a missing VMM:
I'd have to think about what to do in this case--does the Stop API call fail (with a 409 or something like that), or does it try to advance the VMM record to Destroyed at generation 12 and then succeed? If the latter, what if the reincarnation task gets to the instance in the meantime and restarts it? Then you have a Stop request that returns a Running instance... It might be simplest just to fail if the stop API was not able to take some meaningful action to drive the instance's state to Stopped. Assuming these practical details sound OK, I'm on board with this conception of the purpose of Footnotes
|
The
PUT /instances/{instance}/stop
API will send a request to sled-agent asking it to terminate a running instance. If sled-agent responds with something indicating that it actually didn't know about that instance in the first place, Nexus will then transition it toFailed
:omicron/nexus/src/app/instance.rs
Lines 755 to 775 in 0640bb2
This will not happen when the reason the instance is gone is because another concurrent attempt to stop it has succeeded, because the racing stop attempt will have advanced the VMM's generation number whilst moving it to
Destroyed
, so we won't mark it asFailed
. However, in the event of a sled-agent crash, we may encounter an already-gone VMM here, and may move it toFailed
.This seems a bit wacky to me, since
Failed
instances (which is what the instance will eventually become as a result of its VMM being markedFailed
) are eligible to be auto-restarted, whileStopped
instances are not --- because the user actually wanted them to be stopped. And, in this case, the user is expressing intent to have an instance stop running, and we just happened to discover that we had already anticipated their desire to stop it and went ahead and stopped it for them before they even asked us to. Admittedly, we weren't supposed to have done that! But, in this case, the requested state is "instance is not running", and it's not running, so it seems a bit unfortunate to go "oh no, i was supposed to make the instance not be running, and when i tried to do that, i discovered that it was not running because we made a mistake, so now i'm actually going to...make it be running again?"Imagine a scenario where a user goes to stop an instance so that they can change its boot disk or something, and while doing so, we discover that sled-agent has crashed and the instance isn't there. Moving it to
Failed
results in the instance being restarted, so now the user has to stop the instance a second time before they can actually do what they were trying to do originally.Maybe we should just always move the instance to stopped when such an error is encountered by an instance-stop attempt. Obviously we would still go to
Failed
when attempting to do other things to the instance.The text was updated successfully, but these errors were encountered: