Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed instances should be allowed to stop and restart #2825

Closed
gjcolombo opened this issue Apr 12, 2023 · 2 comments · Fixed by #6652
Closed

Failed instances should be allowed to stop and restart #2825

gjcolombo opened this issue Apr 12, 2023 · 2 comments · Fixed by #6652
Assignees
Labels
known issue To include in customer documentation and training nexus Related to nexus
Milestone

Comments

@gjcolombo
Copy link
Contributor

Currently Nexus accepts no attempts to change the state of a Failed instance:

fn check_runtime_change_allowed(
&self,
runtime: &nexus::InstanceRuntimeState,
) -> Result<(), Error> {
// Users are allowed to request a start or stop even if the instance is
// already in the desired state (or moving to it), and we will issue a
// request to the SA to make the state change in these cases in case the
// runtime state we saw here was stale. However, users are not allowed
// to change the state of an instance that's migrating, failed or
// destroyed.
let allowed = match runtime.run_state {
InstanceState::Creating => true,
InstanceState::Starting => true,
InstanceState::Running => true,
InstanceState::Stopping => true,
InstanceState::Stopped => true,
InstanceState::Rebooting => true,
InstanceState::Migrating => false,
InstanceState::Repairing => false,
InstanceState::Failed => false,
InstanceState::Destroyed => false,
};

There are plenty of reasons an instance could move to the Failed state (e.g. a failure to start the VM in Propolis, a heartbeat failure like those discussed in #2727, etc.). A VM user needs to be able to stop and attempt to restart a failed instance.

(Note that, on the Propolis end, once an instance has failed, it can't be restarted--the Propolis zone needs to be destroyed and recreated.)

@askfongjojo
Copy link

One of the things I think will be helpful is to avoid automatically setting instance state to Failed when it doesn't fall into the normal lifecycle state transition. This is especially true when a VM instance is wiped during an unintended propolis zone clean-up (e.g. because of a sled-agent crash and restart, or an orderly shutdown of a sled for maintenance before instance migration can take place). This class of scenarios fit with the condition of a stopped instance because the propolis zone is already gone, unlike failed instances that are subject to the condition mentioned above, i.e.:

on the Propolis end, once an instance has failed, it can't be restarted--the Propolis zone needs to be destroyed and recreated

@askfongjojo askfongjojo added the known issue To include in customer documentation and training label Mar 9, 2024
@askfongjojo askfongjojo modified the milestones: MVP, 8 Mar 9, 2024
@hawkw hawkw self-assigned this Mar 28, 2024
@morlandi7 morlandi7 modified the milestones: 8, 9 May 2, 2024
@morlandi7 morlandi7 modified the milestones: 9, 10 Jul 1, 2024
@morlandi7 morlandi7 modified the milestones: 10, 11 Aug 14, 2024
hawkw added a commit that referenced this issue Sep 24, 2024
PR #6503 changed Nexus to attempt to automatically restart instances
which are in the `Failed` state. Now that we do this, we should probably
change the allowable instance state transitions to permit a user to stop
an instance that is `Failed`, as a way to say "stop trying to restart
this instance" (as `Stopped` instances are not restarted). This branch
changes `Nexus::instance_request_state` and
`select_instance_change_action` to permit stopping a `Failed` instance.

Fixes #6640

I believe this also fixes #2825, along with #6455 (which allowed
restarting `Failed` instances).
@hawkw
Copy link
Member

hawkw commented Sep 24, 2024

#6455 allowed failed instances to be restarted. I'm currently working on allowing them to be stopped as well.

hawkw added a commit that referenced this issue Sep 24, 2024
PR #6503 changed Nexus to attempt to automatically restart instances
which are in the `Failed` state. Now that we do this, we should probably
change the allowable instance state transitions to permit a user to stop
an instance that is `Failed`, as a way to say "stop trying to restart
this instance" (as `Stopped` instances are not restarted). This branch
changes `Nexus::instance_request_state` and
`select_instance_change_action` to permit stopping a `Failed` instance.

Fixes #6640

I believe this also fixes #2825, along with #6455 (which allowed
restarting `Failed` instances).
@hawkw hawkw closed this as completed in 0c7fb27 Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
known issue To include in customer documentation and training nexus Related to nexus
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants