Failed instances should be allowed to stop and restart #2825

gjcolombo · 2023-04-12T19:38:24Z

Currently Nexus accepts no attempts to change the state of a Failed instance:

Lines 395 to 416 in ee0aac0

    
    fn check_runtime_change_allowed(
        &self,
        runtime: &nexus::InstanceRuntimeState,
    ) -> Result<(), Error> {
        // Users are allowed to request a start or stop even if the instance is
        // already in the desired state (or moving to it), and we will issue a
        // request to the SA to make the state change in these cases in case the
        // runtime state we saw here was stale.  However, users are not allowed
        // to change the state of an instance that's migrating, failed or
        // destroyed.
        let allowed = match runtime.run_state {
            InstanceState::Creating => true,
            InstanceState::Starting => true,
            InstanceState::Running => true,
            InstanceState::Stopping => true,
            InstanceState::Stopped => true,
            InstanceState::Rebooting => true,
            InstanceState::Migrating => false,
            InstanceState::Repairing => false,
            InstanceState::Failed => false,
            InstanceState::Destroyed => false,
        };

There are plenty of reasons an instance could move to the Failed state (e.g. a failure to start the VM in Propolis, a heartbeat failure like those discussed in #2727, etc.). A VM user needs to be able to stop and attempt to restart a failed instance.

(Note that, on the Propolis end, once an instance has failed, it can't be restarted--the Propolis zone needs to be destroyed and recreated.)

askfongjojo · 2023-07-25T21:42:32Z

One thing I think will be helpful is to avoid automatically setting the instance state to Failed whenever something falls outside the normal lifecycle state transitions. This is especially true when a VM instance is wiped during an unintended Propolis zone clean-up (e.g. because of a sled-agent crash and restart, or an orderly shutdown of a sled for maintenance before instance migration can take place). This class of scenarios fits the condition of a stopped instance because the Propolis zone is already gone, unlike failed instances, which are subject to the condition mentioned above, i.e.:

on the Propolis end, once an instance has failed, it can't be restarted--the Propolis zone needs to be destroyed and recreated
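The distinction suggested above could be sketched roughly as follows. This is only an illustration of the idea, not Nexus or sled-agent code: `VmLossCause`, `state_after_loss`, and the two-variant `InstanceState` are hypothetical names invented for this example.

```rust
/// Hypothetical summary of why the control plane lost a running VM.
/// This type does not exist in Nexus; it is an assumption for illustration.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmLossCause {
    /// Propolis reported a failure. Its zone still exists and has to be
    /// destroyed and recreated before the instance can run again.
    PropolisFailed,
    /// The Propolis zone is already gone (e.g. sled-agent crash and restart,
    /// or a sled shut down for maintenance before migration).
    ZoneGone,
}

/// Simplified stand-in for the real instance state enum.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Stopped,
    Failed,
}

/// Pick the state to record after a VM is lost, following the suggestion
/// that a vanished zone looks like a stopped instance rather than a failed one.
#[allow(dead_code)]
fn state_after_loss(cause: VmLossCause) -> InstanceState {
    match cause {
        VmLossCause::ZoneGone => InstanceState::Stopped,
        VmLossCause::PropolisFailed => InstanceState::Failed,
    }
}
```

Under this sketch, an instance whose zone vanished would land in `Stopped` and could be started again through the ordinary path, while `Failed` would be reserved for cases where a broken Propolis zone still needs cleanup.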


hawkw · 2024-09-24T17:24:03Z

#6455 allowed failed instances to be restarted. I'm currently working on allowing them to be stopped as well.

PR #6503 changed Nexus to attempt to automatically restart instances which are in the `Failed` state. Now that we do this, we should probably change the allowable instance state transitions to permit a user to stop an instance that is `Failed`, as a way to say "stop trying to restart this instance" (as `Stopped` instances are not restarted). This branch changes `Nexus::instance_request_state` and `select_instance_change_action` to permit stopping a `Failed` instance. Fixes #6640. I believe this also fixes #2825, along with #6455 (which allowed restarting `Failed` instances).
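A minimal sketch of the resulting policy, assuming the other arms stay as in the snippet quoted at the top of this issue. It is not the actual body of `Nexus::instance_request_state` or `select_instance_change_action`; `Requested`, `request_allowed`, and `auto_restart_candidate` are hypothetical names used only for illustration.

```rust
/// Mirrors the variants listed in the snippet quoted in the issue.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Creating,
    Starting,
    Running,
    Stopping,
    Stopped,
    Rebooting,
    Migrating,
    Repairing,
    Failed,
    Destroyed,
}

/// Hypothetical user request; the real API has its own request type.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Requested {
    Start,
    Stop,
}

/// Whether a user request is accepted in the given state (sketch only).
#[allow(dead_code)]
fn request_allowed(current: InstanceState, requested: Requested) -> bool {
    match (current, requested) {
        // Restarting a Failed instance is described as allowed by #6455,
        // and stopping one as allowed by this branch.
        (InstanceState::Failed, Requested::Start | Requested::Stop) => true,
        // Assumption: these states still reject changes, as in the old check.
        (
            InstanceState::Migrating
            | InstanceState::Repairing
            | InstanceState::Destroyed,
            _,
        ) => false,
        _ => true,
    }
}

/// Only Failed instances are picked up for automatic restart, so moving
/// Failed -> Stopped is how a user says "stop trying to restart this".
#[allow(dead_code)]
fn auto_restart_candidate(state: InstanceState) -> bool {
    state == InstanceState::Failed
}
```

The point of the second function is the interaction described above: once a user stops a `Failed` instance it becomes `Stopped`, and `Stopped` instances are not restarted automatically.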

gjcolombo added the nexus label (Related to nexus) Apr 12, 2023

gjcolombo added this to the MVP milestone Apr 12, 2023

jordanhendricks mentioned this issue May 24, 2023

sled agent should terminate Propolis zones when Propolis indicates a previously-started VM has gone missing #3209

Closed

smklein mentioned this issue Jul 25, 2023

Tracking: Instance Lifecycle Overhaul #3742

Open

13 tasks

gjcolombo mentioned this issue Oct 7, 2023

Revisit moving instances to Failed in handle_instance_put_result #4226

Closed

gjcolombo mentioned this issue Jan 23, 2024

need a way to trigger cleanup and next steps for vanished instances #4872

Closed

askfongjojo added the known issue label (To include in customer documentation and training) Mar 9, 2024

askfongjojo modified the milestones: MVP, 8 Mar 9, 2024

hawkw self-assigned this Mar 28, 2024

morlandi7 modified the milestones: 8, 9 May 2, 2024

morlandi7 modified the milestones: 9, 10 Jul 1, 2024

morlandi7 modified the milestones: 10, 11 Aug 14, 2024

hawkw mentioned this issue Sep 24, 2024

[nexus] want to allow stopping Failed instances #6640

Closed

hawkw mentioned this issue Sep 24, 2024

[nexus] Allow stopping Failed instances #6652

Merged

hawkw closed this as completed in #6652 Sep 24, 2024

hawkw closed this as completed in 0c7fb27 Sep 24, 2024
