Propolis zone cleanup could happen outside of InstanceRunner #5237

Open
gjcolombo opened this issue Mar 9, 2024 · 0 comments
Labels
Sled Agent (Related to the Per-Sled Configuration and Management) · virtualization (Propolis Integration & VM Management)

Every instance managed by a sled agent has a "runner" task and a "monitor" task. All requests to do anything with the instance (change its state, forcibly remove it from the sled agent, etc.) need to execute on the runner task, which handles these requests sequentially. This includes the processing of state change messages from Propolis, which are sent from the monitor task to the runner task and processed here:

Some(Update { state, tx }) => {
    let observed = ObservedPropolisState::new(
        self.state.instance(),
        &state,
    );
    let reaction = self.observe_state(&observed).await;
    self.publish_state_to_nexus().await;
    // NOTE: If we fail to send here, the
    // InstanceMonitorRunner has stopped running for
    // some reason. We'd presumably handle that on the
    // next iteration of the loop.
    if let Err(_) = tx.send(reaction) {
        warn!(self.log, "InstanceRunner failed to send to InstanceMonitorRunner");
    }
},
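
For orientation, here is a minimal, self-contained sketch of that monitor/runner handoff, using illustrative names rather than the actual omicron types: the monitor forwards each observed Propolis state over an mpsc channel and then waits on a oneshot for the runner's reaction, so everything the runner does in response (including any zone teardown) happens before the monitor can report the next state.

// Minimal sketch; names and types here are illustrative, not the actual
// sled agent implementation.
use tokio::sync::{mpsc, oneshot};

enum Reaction {
    Continue,
    Terminate,
}

struct Update {
    state: String,                 // stand-in for the observed Propolis state
    tx: oneshot::Sender<Reaction>, // carries the runner's reaction back
}

// Monitor side: report each observation, then wait for the runner's verdict.
async fn monitor_task(runner_tx: mpsc::Sender<Update>) {
    loop {
        let state = observe_propolis().await;
        let (tx, rx) = oneshot::channel();
        if runner_tx.send(Update { state, tx }).await.is_err() {
            return; // runner task has gone away
        }
        // Blocks until the runner has fully handled the observation,
        // including any zone teardown it decided to do.
        match rx.await {
            Ok(Reaction::Continue) => continue,
            Ok(Reaction::Terminate) | Err(_) => return,
        }
    }
}

// Runner side: one loop, one task; monitor updates and API requests alike
// are handled strictly one at a time.
async fn runner_task(mut rx: mpsc::Receiver<Update>) {
    while let Some(Update { state, tx }) = rx.recv().await {
        let reaction = if state == "Destroyed" {
            // ...zone bundle, zone teardown, deregistration happen here...
            Reaction::Terminate
        } else {
            Reaction::Continue
        };
        let _ = tx.send(reaction);
    }
}

async fn observe_propolis() -> String {
    // Stand-in for the long poll against Propolis's instance-state monitor.
    "Destroyed".to_string()
}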

If Propolis indicated that it shut down, InstanceRunner::observe_state will call InstanceRunner::terminate before returning and allowing new state to be published to Nexus:

// If the zone is now safe to terminate, tear it down and discard the
// instance ticket before returning and publishing the new instance
// state to Nexus. This ensures that the instance is actually gone from
// the sled when Nexus receives the state update saying it's actually
// destroyed.
match action {
    Some(InstanceAction::Destroy) => {
        info!(self.log, "terminating VMM that has exited";
            "instance_id" => %self.id());
        self.terminate().await;
        Reaction::Terminate
    }
    None => Reaction::Continue,
}

Terminating an instance this way collects a zone bundle on the runner task before tearing down the zone and removing the instance from the sled agent's InstanceManager:

// Take a zone bundle whenever this instance stops.
if let Err(e) = self
    .zone_bundler
    .create(
        &running_state.running_zone,
        ZoneBundleCause::TerminatedInstance,
    )
    .await
{
    error!(
        self.log,
        "Failed to take zone bundle for terminated instance";
        "zone_name" => &zname,
        "reason" => ?e,
    );
}

// Ensure that no zone exists. This succeeds even if no zone was ever
// created.
// NOTE: we call `Zones::halt_and_remove_logged` directly instead of
// `RunningZone::stop` in case we're called between creating the
// zone and assigning `running_state`.
warn!(self.log, "Halting and removing zone: {}", zname);
Zones::halt_and_remove_logged(&self.log, &zname).await.unwrap();

// Remove ourselves from the instance manager's map of instances.
self.instance_ticket.deregister();

// See if there are any runtime objects to clean up.
//
// We already removed the zone above but mark it as stopped
running_state.running_zone.stop().await.unwrap();

// Remove any OPTE ports from the port manager.
running_state.running_zone.release_opte_ports();

Creating a zone bundle may be an expensive operation, since it (potentially) has to copy and compress many different log files and command outputs. This causes a couple of problems:

  • Instance stop will appear to take much longer than it actually does, because the relevant state transition doesn't go to Nexus until the zone is destroyed.
  • More importantly, the instance remains in the sled agent's instance table, but the InstanceRunner is completely blocked while all this work is going on. This means that sled agent API calls targeting an instance in this state are highly likely to time out waiting for the runner to respond (see the sketch after this list). If the caller is Nexus and the request is to change the instance's state, this can cause instances to be marked as Failed (due to the error conversion rules described in #3238, "sled agent and Nexus frequently flatten errors into 500 Internal Server Error") even though they would correctly go to Stopped if left alone.
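
To make the second problem concrete, here is a rough, self-contained simulation. The request type, channel shape, and timings are hypothetical stand-ins, not the real sled agent API: while the runner is stuck in cleanup, a queued state-change request never gets dequeued before the caller's timeout elapses.

use std::time::Duration;
use tokio::sync::{mpsc, oneshot};
use tokio::time::{sleep, timeout};

// Hypothetical request type; the real sled agent API is richer than this.
enum Request {
    PutState { reply: oneshot::Sender<&'static str> },
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Request>(8);

    // "Runner" task: pretend it is stuck collecting a zone bundle and tearing
    // down the zone before it can dequeue the next request.
    tokio::spawn(async move {
        sleep(Duration::from_secs(60)).await; // stand-in for terminate().await
        while let Some(Request::PutState { reply }) = rx.recv().await {
            let _ = reply.send("Stopped");
        }
    });

    // "Nexus" caller: a state-change request with a client-side timeout.
    let (reply_tx, reply_rx) = oneshot::channel();
    if tx.send(Request::PutState { reply: reply_tx }).await.is_err() {
        eprintln!("runner task is gone");
        return;
    }
    match timeout(Duration::from_secs(5), reply_rx).await {
        Ok(Ok(state)) => println!("instance is now {state}"),
        _ => println!("request timed out; Nexus may mark the instance Failed"),
    }
}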

The reason I made these operations happen in this order (zone bundle collection -> zone teardown -> deregister instance -> publish to Nexus) was to try to mitigate #3325. I suspect, though, that that issue is not as much of a problem now that every newly-started instance gets a fresh Propolis ID (not the case prior to #4194), such that every incarnation of an instance on a sled will get a distinct zone name. If that's so, then it should be possible to mitigate these problems by changing what happens on VMM shutdown: the runner can remove the instance from the table, publish the new VMM state to Nexus, and then hand the defunct zone off to some other task to be cleaned up.[1]
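
A sketch of what that could look like, again with illustrative stand-ins rather than the actual omicron types: once the terminal state has been published and the instance is out of the table, the defunct zone is handed to a detached task, and the runner is free immediately.

// Illustrative sketch of the proposed shape; hypothetical types, not the
// omicron implementation.
use std::time::Duration;
use tokio::time::sleep;

struct DefunctZone {
    // With a fresh Propolis ID per start, each incarnation's zone name is
    // unique, so background teardown can't collide with a restarted instance.
    name: String,
}

impl DefunctZone {
    async fn collect_bundle(&self) {
        // Stand-in for copying and compressing logs into a zone bundle.
        sleep(Duration::from_secs(30)).await;
    }

    async fn halt_and_remove(&self) {
        // Stand-in for Zones::halt_and_remove_logged.
        sleep(Duration::from_secs(1)).await;
    }
}

fn on_vmm_shutdown(zone: DefunctZone) {
    // 1. Remove the instance from the InstanceManager's table (elided), so new
    //    requests fail fast with "instance gone" instead of queueing here.
    // 2. Publish the terminal VMM state to Nexus (elided), so the stop is
    //    visible as soon as it actually happened.
    // 3. Hand the defunct zone to a detached task; the runner returns at once.
    tokio::spawn(async move {
        zone.collect_bundle().await;
        zone.halt_and_remove().await;
        println!("zone {} cleaned up in the background", zone.name);
    });
}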

Footnotes

  1. Cleaning up the defunct zone outside of the InstanceRunner task allows that task to return immediately and produce a "runner task closed" error message for anyone who happened to have requested something of the runner while it was in the "observe state change" arm of its request handler. If zone cleanup happens on the runner, Nexus's instance state will still update right away, but calls that land in this window are much more likely to time out than to get a clean "instance gone" error.
