reincarnate instances with `SagaUnwound` VMMs #6638

hawkw · 2024-09-23T18:57:13Z

When an instance-start saga unwinds, any VMM it created transitions to the SagaUnwound state. This causes the instance's effective state to appear as Failed in the external API. PR #6503 added functionality to Nexus to automatically restart instances that are in the Failed state ("instance reincarnation"). However, this code will not automatically restart instances whose instance-start sagas have unwound, because such instances are not actually in the Failed state from Nexus' perspective.

We should be automatically restarting instances whose start sagas have failed. From the user's perspective, the purpose of instance reincarnation is "if I say that I would like this instance to be running, the system should make sure it is running" --- it doesn't really matter if the reason it's not running is because it couldn't be started, or because it was running and then something happened to it. Furthermore, it seems not great to say "instances in the Failed state will be automatically restarted" and then have some class of instances which appear to be in the Failed state in the UI but aren't actually automatically restarted. Therefore, we should add code to the instance-reincarnation RPW to reincarnate instances with SagaUnwound VMMs.

I don't really think it makes sense for the auto-restart cooldown period to apply to SagaUnwound instances even if they have failed previously. The purpose of the cooldown period is to prevent "repeat offenders" that fail repeatedly from compromising overall availability, especially when those instances fail in ways that effect other instances on the same sled. However, if an instance's start saga fails, it was never actually running,¹ so the failure was not due to a problem with the instance. Therefore, I think the cooldown period need not apply here. We may want to consider a separate cooldown period between start failures, to prevent an instance that, for some reason, consistently fail to be started (i.e. due to requesting an allocation of virtual resources which aren't actually available) from constantly being restarted in a hot loop...

The last action in the instance-start saga is to send a request to the sled-agent, and thus Propolis, to tell it to start trying to boot the instance. If the instance boots and then crashes, that will go through the normal cpapi_instances_put -> process_instance_update path that transitions the VMM to Failed rather than putting it in SagaUnwound. ↩

The text was updated successfully, but these errors were encountered:

hawkw · 2024-09-23T18:57:46Z

cc @gjcolombo, I'd love your take on the cooldown period bit.

hawkw · 2024-09-23T19:01:01Z

Honestly, we may even want to re-queue instance-start sagas for instances with the InstanceAutoRestart::Never policy (even though that sounds kinda weird). Re-attempting to start something that was never actually started is somewhat different from _re_starting from something that failed at runtime, and if an instance's policy disables automatic restarts, that means that any start saga that we ran for it was at the behest of a user's explicit action. So there's a sense in which retrying a failed instance-start saga isn't really "automatically restarting" an instance as much as it's "continuing our attempt to drive the instance to the state requested by the user".

hawkw · 2024-09-23T19:21:21Z

@gjcolombo pointed out that I'm wrong about SagaUnwound instances appearing Failed externally. At present, they actually appear to be Stopped:

omicron/nexus/db-queries/src/db/datastore/instance.rs

Lines 157 to 163 in 165ddc2

    
           // - An instance with a "saga unwound" VMM, on the other hand, can 
        
           //   be treated as "stopped", since --- unlike "destroyed" --- a new 
        
           //   start saga can run at any time by just clearing out the old VMM 
        
           //   ID. 
        
           (InstanceState::Vmm, Some(VmmState::SagaUnwound)) => { 
        
               external::InstanceState::Stopped 
        
           }

I believe that, at the time, we did that because when SagaUnwound was introduced, the Failed state could not be restarted, either manually or automatically. Now, though, if we're going to start restarting instances with SagaUnwound VMMs, SagaUnwound should probably be reported as Failed rather than Stopped, since Stopped doesn't mean "we may try to restart this".

gjcolombo · 2024-09-23T19:26:28Z

A couple things come to mind here. One is that I don't think the start saga (currently) guarantees that all transitions to SagaUnwound are transient issues. For example, if there's a bug where an instance can't start because Nexus/sled agent deterministically translate its configuration into parameters that Propolis thinks are invalid, the saga will always fail in its instance_ensure_registered step. I think it would be valuable to have both a cooldown and a maximum number of retries to help forestall retry storms.

there's a sense in which retrying a failed instance-start saga isn't really "automatically restarting" an instance as much as it's "continuing our attempt to drive the instance to the state requested by the user".

As you pointed out just above, today SagaUnwound maps to the Stopped state; if we're going to do this, we'll either need to map this state to Failed or introduce some other way of conveying whether a Stopped instance is eligible to be automatically restarted or not. (I would prefer the former, along with allowing requests to stop a Failed instance, which should just move the instance from Failed to Stopped so that it will no longer be auto-started.)

This is editorializing a bit, but in general, I think trying to reason about the user's intended instance state in the control plane is somewhat perilous, since a running guest can shut down and move its own instance to Stopped, and we don't ever really know whether that was intended or not (did the user do it on purpose? did the guest crash and halt itself? is some other malicious guest actor shutting the guest down without crashing it?).

hawkw · 2024-09-23T19:34:43Z

Yeah, I agree that, at least for now, we should apply the cooldown regardless of whether an instance is SagaUnwound or properly Failed. If we eventually start reasoning about failure reasons so that the cooldown period can consider whether an instance has failed for the same or different reasons, we could revisit that.

Currently, instances whose active VMM is `SagaUnwound` appear externally as `Stopped`. We decided to report them as `Stopped` because start sagas are permitted to run for instances with `SagaUnwound` active VMMs, and --- at the time when the `SagaUnwound` VMM state was introduced, `Failed` instances could not be started. However, #6455 added the ability to restart `Failed` instances, and #6652 will permit them to be stopped. Therefore, we should recast instances with `SagaUnwound` active VMMs as `Failed`: they weren't asked politely to stop; instead, we attempted to start them and something went wrong...which sounds like `Failed` to me. This becomes more important in light of #6638: if we will attempt automatically restart such instances, they should definitely appear to be `Failed`. The distinction between `Failed` and `Stopped` becomes that `Failed` means "this thing isn't running, but it's supposed to be; we may try to fix that for you if permitted to do so", while `Stopped` means "this thing isn't running and that's fine, because you asked for it to no longer be running". Thus, this commit changes `SagaUnwound` VMMs to appear `Failed` externally.

hawkw added the nexus Related to nexus label Sep 23, 2024

hawkw self-assigned this Sep 23, 2024

hawkw mentioned this issue Sep 23, 2024

[nexus] want to allow stopping Failed instances #6640

Closed

This was referenced Sep 24, 2024

[nexus] Consider SagaUnwound instances Failed #6658

Merged

[nexus] Reincarnate instances with SagaUnwound VMMs #6669

Merged

hawkw closed this as completed in #6669 Sep 27, 2024

hawkw closed this as completed in d7235b8 Sep 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reincarnate instances with `SagaUnwound` VMMs #6638

reincarnate instances with `SagaUnwound` VMMs #6638

hawkw commented Sep 23, 2024

hawkw commented Sep 23, 2024

hawkw commented Sep 23, 2024

hawkw commented Sep 23, 2024

gjcolombo commented Sep 23, 2024

hawkw commented Sep 23, 2024

reincarnate instances with SagaUnwound VMMs #6638

reincarnate instances with SagaUnwound VMMs #6638

Comments

hawkw commented Sep 23, 2024

Footnotes

hawkw commented Sep 23, 2024

hawkw commented Sep 23, 2024

hawkw commented Sep 23, 2024

gjcolombo commented Sep 23, 2024

hawkw commented Sep 23, 2024

reincarnate instances with `SagaUnwound` VMMs #6638

reincarnate instances with `SagaUnwound` VMMs #6638