Revisit moving instances to Failed in `handle_instance_put_result` #4226

gjcolombo · 2023-10-07T00:56:29Z

Follow-up from the refactoring in #4194.

If Nexus calls sled agent to do something that registers an instance or changes its state, sled agent will try to return a Result<Option<SledInstanceState>, Error>. Nexus will then get back one of the following values:

Ok(None) if the call succeeded and this didn't change what sled agent thinks the instance/VMM state is
Ok(Some(new_state)) if the call succeeded and sled agent wants Nexus to try to write some new state back to CRDB
Err(e) if the call failed, where e: progenitor_client::Error<sled_agent_client::Error>

The first two cases are handled by simply trying to write a state update to CRDB and aren't important here. The error case, however, walks a tightrope:

omicron/nexus/src/app/instance.rs

Lines 891 to 943 in 2bb0fdc

    
           Err(e) => { 
        
               // The sled-agent has told us that it can't do what we 
        
               // requested, but does that mean a failure? One example would be 
        
               // if we try to "reboot" a stopped instance. That shouldn't 
        
               // transition the instance to failed. But if the sled-agent 
        
               // *can't* boot a stopped instance, that should transition 
        
               // to failed. 
        
               // 
        
               // Without a richer error type, let the sled-agent tell Nexus 
        
               // what to do with status codes. 
        
               error!(self.log, "saw {} from instance_put!", e); 
        
               // Convert to the Omicron API error type. 
        
               // 
        
               // N.B. The match below assumes that this conversion will turn 
        
               //      any 400-level error status from sled agent into an 
        
               //      `Error::InvalidRequest`. 
        
               let e = e.into(); 
        
               match &e { 
        
                   // Bad request shouldn't change the instance state. 
        
                   Error::InvalidRequest { .. } => Err(e), 
        
                   // Internal server error (or anything else) should change 
        
                   // the instance state to failed, we don't know what state 
        
                   // the instance is in. 
        
                   _ => { 
        
                       let new_runtime = db::model::InstanceRuntimeState { 
        
                           state: db::model::InstanceState::new( 
        
                               InstanceState::Failed, 
        
                           ), 
        
                           gen: db_instance.runtime_state.gen.next().into(), 
        
                           ..db_instance.runtime_state.clone() 
        
                       }; 
        
                       // XXX what if this fails? 
        
                       let result = self 
        
                           .db_datastore 
        
                           .instance_update_runtime( 
        
                               &db_instance.id(), 
        
                               &new_runtime, 
        
                           ) 
        
                           .await; 
        
                       error!( 
        
                           self.log, 
        
                           "saw {:?} from setting InstanceState::Failed after bad instance_put", 
        
                           result, 
        
                       ); 
        
                       Err(e) 
        
                   } 
        
               }

Nexus will take the Progenitor error it got back, feed it through a From impl to convert it to an Omicron external API error, and then see if that error is a Bad Request error or something else. In the former case, the update is dropped; in the latter case, Nexus assumes some catastrophe has happened and that the instance has to be marked as Failed.

The relevant From impl is here:

omicron/common/src/api/external/error.rs

Lines 355 to 418 in c030293

    
           // TODO-security this `From` may give us a shortcut in situtations where we 
        
           // want more robust consideration of errors. For example, while some errors 
        
           // from other services may directly result in errors that percolate to the 
        
           // external client, others may require, for example, retries with an alternate 
        
           // service instance or additional interpretation to sanitize the output error. 
        
           // This should be removed to avoid leaking data. 
        
           impl<T: ClientError> From<progenitor::progenitor_client::Error<T>> for Error { 
        
               fn from(e: progenitor::progenitor_client::Error<T>) -> Self { 
        
                   match e { 
        
                       // This error indicates that the inputs were not valid for this API 
        
                       // call. It's reflective of either a client-side programming error. 
        
                       progenitor::progenitor_client::Error::InvalidRequest(msg) => { 
        
                           Error::internal_error(&format!("InvalidRequest: {}", msg)) 
        
                       } 
        
                       // This error indicates a problem with the request to the remote 
        
                       // service that did not result in an HTTP response code, but rather 
        
                       // pertained to local (i.e. client-side) encoding or network 
        
                       // communication. 
        
                       progenitor::progenitor_client::Error::CommunicationError(ee) => { 
        
                           Error::internal_error(&format!("CommunicationError: {}", ee)) 
        
                       } 
        
                       // This error represents an expected error from the remote service. 
        
                       progenitor::progenitor_client::Error::ErrorResponse(rv) => { 
        
                           let message = rv.message(); 
        
                           match rv.status() { 
        
                               http::StatusCode::SERVICE_UNAVAILABLE => { 
        
                                   Error::unavail(&message) 
        
                               } 
        
                               status if status.is_client_error() => { 
        
                                   Error::invalid_request(&message) 
        
                               } 
        
                               _ => Error::internal_error(&message), 
        
                           } 
        
                       } 
        
                       // This error indicates that the body returned by the client didn't 
        
                       // match what was documented in the OpenAPI description for the 
        
                       // service. This could only happen for us in the case of a severe 
        
                       // logic/encoding bug in the remote service or due to a failure of 
        
                       // our version constraints (i.e. that the call was to a newer 
        
                       // service with an incompatible response). 
        
                       progenitor::progenitor_client::Error::InvalidResponsePayload( 
        
                           ee, 
        
                       ) => Error::internal_error(&format!( 
        
                           "InvalidResponsePayload: {}", 
        
                           ee, 
        
                       )), 
        
                       // This error indicates that the client generated a response code 
        
                       // that was not described in the OpenAPI description for the 
        
                       // service; this could be a success or failure response, but either 
        
                       // way it indicates a logic or version error as above. 
        
                       progenitor::progenitor_client::Error::UnexpectedResponse(r) => { 
        
                           Error::internal_error(&format!( 
        
                               "UnexpectedResponse: status code {}", 
        
                               r.status(), 
        
                           )) 
        
                       } 
        
                   } 
        
               } 
        
           }

Notice the following:

Progenitor errors that didn't produce a response from sled agent--e.g. disconnections due to timeouts or network weather--all get turned into internal errors. These will cause an instance to move to Failed even if the problem was completely transient.
Sled agent responses get flattened into either 400 Bad Request (for all 4XX errors) or 500 Internal Server Error (everything else) unless they happen to be 503 Service Unavailable, which is preserved. (See sled agent and Nexus frequently flatten errors into 500 Internal Server Error #3238.)

This creates a great many ways for an instance to be marked as Failed, even if the VM is actually perfectly healthy. The Failed state is also a pain to deal with because a Failed instance can only be deleted and not recovered in any other way (see #2825 and especially #2825 (comment)).

Apart from all of this, an instance can be marked as Failed because of a Propolis failure: for example, if Propolis hits an error while setting up a VM component, it will publish the Failed state and that will bubble back up to Nexus via sled agent.

We should consider a subtler approach to all this. One possible starting place is to distinguish Propolis failures (which really do mean "this VM is dead") from errors accessing the instance from the control plane. We could then, e.g., indicate the latter faults without necessarily changing the instance state, allow further attempts to try to change the instance's state from its last-known state, and/or try to recover connectivity to the instance from a background task.

The text was updated successfully, but these errors were encountered:

…ype from calls to stop/reboot (#4711) Restore `instance_reboot` and `instance_stop` to their prior behavior: if these routines try to contact sled agent and get back a server error, mark the instance as unhealthy and move it to the Failed state. Also use `#[source]` instead of message interpolation in `InstanceStateChangeError::SledAgent`. This restores the status quo ante from #4682 in anticipation of reaching a better overall mechanism for dealing with failures to communicate about instances with sled agents. See #3206, #3238, and #4226 for more discussion. Tests: new integration test; stood up a dev cluster, started an instance, killed the zone with `zoneadm halt`, and verified that calls to reboot/stop the instance eventually marked it as Failed (due to a timeout attempting to contact the Propolis zone). Fixes #4709.

A number of bugs relating to guest instance lifecycle management have been observed. These include: - Instances getting "stuck" in a transient state, such as `Starting` or `Stopping`, with no way to forcibly terminate them (#4004) - Race conditions between instances starting and receiving state updates, which cause provisioning counters to underflow (#5042) - Instances entering and exiting the `Failed` state when nothing is actually wrong with them, potentially leaking virtual resources (#4226) These typically require support intervention to resolve. Broadly , these issues exist because the control plane's current mechanisms for understanding and managing an instance's lifecycle state machine are "kind of a mess". In particular: - **(Conceptual) ownership of the CRDB `instance` record is currently split between Nexus and sled-agent(s).** Although Nexus is the only entity that actually reads or writes to the database, the instance's runtime state is also modified by the sled-agents that manage its active Propolis (and, if it's migrating, it's target Propolis), and written to CRDB on their behalf by Nexus. This means that there are multiple copies of the instance's state in different places at the same time, which can potentially get out of sync. When an instance is migrating, its state is updated by two different sled-agents, and they may potentially generate state updates that conflict with each other. And, splitting the responsibility between Nexus and sled-agent makes the code more complex and harder to understand: there is no one place where all instance state machine transitions are performed. - **Nexus doesn't ensure that instance state updates are processed reliably.** Instance state transitions triggered by user actions, such as `instance-start` and `instance-delete`, are performed by distributed sagas, ensuring that they run to completion even if the Nexus instance executing them comes to an untimely end. This is *not* the case for operations that result from instance state transitions reported by sled-agents, which just happen in the HTTP APIs for reporting instance states. If the Nexus processing such a transition crashes, gets network partition'd, or encountering a transient error, the instance is left in an incomplete state and the remainder of the operation will not be performed. This branch rewrites much of the control plane's instance state management subsystem to resolve these issues. At a high level, it makes the following high-level changes: - **Nexus is now the sole owner of the `instance` record.** Sled-agents no longer have their own copies of an instance's `InstanceRuntimeState`, and do not generate changes to that state when reporting instance observations to Nexus. Instead, the sled-agent only publishes updates to the `vmm` and `migration` records (which are never modified by Nexus directly) and Nexus is the only entity responsible for determining how an instance's state should change in response to a VMM or migration state update. - **When an instance has an active VMM, its effective external state is determined primarily by the active `vmm` record**, so that fewer state transitions *require* changes to the `instance` record. PR #5854 laid the ground work for this change, but it's relevant here as well. - **All updates to an `instance` record (and resources conceptually owned by that instance) are performed by a distributed saga.** I've introduced a new `instance-update` saga, which is responsible for performing all changes to the `instance` record, virtual provisioning resources, and instance network config that are performed as part of a state transition. Moving this to a saga helps us to ensure that these operations are always run to completion, even in the event of a sudden Nexus death. - **Consistency of instance state changes is ensured by distributed locking.** State changes may be published by multiple sled-agents to different Nexus replicas. If one Nexus replica is processing a state change received from a sled-agent, and then the instance's state changes again, and the sled-agent publishes that state change to a *different* Nexus...lots of bad things can happen, since the second state change may be performed from the previous initial state, when it *should* have a "happens-after" relationship with the other state transition. And, some operations may contradict each other when performed concurrently. To prevent these race conditions, this PR has the dubious honor of using the first _distributed lock_ in the Oxide control plane, the "instance updater lock". I introduced the locking primitives in PR #5831 --- see that branch for more discussion of locking. - **Background tasks are added to prevent missed updates**. To ensure we cannot accidentally miss an instance update even if a Nexus dies, hits a network partition, or just chooses to eat the state update accidentally, we add a new `instance-updater` background task, which queries the database for instances that are in states that require an update saga without such a saga running, and starts the requisite sagas. Currently, the instance update saga runs in the following cases: - An instance's active VMM transitions to `Destroyed`, in which case the instance's virtual resources are cleaned up and the active VMM is unlinked. - Either side of an instance's live migration reports that the migration has completed successfully. - Either side of an instance's live migration reports that the migration has failed. The inner workings of the instance-update saga itself is fairly complex, and has some kind of interesting idiosyncrasies relative to the existing sagas. I've written up a [lengthy comment] that provides an overview of the theory behind the design of the saga and its principles of operation, so I won't reproduce that in this commit message. [lengthy comment]: https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254

Change sled agent's instance lookup tables so that Propolis jobs are indexed by Propolis/VMM IDs instead of instance IDs. This is a prerequisite to revisiting how the Failed instance state works. See RFD 486 section 6.1 for all the details of why this is needed, but very broadly: when an instance's VMM is Destroyed, we'd like sled agent to tell Nexus that *before* the agent deregisters the instance from the sled, for reasons described in the RFD; but if we do that with no other changes, there's a race where Nexus may try to restart the instance on the same sled before sled agent can update its instance table, causing instance start to fail. To achieve this: - In sled agent, change the `InstanceManagerRunner`'s instance map to a `BTreeMap<PropolisUuid, Instance>`, then clean up all the compilation errors. - In Nexus: - Make callers of instance APIs furnish a Propolis ID instead of an instance ID. This is generally very straightforward because they already had to get a VMM record to figure out what sled to talk to. - Change `cpapi_instances_put` to take a Propolis ID instead of an instance ID. Regular sled agent still has both IDs, but with these changes, simulated sled agents only have a Propolis ID to work with, and plumbing an instance ID down to them requires significant code changes. - Update test code: - Unify the Nexus helper routines that let integration tests get sled agent clients or sled IDs; now they get a single struct containing both of those and the instance's Propolis IDs. - Update users of the simulated agent's `poke` endpoints to use Propolis IDs. - Delete the "detach disks on instance stop" bits of simulated sled agent. These don't appear to be load-bearing, they don't correspond to any behavior in the actual sled agent (which doesn't manage disk attachment or detachment), and it was a pain to rework them to work with Propolis IDs. Tests: cargo nextest. Related: #4226 and #4872, among others.

hawkw · 2024-09-24T17:22:47Z

I think #6455, which implemented RFD 486's more principled rules for what sled-agent errors should move an instance to Failed, has closed this. @gjcolombo do you agree?

gjcolombo · 2024-09-24T22:31:59Z

Yeah, I think we've taken care of this one.

gjcolombo added Sled Agent Related to the Per-Sled Configuration and Management nexus Related to nexus virtualization Propolis Integration & VM Management labels Oct 7, 2023

This was referenced Dec 13, 2023

Let start saga handle unwinding from sled agent instance PUT errors #4682

Merged

Move instances to Failed if sled agent returns an "unhealthy" error type from calls to stop/reboot #4711

Merged

gjcolombo mentioned this issue Jan 23, 2024

need a way to trigger cleanup and next steps for vanished instances #4872

Closed

gjcolombo mentioned this issue Mar 6, 2024

sled-agent: don't block during instance creation request from nexus #4691

Merged

hawkw mentioned this issue Jul 23, 2024

Perform instance state transitions in instance-update saga #5749

Merged

gjcolombo mentioned this issue Aug 24, 2024

sled agent: index running VMMs by VMM ID, not instance ID #6429

Merged

gjcolombo closed this as completed Sep 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revisit moving instances to Failed in `handle_instance_put_result` #4226

Revisit moving instances to Failed in `handle_instance_put_result` #4226

gjcolombo commented Oct 7, 2023

hawkw commented Sep 24, 2024

gjcolombo commented Sep 24, 2024

Revisit moving instances to Failed in handle_instance_put_result #4226

Revisit moving instances to Failed in handle_instance_put_result #4226

Comments

gjcolombo commented Oct 7, 2023

hawkw commented Sep 24, 2024

gjcolombo commented Sep 24, 2024

Revisit moving instances to Failed in `handle_instance_put_result` #4226

Revisit moving instances to Failed in `handle_instance_put_result` #4226