
sled agent and Nexus frequently flatten errors into 500 Internal Server Error #3238

Closed
gjcolombo opened this issue May 26, 2023 · 3 comments · Fixed by #6726
@gjcolombo (Contributor):

Sled agent has a relatively rich set of internal error types, but at the Dropshot HTTP error boundary, it converts almost all of them to 500 errors:

// Provide a more specific HTTP error for some sled agent errors.
impl From<Error> for dropshot::HttpError {
    fn from(err: Error) -> Self {
        match err {
            crate::sled_agent::Error::Instance(instance_manager_error) => {
                match instance_manager_error {
                    crate::instance_manager::Error::Instance(
                        instance_error,
                    ) => match instance_error {
                        crate::instance::Error::Propolis(propolis_error) => {
                            match propolis_error.status() {
                                None => HttpError::for_internal_error(
                                    propolis_error.to_string(),
                                ),
                                Some(status_code) => {
                                    HttpError::for_status(None, status_code)
                                }
                            }
                        }
                        crate::instance::Error::Transition(omicron_error) => {
                            // Preserve the status associated with the wrapped
                            // Omicron error so that Nexus will see it in the
                            // Progenitor client error it gets back.
                            HttpError::from(omicron_error)
                        }
                        e => HttpError::for_internal_error(e.to_string()),
                    },
                    e => HttpError::for_internal_error(e.to_string()),
                }
            }
            e => HttpError::for_internal_error(e.to_string()),
        }
    }
}

This makes it hard for callers to reason about the cause or permanence of any of these errors. This was a pebble in the shoe of PR #2892 and is coming up again in #3230 (sled agent is making a call to Nexus whose response depends in part on calls back down into sled agent, and there's no way to be sure about the permanence of any errors returned from Nexus in that path because all errors are getting flattened into a single error code).
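To make the consequence concrete, here is a hypothetical caller-side triage helper (not Nexus code) showing the decision a caller would like to make, and can't, when everything arrives as a 500:

```rust
/// Hypothetical caller-side triage: decide whether a failed sled agent call
/// is worth retrying based only on the HTTP status code. Because sled agent
/// flattens nearly everything into 500, the most common answer is "no idea".
fn is_retryable(status_code: u16) -> Option<bool> {
    match status_code {
        // Transient: the service asked us to try again later.
        503 => Some(true),
        // Permanent: the thing we were operating on no longer exists.
        404 | 410 => Some(false),
        // 500 and anything else: could be a crashed Propolis, a full disk, or
        // a bug -- the caller can't tell, so it can't choose a sensible policy.
        _ => None,
    }
}
```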

There are similar paths through Nexus that flatten errors this way, which the fix for #3230 will have to take into account:

  • Updating instance state can update the instance's Dendrite configuration:
    self.instance_ensure_dpd_config(
        opctx,
        db_instance.id(),
        &sled.address(),
        None,
    )
    .await?;
  • instance_ensure_dpd_config calls dpd_client.ensure_nat_entry and maps all errors to 500s (a sketch of a less lossy mapping follows this list):
    dpd_client
        .ensure_nat_entry(
            &log,
            target_ip.ip,
            dpd_client::types::MacAddr { a: mac_address.into_array() },
            *target_ip.first_port,
            *target_ip.last_port,
            vni,
            sled_ip_address.ip(),
        )
        .await
        .map_err(|e| {
            Error::internal_error(&format!(
                "failed to ensure dpd entry: {e}"
            ))
        })?;
    }
  • ensure_nat_entry calls Dendrite daemon routines that can (I presume) produce transient failures (e.g. transient communication failures), e.g.:
        self.nat_ipv4_create(
            &network.ip(),
            target_first_port,
            target_last_port,
            &nat_target,
        )
        .await
    }
    ipnetwork::IpNetwork::V6(network) => {
        self.nat_ipv6_create(
            &network.ip(),
            target_first_port,
            target_last_port,
            &nat_target,
        )
        .await
    }
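A sketch of what a less lossy mapping could look like for the dpd call in the second bullet. The error types here are simplified stand-ins, not the real dpd-client or omicron_common definitions:

```rust
/// Simplified stand-ins for omicron_common::api::external::Error and the
/// dpd-client error type; illustrative only.
enum ApiError {
    ServiceUnavailable(String), // surfaces as 503: callers may retry
    InternalError(String),      // surfaces as 500: callers can't retry blindly
}

enum DpdClientError {
    /// No response at all (connection refused, timeout): almost certainly transient.
    Communication(String),
    /// Dendrite answered with an error status.
    ErrorResponse(u16, String),
}

/// Map a dpd-client failure without discarding what we know about it.
fn map_dpd_error(e: DpdClientError) -> ApiError {
    match e {
        DpdClientError::Communication(msg) => {
            ApiError::ServiceUnavailable(format!("failed to reach dpd: {msg}"))
        }
        // Dendrite explicitly said "try again later".
        DpdClientError::ErrorResponse(503, msg) => {
            ApiError::ServiceUnavailable(format!("dpd unavailable: {msg}"))
        }
        // Everything else really is unexpected; keep it a 500, but say why.
        DpdClientError::ErrorResponse(status, msg) => {
            ApiError::InternalError(format!("dpd returned {status}: {msg}"))
        }
    }
}
```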

Revisiting absolutely every single error conversion in sled agent and Nexus all at once is probably a non-starter, but we can likely start by improving sled agent's internal-error-to-Dropshot-error conversion and look for opportunities to remove other error-flattening cases as they arise.
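As a rough sketch of that direction: each layer could own a conversion that maps its own variants onto meaningful statuses and delegates the rest, instead of one nested match that defaults everything to 500. The error enums below are simplified stand-ins for sled agent's real types, not the actual definitions:

```rust
use dropshot::HttpError;
use http::StatusCode;

// Simplified stand-ins for sled agent's internal error types.
enum InstanceError {
    /// This sled no longer knows about the instance: a permanent condition.
    NoSuchInstance,
    /// Propolis rejected the request; its status code is worth preserving.
    Propolis(StatusCode, String),
    /// Anything else is a genuine internal failure.
    Other(String),
}

enum SledAgentError {
    Instance(InstanceError),
    Other(String),
}

impl From<InstanceError> for HttpError {
    fn from(e: InstanceError) -> Self {
        match e {
            // 410 tells Nexus this is permanent, so it can act accordingly.
            InstanceError::NoSuchInstance => {
                HttpError::for_status(None, StatusCode::GONE)
            }
            // Pass Propolis' own status through rather than flattening it.
            InstanceError::Propolis(status, _) => {
                HttpError::for_status(None, status)
            }
            InstanceError::Other(msg) => HttpError::for_internal_error(msg),
        }
    }
}

impl From<SledAgentError> for HttpError {
    fn from(e: SledAgentError) -> Self {
        match e {
            // Delegate: each layer knows its own variants best.
            SledAgentError::Instance(inner) => inner.into(),
            SledAgentError::Other(msg) => HttpError::for_internal_error(msg),
        }
    }
}
```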

@gjcolombo added the Sled Agent and nexus labels on May 26, 2023
@gjcolombo (Contributor, Author):

#3334 is a first step toward addressing this issue.

@jordanhendricks added the Debugging label on Aug 11, 2023
gjcolombo added a commit that referenced this issue Dec 18, 2023
…ype from calls to stop/reboot (#4711)

Restore `instance_reboot` and `instance_stop` to their prior behavior:
if these routines try to contact sled agent and get back a server error,
mark the instance as unhealthy and move it to the Failed state.

Also use `#[source]` instead of message interpolation in
`InstanceStateChangeError::SledAgent`.

This restores the status quo ante from #4682 in anticipation of reaching
a better overall mechanism for dealing with failures to communicate
about instances with sled agents. See #3206, #3238, and #4226 for more
discussion.

Tests: new integration test; stood up a dev cluster, started an
instance, killed the zone with `zoneadm halt`, and verified that calls
to reboot/stop the instance eventually marked it as Failed (due to a
timeout attempting to contact the Propolis zone).

Fixes #4709.
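For context on the `#[source]` note in the commit message above, a minimal thiserror sketch of the difference; the enum names and the `std::io::Error` inner type are stand-ins, not the real Nexus `InstanceStateChangeError` definition:

```rust
use thiserror::Error;

// Before: interpolating the inner error's message flattens it into a string,
// so callers can no longer inspect the original error.
#[derive(Debug, Error)]
enum StateChangeErrorBefore {
    #[error("sled agent client error: {0}")]
    SledAgent(String),
}

// After: keeping the inner error as a #[source] preserves the error chain,
// so callers can still walk it via std::error::Error::source().
#[derive(Debug, Error)]
enum StateChangeErrorAfter {
    #[error("sled agent client error")]
    SledAgent(#[source] std::io::Error),
}
```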
@morlandi7 added this to the 8 milestone on Apr 4, 2024
@hawkw self-assigned this on Apr 4, 2024
@hawkw (Member) commented on Apr 9, 2024:

Looking closer at this, a lot of the work here has already been addressed in #3334 and #4682. It's not particularly clear what else is left to do for this issue on the `nexus`/`sled-agent` side. Perhaps @gjcolombo will have more input on this when he's back from leave.

We do need to get propolis#649 merged, though, which is sort of related...

@hawkw (Member) commented on Sep 26, 2024:

It looks like sled-agent is still not quite doing the right thing here. If a sled-agent operation on behalf of Nexus encounters a client error from Propolis, it will now correctly forward that error code to Nexus...but Nexus only moves instances to Failed in the face of sled-agent's 410 NO_SUCH_INSTANCE, and not if it sees the NoInstance error code from Propolis. That means we're not really handling the "propolis restarted" case correctly in Nexus, and instances whose Propolis has restarted still won't be moved to Failed. Agh.

Furthermore, a complete solution to this would require more than just having Nexus know about the Propolis NoInstance error (or having sled-agent eat it and return its 410 NO_SUCH_INSTANCE equivalent). Instead, sled-agent also needs to know that that error means the VMM is gone for good, and take steps to collect a zone bundle and then destroy the zone. So there's some more stuff we kinda forgot to do here... 😅
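A rough sketch of the shape that handling could take on the sled-agent side; every name here (`handle_propolis_gone`, `collect_zone_bundle`, `destroy_zone`, `publish_vmm_failed`) is a hypothetical placeholder, not an actual sled-agent function:

```rust
use uuid::Uuid;

// Hypothetical placeholders standing in for sled-agent internals.
async fn collect_zone_bundle(_zone_name: &str) { /* gather logs and core files for debugging */ }
async fn destroy_zone(_zone_name: &str) { /* halt and uninstall the defunct Propolis zone */ }
async fn publish_vmm_failed(_vmm_id: Uuid) { /* report the Failed VMM state to Nexus */ }

/// When Propolis reports that it has no knowledge of the instance (i.e. it
/// crashed and was restarted), the VMM is gone for good: capture debugging
/// state, tear the zone down, and only then tell Nexus the VMM has Failed.
async fn handle_propolis_gone(zone_name: &str, vmm_id: Uuid) {
    collect_zone_bundle(zone_name).await;
    destroy_zone(zone_name).await;
    publish_vmm_failed(vmm_id).await;
}
```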

@hawkw modified the milestones: 12, 11 on Oct 1, 2024
@hawkw closed this as completed in 3093818 on Oct 1, 2024
hawkw added a commit that referenced this issue Oct 2, 2024
At present, sled-agent's handling of the error codes used by Propolis to
indicate that it has crashed and been restarted is woefully incorrect.
In particular, there are two cases where such an error may be
encountered by a sled-agent:

1. When attempting to ask the VMM to change state (e.g. reboot or stop
   the instance)
2. When hitting Propolis' `instance-state-monitor` API endpoint to
   proactively check the VM's current state

Neither of these is handled correctly today.

In the first case, if a sled-agent operation on behalf of Nexus
encounters a client error from Propolis, it will forward that error code
to Nexus...but, _Nexus_ only moves instances to `Failed` in the face of
sled-agent's `410 NO_SUCH_INSTANCE`, and _not_ [if it sees the
`NoInstance` error code from Propolis][1], which means that the ghosts
left behind by crashed and restarted Propolii still won't be moved to
`Failed`. Agh. Furthermore, in that case, sled-agent itself will not
perform the necessary cleanup actions to deal with the now-defunct
Propolis zone (e.g. collecting a zone bundle and then destroying the
zone).

In the second case, where we hit the instance-state-monitor endpoint and
get back a `NoInstance` error, things are equally dire. The
`InstanceMonitorRunner` task, which is responsible for polling the state
monitor endpoint, will just bail out on receipt of any error from
Propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L253-L289

We would _expect_ this to drop the send-side of the channel it uses to
communicate with the `InstanceRunner`, closing the channel, and hitting
this select, which would correctly mark the VMM as failed and do what we
want, despite eating the actual error message from propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L388-L394

HOWEVER, as the comment points out, the `InstanceRunner` is _also_
holding its own clone of the channel sender, keeping us from ever
hitting that case:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L308-L309

AFAICT, this means we'll just totally give up on monitoring the instance
as soon as we hit any error here, which seems...quite bad. I *think*
that this actually means that when a Propolis process crashes
unexpectedly, we'll get an error when the TCP connection closes, bail
out, and then _never try hitting the instance monitor endpoint ever
again_[^1]. So, we won't notice that the Propolis is actually out to
lunch until we try to ask it to change state. **This Is Not Great!**

This commit fixes both of these cases, by making sled-agent actually
handle Propolis' error codes correctly. I've added a dependency on the
`propolis_api_types` crate, which is where the error code lives, and
some code in sled-agent to attempt to parse these codes when receiving
an error response from the Propolis client. Now, when we try to PUT a
new state to the instance, we'll look at the error code that comes back,
mark it as `Failed` if the error indicates that we should do so, and
publish the `Failed` VMM state to Nexus before tearing down the zone.
The `InstanceMonitorTask` now behaves similarly, and I've changed it to
retry all other errors with a backoff, rather than just bailing out
immediately on receipt of the first error.

I've manually tested this on `london` as discussed here:

#6726 (comment)

Unfortunately, it's hard to do any kind of automated testing of this
with the current test situation, as none of this code is exercised by
the simulated sled-agent.

Fixes #3209
Fixes #3238

[^1]: Although I've not actually verified that this is what happens.

[1]: https://github.com/oxidecomputer/propolis/blob/516dabe473cecdc3baea1db98a80441968c144cf/crates/propolis-api-types/src/lib.rs#L434-L441
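A hedged sketch of the classify-then-act shape described in the commit message above; the error-code string, the helper names, and the backoff policy are illustrative placeholders, not the actual `propolis_api_types` or sled-agent definitions:

```rust
use std::time::Duration;

/// What the caller should do about a failed Propolis request.
enum ErrorAction {
    /// The VMM is gone (e.g. Propolis crashed and restarted): publish Failed
    /// and tear down the zone rather than retrying forever.
    MarkFailed,
    /// Anything else is treated as transient and retried with backoff.
    Retry,
}

/// Hypothetical classifier: the real code parses structured error codes from
/// propolis_api_types out of the error response, rather than matching strings.
fn classify_propolis_error(error_code: Option<&str>) -> ErrorAction {
    match error_code {
        Some("NoInstance") => ErrorAction::MarkFailed,
        _ => ErrorAction::Retry,
    }
}

/// Capped exponential backoff for the monitor task's retries, so a transient
/// error no longer ends instance monitoring permanently.
async fn retry_delay(attempt: u32) {
    let delay = Duration::from_millis(250).saturating_mul(1u32 << attempt.min(6));
    tokio::time::sleep(delay).await;
}
```

The point of the split is that both the state-change path and the monitor task can reuse the same classification while acting on MarkFailed in their own way (publishing the Failed state versus tearing down the zone).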