
sled agent and Nexus frequently flatten errors into 500 Internal Server Error #3238

Closed
gjcolombo opened this issue May 26, 2023 · 3 comments · Fixed by #6726
@gjcolombo (Contributor):

Sled agent has a relatively rich set of internal error types, but at the Dropshot HTTP error boundary, it converts almost all of them to 500 errors:

// Provide a more specific HTTP error for some sled agent errors.
impl From<Error> for dropshot::HttpError {
    fn from(err: Error) -> Self {
        match err {
            crate::sled_agent::Error::Instance(instance_manager_error) => {
                match instance_manager_error {
                    crate::instance_manager::Error::Instance(
                        instance_error,
                    ) => match instance_error {
                        crate::instance::Error::Propolis(propolis_error) => {
                            match propolis_error.status() {
                                None => HttpError::for_internal_error(
                                    propolis_error.to_string(),
                                ),
                                Some(status_code) => {
                                    HttpError::for_status(None, status_code)
                                }
                            }
                        }
                        crate::instance::Error::Transition(omicron_error) => {
                            // Preserve the status associated with the wrapped
                            // Omicron error so that Nexus will see it in the
                            // Progenitor client error it gets back.
                            HttpError::from(omicron_error)
                        }
                        e => HttpError::for_internal_error(e.to_string()),
                    },
                    e => HttpError::for_internal_error(e.to_string()),
                }
            }
            e => HttpError::for_internal_error(e.to_string()),
        }
    }
}

This makes it hard for callers to reason about the cause or permanence of any of these errors. This was a pebble in the shoe of PR #2892 and is coming up again in #3230 (sled agent is making a call to Nexus whose response depends in part on calls back down into sled agent, and there's no way to be sure about the permanence of any errors returned from Nexus in that path because all errors are getting flattened into a single error code).
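To make the consequence concrete, here is a hypothetical caller-side triage helper (not Nexus code) showing the decision a caller would like to make, and can't, when everything arrives as a 500:

```rust
/// Hypothetical caller-side triage: decide whether a failed sled agent call
/// is worth retrying based only on the HTTP status code. Because sled agent
/// flattens nearly everything into 500, the most common answer is "no idea".
fn is_retryable(status_code: u16) -> Option<bool> {
    match status_code {
        // Transient: the service asked us to try again later.
        503 => Some(true),
        // Permanent: the thing we were operating on no longer exists.
        404 | 410 => Some(false),
        // 500 and anything else: could be a crashed Propolis, a full disk, or
        // a bug -- the caller can't tell, so it can't choose a sensible policy.
        _ => None,
    }
}
```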

There are similar paths through Nexus that flatten errors this way, which the fix for #3230 will have to take into account:

  • Updating instance state can update the instance's Dendrite configuration:
    self.instance_ensure_dpd_config(
        opctx,
        db_instance.id(),
        &sled.address(),
        None,
    )
    .await?;
  • instance_ensure_dpd_config calls dpd_client.ensure_nat_entry and maps all errors to 500s (a sketch of a less lossy mapping follows this list):
    dpd_client
        .ensure_nat_entry(
            &log,
            target_ip.ip,
            dpd_client::types::MacAddr { a: mac_address.into_array() },
            *target_ip.first_port,
            *target_ip.last_port,
            vni,
            sled_ip_address.ip(),
        )
        .await
        .map_err(|e| {
            Error::internal_error(&format!(
                "failed to ensure dpd entry: {e}"
            ))
        })?;
    }
  • ensure_nat_entry calls Dendrite daemon routines that can (I presume) produce transient failures (e.g. transient communication failures), e.g.:
        self.nat_ipv4_create(
            &network.ip(),
            target_first_port,
            target_last_port,
            &nat_target,
        )
        .await
    }
    ipnetwork::IpNetwork::V6(network) => {
        self.nat_ipv6_create(
            &network.ip(),
            target_first_port,
            target_last_port,
            &nat_target,
        )
        .await
    }
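A sketch of what a less lossy mapping could look like for the dpd call in the second bullet. The error types here are simplified stand-ins, not the real dpd-client or omicron_common definitions:

```rust
/// Simplified stand-ins for omicron_common::api::external::Error and the
/// dpd-client error type; illustrative only.
enum ApiError {
    ServiceUnavailable(String), // surfaces as 503: callers may retry
    InternalError(String),      // surfaces as 500: callers can't retry blindly
}

enum DpdClientError {
    /// No response at all (connection refused, timeout): almost certainly transient.
    Communication(String),
    /// Dendrite answered with an error status.
    ErrorResponse(u16, String),
}

/// Map a dpd-client failure without discarding what we know about it.
fn map_dpd_error(e: DpdClientError) -> ApiError {
    match e {
        DpdClientError::Communication(msg) => {
            ApiError::ServiceUnavailable(format!("failed to reach dpd: {msg}"))
        }
        // Dendrite explicitly said "try again later".
        DpdClientError::ErrorResponse(503, msg) => {
            ApiError::ServiceUnavailable(format!("dpd unavailable: {msg}"))
        }
        // Everything else really is unexpected; keep it a 500, but say why.
        DpdClientError::ErrorResponse(status, msg) => {
            ApiError::InternalError(format!("dpd returned {status}: {msg}"))
        }
    }
}
```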

Revisiting absolutely every single error conversion in sled agent and Nexus all at once is probably a non-starter, but we can likely start by improving sled agent's internal-error-to-Dropshot-error conversion and look for opportunities to remove other error-flattening cases as they arise.
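As a rough sketch of that direction: each layer could own a conversion that maps its own variants onto meaningful statuses and delegates the rest, instead of one nested match that defaults everything to 500. The error enums below are simplified stand-ins for sled agent's real types, not the actual definitions:

```rust
use dropshot::HttpError;
use http::StatusCode;

// Simplified stand-ins for sled agent's internal error types.
enum InstanceError {
    /// This sled no longer knows about the instance: a permanent condition.
    NoSuchInstance,
    /// Propolis rejected the request; its status code is worth preserving.
    Propolis(StatusCode, String),
    /// Anything else is a genuine internal failure.
    Other(String),
}

enum SledAgentError {
    Instance(InstanceError),
    Other(String),
}

impl From<InstanceError> for HttpError {
    fn from(e: InstanceError) -> Self {
        match e {
            // 410 tells Nexus this is permanent, so it can act accordingly.
            InstanceError::NoSuchInstance => {
                HttpError::for_status(None, StatusCode::GONE)
            }
            // Pass Propolis' own status through rather than flattening it.
            InstanceError::Propolis(status, _) => {
                HttpError::for_status(None, status)
            }
            InstanceError::Other(msg) => HttpError::for_internal_error(msg),
        }
    }
}

impl From<SledAgentError> for HttpError {
    fn from(e: SledAgentError) -> Self {
        match e {
            // Delegate: each layer knows its own variants best.
            SledAgentError::Instance(inner) => inner.into(),
            SledAgentError::Other(msg) => HttpError::for_internal_error(msg),
        }
    }
}
```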

@gjcolombo added the Sled Agent and nexus labels on May 26, 2023
@gjcolombo (Contributor, Author):

#3334 is a first step toward addressing this issue.

@jordanhendricks added the Debugging label on Aug 11, 2023
gjcolombo added a commit that referenced this issue Dec 18, 2023
…ype from calls to stop/reboot (#4711)

Restore `instance_reboot` and `instance_stop` to their prior behavior:
if these routines try to contact sled agent and get back a server error,
mark the instance as unhealthy and move it to the Failed state.

Also use `#[source]` instead of message interpolation in
`InstanceStateChangeError::SledAgent`.

This restores the status quo ante from #4682 in anticipation of reaching
a better overall mechanism for dealing with failures to communicate
about instances with sled agents. See #3206, #3238, and #4226 for more
discussion.

Tests: new integration test; stood up a dev cluster, started an
instance, killed the zone with `zoneadm halt`, and verified that calls
to reboot/stop the instance eventually marked it as Failed (due to a
timeout attempting to contact the Propolis zone).

Fixes #4709.
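For context on the `#[source]` note in the commit message above, a minimal thiserror sketch of the difference; the enum names and the `std::io::Error` inner type are stand-ins, not the real Nexus `InstanceStateChangeError` definition:

```rust
use thiserror::Error;

// Before: interpolating the inner error's message flattens it into a string,
// so callers can no longer inspect the original error.
#[derive(Debug, Error)]
enum StateChangeErrorBefore {
    #[error("sled agent client error: {0}")]
    SledAgent(String),
}

// After: keeping the inner error as a #[source] preserves the error chain,
// so callers can still walk it via std::error::Error::source().
#[derive(Debug, Error)]
enum StateChangeErrorAfter {
    #[error("sled agent client error")]
    SledAgent(#[source] std::io::Error),
}
```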
@morlandi7 added this to the 8 milestone on Apr 4, 2024
@hawkw self-assigned this on Apr 4, 2024
@hawkw (Member) commented on Apr 9, 2024:

Looking closer at this, a lot of the work here has already been addressed in #3334 and #4682. It's not particularly clear what else is left to do for this issue on the `nexus`/`sled-agent` side. Perhaps @gjcolombo will have more input on this when he's back from leave.

We do need to get propolis#649 merged, though, which is sort of related...

@hawkw (Member) commented on Sep 26, 2024:

It looks like sled-agent is still not quite doing the right thing here. If a sled-agent operation on behalf of Nexus encounters a client error from Propolis, it will now correctly forward that error code to Nexus...but Nexus only moves instances to Failed in the face of sled-agent's 410 NO_SUCH_INSTANCE, and not if it sees the NoInstance error code from Propolis. That means we're not really handling the "propolis restarted" case correctly in Nexus, and instances whose Propolis has restarted still won't be moved to Failed. Agh.

Furthermore, a complete solution to this would require more than just having Nexus know about the Propolis NoInstance error (or having sled-agent eat it and return its 410 NO_SUCH_INSTANCE equivalent). Instead, sled-agent also needs to know that that error means the VMM is gone for good, and take steps to collect a zone bundle and then destroy the zone. So there's some more stuff we kinda forgot to do here... 😅
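A rough sketch of the shape that handling could take on the sled-agent side; every name here (`handle_propolis_gone`, `collect_zone_bundle`, `destroy_zone`, `publish_vmm_failed`) is a hypothetical placeholder, not an actual sled-agent function:

```rust
use uuid::Uuid;

// Hypothetical placeholders standing in for sled-agent internals.
async fn collect_zone_bundle(_zone_name: &str) { /* gather logs and core files for debugging */ }
async fn destroy_zone(_zone_name: &str) { /* halt and uninstall the defunct Propolis zone */ }
async fn publish_vmm_failed(_vmm_id: Uuid) { /* report the Failed VMM state to Nexus */ }

/// When Propolis reports that it has no knowledge of the instance (i.e. it
/// crashed and was restarted), the VMM is gone for good: capture debugging
/// state, tear the zone down, and only then tell Nexus the VMM has Failed.
async fn handle_propolis_gone(zone_name: &str, vmm_id: Uuid) {
    collect_zone_bundle(zone_name).await;
    destroy_zone(zone_name).await;
    publish_vmm_failed(vmm_id).await;
}
```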

@hawkw modified the milestones: 12, 11 on Oct 1, 2024
@hawkw closed this as completed in 3093818 on Oct 1, 2024
hawkw added a commit that referenced this issue Oct 2, 2024
At present, sled-agent's handling of the error codes used by Propolis to
indicate that it has crashed and been restarted is woefully incorrect.
In particular, there are two cases where such an error may be
encountered by a sled-agent:

1. When attempting to ask the VMM to change state (e.g. reboot or stop
   the instance)
2. When hitting Propolis' `instance-state-monitor` API endpoint to
   proactively check the VM's current state

Neither of these is handled correctly today.

In the first case, if a sled-agent operation on behalf of Nexus
encounters a client error from Propolis, it will forward that error code
to Nexus...but, _Nexus_ only moves instances to `Failed` in the face of
sled-agent's `410 NO_SUCH_INSTANCE`, and _not_ [if it sees the
`NoInstance` error code from Propolis][1], which means that the ghosts
left behind by crashed and restarted Propolii still won't be moved to
`Failed`. Agh. Furthermore, in that case, sled-agent itself will not
perform the necessary cleanup actions to deal with the now-defunct
Propolis zone (e.g. collecting a zone bundle and then destroying the
zone).

In the second case, where we hit the instance-state-monitor endpoint and
get back a `NoInstance` error, things are equally dire. The
`InstanceMonitorRunner` task, which is responsible for polling the state
monitor endpoint, will just bail out on receipt of any error from
Propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L253-L289

We would _expect_ this to drop the send-side of the channel it uses to
communicate with the `InstanceRunner`, closing the channel, and hitting
this select, which would correctly mark the VMM as failed and do what we
want, despite eating the actual error message from propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L388-L394

HOWEVER, as the comment points out, the `InstanceRunner` is _also_
holding its own clone of the channel sender, keeping us from ever
hitting that case:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L308-L309

AFAICT, this means we'll just totally give up on monitoring the instance
as soon as we hit any error here, which seems...quite bad. I *think*
that this actually means that when a Propolis process crashes
unexpectedly, we'll get an error when the TCP connection closes, bail
out, and then _never try hitting the instance monitor endpoint ever
again_[^1]. So, we won't notice that the Propolis is actually out to
lunch until we try to ask it to change state. **This Is Not Great!**

This commit fixes both of these cases, by making sled-agent actually
handle Propolis' error codes correctly. I've added a dependency on the
`propolis_api_types` crate, which is where the error code lives, and
some code in sled-agent to attempt to parse these codes when receiving
an error response from the Propolis client. Now, when we try to PUT a
new state to the instance, we'll look at the error code that comes back,
mark it as `Failed` if the error indicates that we should do so, and
publish the `Failed` VMM state to Nexus before tearing down the zone.
The `InstanceMonitorTask` now behaves similarly, and I've changed it to
retry all other errors with a backoff, rather than just bailing out
immediately on receipt of the first error.

I've manually tested this on `london` as discussed here:

#6726 (comment)

Unfortunately, it's hard to do any kind of automated testing of this
with the current test situation, as none of this code is exercised by
the simulated sled-agent.

Fixes #3209
Fixes #3238

[^1]: Although I've not actually verified that this is what happens.

[1]: https://github.com/oxidecomputer/propolis/blob/516dabe473cecdc3baea1db98a80441968c144cf/crates/propolis-api-types/src/lib.rs#L434-L441
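A hedged sketch of the classify-then-act shape described in the commit message above; the error-code string, the helper names, and the backoff policy are illustrative placeholders, not the actual `propolis_api_types` or sled-agent definitions:

```rust
use std::time::Duration;

/// What the caller should do about a failed Propolis request.
enum ErrorAction {
    /// The VMM is gone (e.g. Propolis crashed and restarted): publish Failed
    /// and tear down the zone rather than retrying forever.
    MarkFailed,
    /// Anything else is treated as transient and retried with backoff.
    Retry,
}

/// Hypothetical classifier: the real code parses structured error codes from
/// propolis_api_types out of the error response, rather than matching strings.
fn classify_propolis_error(error_code: Option<&str>) -> ErrorAction {
    match error_code {
        Some("NoInstance") => ErrorAction::MarkFailed,
        _ => ErrorAction::Retry,
    }
}

/// Capped exponential backoff for the monitor task's retries, so a transient
/// error no longer ends instance monitoring permanently.
async fn retry_delay(attempt: u32) {
    let delay = Duration::from_millis(250).saturating_mul(1u32 << attempt.min(6));
    tokio::time::sleep(delay).await;
}
```

The point of the split is that both the state-change path and the monitor task can reuse the same classification while acting on MarkFailed in their own way (publishing the Failed state versus tearing down the zone).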