Instances no longer transition to failed state when propolis zone has crashed or is gone #4709

Closed
askfongjojo opened this issue Dec 16, 2023 · 1 comment · Fixed by #4711

askfongjojo commented Dec 16, 2023

The behavior change appears to be related to #4682. Prior to that change, instances were moved to the Failed state by sled-agent, or when a user attempted to reboot the instance.

The ability to handle a propolis zone that is gone may not be by design (though we rely on it quite a bit now during planned or unplanned sled reboots). The ability to handle a propolis zone that crashed due to an unhandled hypervisor or OS exception is probably something the system should have.

@askfongjojo askfongjojo added this to the 6 milestone Dec 16, 2023
Contributor

gjcolombo commented Dec 16, 2023

Agreed that this is an unforeseen consequence of #4682. Previously, `Nexus::instance_reboot` could assume that `instance_request_state` would call `handle_instance_put_result`, which would handle these sorts of errors by transitioning instances to Failed. After #4682, callers of `instance_request_state` have to decide how to handle state-change failures themselves, and `instance_reboot` does that by simply using the `?` operator to propagate errors to the caller.

Using a pattern similar to `instance_clear_migration_ids` (check whether the state-change request returned an error; if it's a `SledAgentInstancePutError` and `instance_unhealthy()` returns true, call `mark_instance_failed` before returning the error) should restore the old behavior.
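The pattern described above can be sketched as follows. The names `SledAgentInstancePutError`, `instance_unhealthy`, and `mark_instance_failed` come from this discussion, but their bodies here are illustrative stubs, not Omicron's actual implementation:

```rust
use std::fmt;

// Hypothetical stand-in for the error sled agent calls can return.
#[derive(Debug)]
struct SledAgentInstancePutError(String);

impl fmt::Display for SledAgentInstancePutError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "sled agent error: {}", self.0)
    }
}

impl SledAgentInstancePutError {
    // Stub: the real routine inspects the error to decide whether the
    // instance should be considered unhealthy.
    fn instance_unhealthy(&self) -> bool {
        true
    }
}

#[derive(Debug, PartialEq)]
enum InstanceState {
    Running,
    Failed,
}

struct Instance {
    state: InstanceState,
}

// Stub for Nexus's `mark_instance_failed`.
fn mark_instance_failed(instance: &mut Instance) {
    instance.state = InstanceState::Failed;
}

// The pattern: on a state-change error, mark the instance Failed when the
// error indicates it is unhealthy, then still propagate the error.
fn instance_reboot(
    instance: &mut Instance,
    result: Result<(), SledAgentInstancePutError>,
) -> Result<(), SledAgentInstancePutError> {
    if let Err(e) = result {
        if e.instance_unhealthy() {
            mark_instance_failed(instance);
        }
        return Err(e);
    }
    Ok(())
}

fn main() {
    let mut instance = Instance { state: InstanceState::Running };
    let err = Err(SledAgentInstancePutError("zone gone".into()));
    assert!(instance_reboot(&mut instance, err).is_err());
    assert_eq!(instance.state, InstanceState::Failed);
}
```

This contrasts with using `?` directly, which propagates the error but leaves the instance's recorded state untouched.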

> The ability to handle a propolis zone that is gone may not be by design (though we rely on it quite a bit now during planned or unplanned sled reboots). The ability to handle a propolis zone that crashed due to an unhandled hypervisor or OS exception is probably something the system should have.

This is tracked by #3206.

@gjcolombo gjcolombo self-assigned this Dec 16, 2023
@morlandi7 morlandi7 added the bug Something that isn't working. label Dec 18, 2023
gjcolombo added a commit that referenced this issue Dec 18, 2023
…ype from calls to stop/reboot (#4711)

Restore `instance_reboot` and `instance_stop` to their prior behavior:
if these routines try to contact sled agent and get back a server error,
mark the instance as unhealthy and move it to the Failed state.

Also use `#[source]` instead of message interpolation in
`InstanceStateChangeError::SledAgent`.

This restores the status quo ante from #4682 in anticipation of reaching
a better overall mechanism for dealing with failures to communicate
about instances with sled agents. See #3206, #3238, and #4226 for more
discussion.

Tests: new integration test; stood up a dev cluster, started an
instance, killed the zone with `zoneadm halt`, and verified that calls
to reboot/stop the instance eventually marked it as Failed (due to a
timeout attempting to contact the Propolis zone).

Fixes #4709.
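The `#[source]` change mentioned in the commit refers to thiserror's attribute for exposing an inner error through `Error::source()` rather than interpolating its text into the display message. A minimal std-only sketch of what that attribute generates, using hypothetical error types rather than Omicron's actual ones:

```rust
use std::error::Error;
use std::fmt;

// Hypothetical inner error, standing in for the sled agent error type.
#[derive(Debug)]
struct SledAgentError(String);

impl fmt::Display for SledAgentError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "sled agent error: {}", self.0)
    }
}

impl Error for SledAgentError {}

// Wrapper that keeps the inner error reachable as a source, which is what
// thiserror's `#[source]` attribute generates, instead of duplicating the
// inner error's text in the outer message.
#[derive(Debug)]
struct InstanceStateChangeError(SledAgentError);

impl fmt::Display for InstanceStateChangeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The outer message no longer interpolates the inner error...
        write!(f, "instance state change failed")
    }
}

impl Error for InstanceStateChangeError {
    // ...because callers can walk the chain via `source()` instead.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.0)
    }
}

fn main() {
    let err = InstanceStateChangeError(SledAgentError("zone gone".into()));
    assert_eq!(err.to_string(), "instance state change failed");
    assert_eq!(
        err.source().unwrap().to_string(),
        "sled agent error: zone gone"
    );
}
```

Keeping the inner error in `source()` lets error-reporting code print the full chain once, instead of the same text appearing twice when wrappers interpolate their causes.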