Instances no longer transition to failed state when propolis zone has crashed or is gone #4709

Closed
askfongjojo opened this issue Dec 16, 2023 · 1 comment · Fixed by #4711

askfongjojo commented Dec 16, 2023

The behavior change appears to be related to #4682. Prior to that change, instances were moved to the Failed state by sled-agent, or when a user attempted to reboot the instance.

The ability to handle a propolis zone that is gone may not be by design (though we rely on it quite a bit now during planned or unplanned sled reboots). The ability to handle a propolis zone that crashed due to an unhandled hypervisor or OS exception is probably something the system should have.

@askfongjojo askfongjojo added this to the 6 milestone Dec 16, 2023
Contributor

gjcolombo commented Dec 16, 2023

Agreed that this is an unforeseen consequence of #4682. Previously, `Nexus::instance_reboot` could assume that `instance_request_state` would call `handle_instance_put_result`, which would handle these sorts of errors by transitioning instances to Failed. After #4682, callers of `instance_request_state` have to decide how to handle state-change failures themselves, and `instance_reboot` does that by simply using the `?` operator to propagate errors to the caller.

Using a pattern similar to `instance_clear_migration_ids` (check whether the state-change request returned an error; if it's a `SledAgentInstancePutError` and `instance_unhealthy()` returns true, call `mark_instance_failed` before returning the error) should restore the old behavior.
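The pattern described above can be sketched as follows. The names `SledAgentInstancePutError`, `instance_unhealthy`, and `mark_instance_failed` come from this discussion, but their bodies here are illustrative stubs, not Omicron's actual implementation:

```rust
use std::fmt;

// Hypothetical stand-in for the error sled agent calls can return.
#[derive(Debug)]
struct SledAgentInstancePutError(String);

impl fmt::Display for SledAgentInstancePutError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "sled agent error: {}", self.0)
    }
}

impl SledAgentInstancePutError {
    // Stub: the real routine inspects the error to decide whether the
    // instance should be considered unhealthy.
    fn instance_unhealthy(&self) -> bool {
        true
    }
}

#[derive(Debug, PartialEq)]
enum InstanceState {
    Running,
    Failed,
}

struct Instance {
    state: InstanceState,
}

// Stub for Nexus's `mark_instance_failed`.
fn mark_instance_failed(instance: &mut Instance) {
    instance.state = InstanceState::Failed;
}

// The pattern: on a state-change error, mark the instance Failed when the
// error indicates it is unhealthy, then still propagate the error.
fn instance_reboot(
    instance: &mut Instance,
    result: Result<(), SledAgentInstancePutError>,
) -> Result<(), SledAgentInstancePutError> {
    if let Err(e) = result {
        if e.instance_unhealthy() {
            mark_instance_failed(instance);
        }
        return Err(e);
    }
    Ok(())
}

fn main() {
    let mut instance = Instance { state: InstanceState::Running };
    let err = Err(SledAgentInstancePutError("zone gone".into()));
    assert!(instance_reboot(&mut instance, err).is_err());
    assert_eq!(instance.state, InstanceState::Failed);
}
```

This contrasts with using `?` directly, which propagates the error but leaves the instance's recorded state untouched.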

> The ability to handle a propolis zone that is gone may not be by design (though we rely on it quite a bit now during planned or unplanned sled reboots). The ability to handle a propolis zone that crashed due to an unhandled hypervisor or OS exception is probably something the system should have.

This is tracked by #3206.

@gjcolombo gjcolombo self-assigned this Dec 16, 2023
@morlandi7 morlandi7 added the bug Something that isn't working. label Dec 18, 2023
gjcolombo added a commit that referenced this issue Dec 18, 2023
…ype from calls to stop/reboot (#4711)

Restore `instance_reboot` and `instance_stop` to their prior behavior:
if these routines try to contact sled agent and get back a server error,
mark the instance as unhealthy and move it to the Failed state.

Also use `#[source]` instead of message interpolation in
`InstanceStateChangeError::SledAgent`.

This restores the status quo ante from #4682 in anticipation of reaching
a better overall mechanism for dealing with failures to communicate
about instances with sled agents. See #3206, #3238, and #4226 for more
discussion.

Tests: new integration test; stood up a dev cluster, started an
instance, killed the zone with `zoneadm halt`, and verified that calls
to reboot/stop the instance eventually marked it as Failed (due to a
timeout attempting to contact the Propolis zone).

Fixes #4709.
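The `#[source]` change mentioned in the commit refers to thiserror's attribute for exposing an inner error through `Error::source()` rather than interpolating its text into the display message. A minimal std-only sketch of what that attribute generates, using hypothetical error types rather than Omicron's actual ones:

```rust
use std::error::Error;
use std::fmt;

// Hypothetical inner error, standing in for the sled agent error type.
#[derive(Debug)]
struct SledAgentError(String);

impl fmt::Display for SledAgentError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "sled agent error: {}", self.0)
    }
}

impl Error for SledAgentError {}

// Wrapper that keeps the inner error reachable as a source, which is what
// thiserror's `#[source]` attribute generates, instead of duplicating the
// inner error's text in the outer message.
#[derive(Debug)]
struct InstanceStateChangeError(SledAgentError);

impl fmt::Display for InstanceStateChangeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The outer message no longer interpolates the inner error...
        write!(f, "instance state change failed")
    }
}

impl Error for InstanceStateChangeError {
    // ...because callers can walk the chain via `source()` instead.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.0)
    }
}

fn main() {
    let err = InstanceStateChangeError(SledAgentError("zone gone".into()));
    assert_eq!(err.to_string(), "instance state change failed");
    assert_eq!(
        err.source().unwrap().to_string(),
        "sled agent error: zone gone"
    );
}
```

Keeping the inner error in `source()` lets error-reporting code print the full chain once, instead of the same text appearing twice when wrappers interpolate their causes.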