Let start saga handle unwinding from sled agent instance PUT errors #4682
Conversation
The tests still pass (the provisioning problems in this saga have to do with side effects from a failure in the `ensure_running` node), but these are still good checks to have.
Add a start saga integration test that shows provisioning counters being leaked when an instance fails to start in the `ensure_running` step. This requires some simulated sled agent changes to (a) provide a means to inject this sort of failure and (b) keep an "unregister instance" operation from sending updates to Nexus through the asynchronous notification channel (which happens to clean up the provisioning counters on its own in a way that makes the test pass).
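For illustration, a minimal sketch of what such an injection hook could look like in a simulated sled agent; the `SimSledAgent` type and the `inject_instance_put_failure` and `instance_put` methods here are hypothetical stand-ins, not the actual omicron API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Illustrative stand-in for the simulated sled agent; the real type and
/// its methods live in omicron and differ from this sketch.
pub struct SimSledAgent {
    /// When armed, the next instance PUT returns an injected error.
    fail_next_instance_put: AtomicBool,
}

impl SimSledAgent {
    /// Hypothetical test hook: arm a one-shot failure for the next
    /// instance PUT.
    pub fn inject_instance_put_failure(&self) {
        self.fail_next_instance_put.store(true, Ordering::SeqCst);
    }

    /// Simulated instance PUT; fails once if a failure has been injected.
    pub fn instance_put(&self) -> Result<(), String> {
        // swap() consumes the armed flag so only one PUT fails.
        if self.fail_next_instance_put.swap(false, Ordering::SeqCst) {
            return Err("injected instance PUT failure".to_string());
        }
        // ...the normal simulated state transition would go here...
        Ok(())
    }
}
```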
Remove `Nexus::handle_instance_put_result` in favor of making individual callers deal with the results of trying to change an instance's state by issuing a PUT to a sled agent. This allows the instance start and migrate sagas to handle failure to start a VMM by unwinding the saga instead of unconditionally marking the instance as Failed.
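A rough sketch of the resulting shape of a start saga node, assuming a simplified error type; the `sis_ensure_running` name and the `InstanceStateChangeError` variants below are illustrative, not the exact omicron definitions:

```rust
/// Illustrative error type for a failed instance-state PUT; the real
/// enum in omicron carries richer variants.
#[derive(Debug)]
pub enum InstanceStateChangeError {
    SledAgent(String),
    Other(String),
}

/// Sketch of the start saga's "ensure running" node handling its own PUT
/// error: returning Err makes the saga executor unwind, running the undo
/// actions of earlier nodes (which release the provisioning counters they
/// incremented) instead of unconditionally marking the instance Failed.
pub fn sis_ensure_running(
    put_result: Result<(), InstanceStateChangeError>,
) -> Result<(), String> {
    match put_result {
        Ok(()) => Ok(()),
        Err(e) => Err(format!("failed to start VMM: {e:?}")),
    }
}
```

Returning `Err` from the forward action is what triggers the unwind, so the undo path is the only cleanup mechanism the node needs; no shared result handler has to guess at the right side effects.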
Dev cluster smoke testing came out clean, so I think this is ready for review.
Thanks for fixing this so quickly!
```rust
info!(osagactx.log(),
      "start saga: instance already unregistered from sled";
      "instance_id" => %instance_id);

Ok(())
```
Oof, yeah, I get this, but it feels a little sketchy to "do nothing" on this pathway. The realm of `!inner.instance_unhealth()` isn't super narrow, right?

We don't need to fix this now; I think this is at parity with `main`, but it jumps out to me as a little odd.
…ype from calls to stop/reboot (#4711)

Restore `instance_reboot` and `instance_stop` to their prior behavior: if these routines try to contact sled agent and get back a server error, mark the instance as unhealthy and move it to the Failed state. Also use `#[source]` instead of message interpolation in `InstanceStateChangeError::SledAgent`.

This restores the status quo ante from #4682 in anticipation of reaching a better overall mechanism for dealing with failures to communicate about instances with sled agents. See #3206, #3238, and #4226 for more discussion.

Tests: new integration test; stood up a dev cluster, started an instance, killed the zone with `zoneadm halt`, and verified that calls to reboot/stop the instance eventually marked it as Failed (due to a timeout attempting to contact the Propolis zone).

Fixes #4709.
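As a minimal sketch of the `#[source]` change (using `thiserror`, with a made-up inner error type; not the exact omicron definition):

```rust
use thiserror::Error;

/// Illustrative stand-in for the sled agent client's error type.
#[derive(Debug, Error)]
#[error("sled agent client error")]
pub struct SledAgentClientError;

#[derive(Debug, Error)]
pub enum InstanceStateChangeError {
    // With #[source], the inner error stays on the error chain (reachable
    // via std::error::Error::source()) rather than being flattened into
    // the display message by interpolation (e.g. "{0}").
    #[error("failed to contact sled agent")]
    SledAgent(#[source] SledAgentClientError),
}
```

Interpolating the cause into the message duplicates it when callers print the full chain; `#[source]` keeps the chain walkable without that duplication.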
Remove `Nexus::handle_instance_put_result`. In its place, make Nexus instance routines that invoke sled agent instance PUT endpoints decide how to handle their own errors, and be more explicit about the specific kinds of errors these operations can produce. Use this flexibility to allow the instance start and migrate sagas to handle failure to start a new instance (or to start a migration target) by unwinding instead of having to reckon with callee-defined side effects of failing a call to sled agent. Other callers continue to do what `handle_instance_put_result` did.

Improve some tests.

Tests: cargo tests including the new start saga variation; smoke tested instance start/stop/reboot on a dev cluster.
Fixes #4662.