
Let start saga handle unwinding from sled agent instance PUT errors #4682

Merged
8 commits merged into main from gjcolombo/provisioning-underflow on Dec 13, 2023

Conversation

gjcolombo
Contributor

Remove `Nexus::handle_instance_put_result`. In its place, make the Nexus instance routines that invoke sled agent instance PUT endpoints decide how to handle their own errors, and be more explicit about the specific kinds of errors these operations can produce. Use this flexibility to allow the instance start and migrate sagas to handle a failure to start a new instance (or to start a migration target) by unwinding, instead of having to reckon with callee-defined side effects of a failed call to sled agent. Other callers continue to do what `handle_instance_put_result` did.
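To make the shape of this concrete, here is a minimal sketch (illustrative types only, not the actual Omicron definitions) of a caller-visible error whose `SledAgent` variant lets each call site decide for itself how to react, along with the kind of decision the start saga's "ensure running" step can now make:

```rust
// Illustrative sketch only -- the real InstanceStateChangeError wraps the
// progenitor-generated sled agent client error; a plain String stands in here.
use thiserror::Error;

/// Error from Nexus routines that PUT a new instance state to sled agent.
#[derive(Debug, Error)]
pub enum InstanceStateChangeError {
    /// The call to sled agent itself failed. (As merged here the message is
    /// interpolated; #4711 later switches this variant to `#[source]`.)
    #[error("sled agent error: {0}")]
    SledAgent(String),

    /// Some other Nexus-side failure (lookup, authz, serialization, ...).
    #[error("{0}")]
    Other(String),
}

/// The per-caller decision: in the start saga's "ensure running" node, a sled
/// agent failure just fails the node, so the saga unwinds (unregistering the
/// instance and releasing provisioning counters) instead of unconditionally
/// marking the instance Failed.
fn ensure_running_outcome(
    result: Result<(), InstanceStateChangeError>,
) -> Result<(), String> {
    match result {
        Ok(()) => Ok(()),
        Err(e @ InstanceStateChangeError::SledAgent(_)) => {
            // Returning an error from the node triggers the saga's undo actions.
            Err(format!("failed to start VMM: {e}"))
        }
        Err(other) => Err(other.to_string()),
    }
}
```

Other call sites (for example, instance stop and reboot) can match the same variant and instead mark the instance unhealthy, which is what #4711 below restores for those paths.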

Improve some tests:

  • Add a test variation to reproduce #4662 (resource accounting error after a failed instance start). To support this, teach the simulated sled agent to let callers inject failure into calls that ensure an instance's state; a sketch of the injection pattern follows this list.
  • Fix up a bit of simulated sled agent logic that was unfaithful to the real sled agent's behavior and that caused the new test to pass when it should have failed.
  • Make sure that start saga tests that unwind explicitly verify that unwinding the saga doesn't leak provisioning counters.
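The injection hook itself can be small. Here is a self-contained sketch of the pattern (names are illustrative, not the simulated sled agent's actual API): tests stash an error that the next instance "ensure" call returns instead of succeeding.

```rust
use std::sync::{Arc, Mutex};

/// Sketch of a fault-injection hook like the one added to the simulated sled
/// agent (illustrative names only). Tests stash an error that the next
/// instance "ensure" call returns instead of succeeding.
#[derive(Clone, Default)]
pub struct EnsureFaultInjector {
    next_error: Arc<Mutex<Option<String>>>,
}

impl EnsureFaultInjector {
    /// Arrange for the next ensure call to fail with `msg`.
    pub fn inject(&self, msg: &str) {
        *self.next_error.lock().unwrap() = Some(msg.to_string());
    }

    /// Called from the simulated ensure path; consumes any pending error.
    pub fn take(&self) -> Option<String> {
        self.next_error.lock().unwrap().take()
    }
}

fn main() {
    let injector = EnsureFaultInjector::default();
    injector.inject("injected ensure failure");

    // The simulated ensure handler checks the hook before doing any work,
    // so the test sees the start attempt fail exactly where it wants it to.
    match injector.take() {
        Some(msg) => eprintln!("instance ensure failed: {msg}"),
        None => println!("instance ensure succeeded"),
    }
}
```

The test then starts the instance, lets the injected failure unwind the start saga, and asserts that the provisioning counters return to their pre-start values.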

Tests: Cargo tests including the new start saga variation; smoke tested instance start/stop/reboot on a dev cluster.

Fixes #4662.

The tests still pass (the provisioning problems in this saga have to do with
side effects from a failure in the ensure_running node) but these are still good
checks to have.
Add a start saga integration test that shows provisioning counters being leaked
when an instance fails to start in the ensure_running step. This requires some
simulated sled agent changes to (a) provide a means to inject this sort of
failure and (b) keep an "unregister instance" operation from sending instance state
updates through the asynchronous notification channel (which happens to clean
up the provisioning counters on its own in a way that makes the test pass).
Remove `Nexus::handle_instance_put_result` in favor of making individual callers
deal with the results of trying to change an instance's state by issuing a PUT
to a sled agent. This allows the instance start and migrate sagas to handle
failure to start a VMM by unwinding the saga instead of unconditionally marking
an instance as Failed.
@gjcolombo gjcolombo changed the title Let start saga handle unwinding from sled agent error codes Let start saga handle unwinding from sled agent instance PUT errors Dec 12, 2023
@gjcolombo
Contributor Author

Dev cluster smoke testing came out clean, so I think this is ready for review.

@gjcolombo gjcolombo marked this pull request as ready for review December 12, 2023 22:30
@smklein smklein self-requested a review December 12, 2023 23:03
Collaborator

@smklein smklein left a comment


Thanks for fixing this so quickly!

Resolved review comment threads on nexus/src/app/instance.rs and nexus/src/app/sagas/instance_start.rs.
Comment on lines +639 to +643
info!(osagactx.log(),
"start saga: instance already unregistered from sled";
"instance_id" => %instance_id);

Ok(())
Collaborator


Oof, yeah, I get this, but it feels a little sketchy to "do nothing" on this pathway. The realm of `!inner.instance_unhealth()` isn't super narrow, right?

We don't need to fix this now (I think this is at parity with main), but it jumps out to me as a little odd.

Contributor Author


One of the things I like about this change is that it brings out the importance of #3238 and #4226 :) You're right that we could do a lot more to reason carefully about what the different possible error codes mean here.

I added a comment to this match in 5608481.
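For context, the decision being discussed has roughly the following shape (a sketch with illustrative names, not the actual saga code): the step treats "instance already gone" responses from sled agent as a successful no-op and only propagates other errors.

```rust
/// Illustrative error type for the sketch below; the real code matches on the
/// sled agent client error returned when unregistering the instance.
#[derive(Debug)]
enum UnregisterError {
    /// Sled agent reports it has no record of the instance.
    InstanceNotFound,
    /// Anything else.
    Other(String),
}

/// Sketch of the decision discussed above: if the instance was never
/// registered (or was already cleaned up), there is nothing left to do, so
/// the step logs and returns Ok. Other errors still fail the step.
/// The real match is broader than this single variant, which is what the
/// review comment above calls out.
fn handle_unregister_result(result: Result<(), UnregisterError>) -> Result<(), String> {
    match result {
        Ok(()) => Ok(()),
        Err(UnregisterError::InstanceNotFound) => {
            // This is where the "instance already unregistered from sled"
            // log line in the quoted snippet fires.
            Ok(())
        }
        Err(other) => Err(format!("failed to unregister instance: {other:?}")),
    }
}
```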

@gjcolombo gjcolombo enabled auto-merge (squash) December 13, 2023 00:59
@gjcolombo gjcolombo merged commit 180616e into main Dec 13, 2023
20 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/provisioning-underflow branch December 13, 2023 02:45
gjcolombo added a commit that referenced this pull request Dec 18, 2023
…ype from calls to stop/reboot (#4711)

Restore `instance_reboot` and `instance_stop` to their prior behavior:
if these routines try to contact sled agent and get back a server error,
mark the instance as unhealthy and move it to the Failed state.

Also use `#[source]` instead of message interpolation in
`InstanceStateChangeError::SledAgent`.

This restores the status quo ante from #4682 in anticipation of reaching
a better overall mechanism for dealing with failures to communicate
about instances with sled agents. See #3206, #3238, and #4226 for more
discussion.

Tests: new integration test; stood up a dev cluster, started an
instance, killed the zone with `zoneadm halt`, and verified that calls
to reboot/stop the instance eventually marked it as Failed (due to a
timeout attempting to contact the Propolis zone).

Fixes #4709.
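The `#[source]` change mentioned above is a standard thiserror pattern; a minimal before/after illustration with simplified stand-in types (not the actual Omicron definitions):

```rust
use thiserror::Error;

/// Stand-in for the sled agent client error the real variant wraps.
#[derive(Debug, Error)]
#[error("sled agent request failed: {0}")]
struct SledAgentClientError(String);

/// Before: interpolating the inner error's message into the variant's own
/// message flattens it into a string and drops the error chain.
#[derive(Debug, Error)]
enum BeforeStateChangeError {
    #[error("sled agent error: {0}")]
    SledAgent(SledAgentClientError),
}

/// After: `#[source]` keeps the inner error attached, so callers and loggers
/// can walk the chain via `std::error::Error::source()`.
#[derive(Debug, Error)]
enum AfterStateChangeError {
    #[error("sled agent client error")]
    SledAgent(#[source] SledAgentClientError),
}
```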