Let start saga handle unwinding from sled agent instance PUT errors (#4682)

Remove `Nexus::handle_instance_put_result`. In its place, make Nexus
instance routines that invoke sled agent instance PUT endpoints decide
how to handle their own errors, and be more explicit about the specific
kinds of errors these operations can produce. Use this flexibility to
allow the instance start and migrate sagas to handle failure to start a new
instance (or to start a migration target) by unwinding instead of having
to reckon with callee-defined side effects of failing a call to sled
agent. Other callers continue to do what `handle_instance_put_result`
did.
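
As a rough illustration of the shape of this change (not the actual Nexus
code), the sketch below shows two callers of the same fallible call, each
applying its own error policy. All names here (`PutError`, `put_instance`,
`start_saga_step`, `other_caller`) are hypothetical, and the "cleanup" in the
non-saga caller is only a placeholder for whatever
`handle_instance_put_result` used to do.

```rust
/// Hypothetical error kinds a sled agent instance PUT might produce.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PutError {
    /// The request may never have reached the sled agent.
    Communication(String),
    /// The sled agent received the request and refused it.
    Rejected(String),
}

/// Stand-in for a call to a sled agent instance PUT endpoint. The caller
/// passes the error to simulate, if any.
pub fn put_instance(fail_with: Option<PutError>) -> Result<(), PutError> {
    match fail_with {
        Some(e) => Err(e),
        None => Ok(()),
    }
}

/// Saga-style caller: performs no side effects of its own on failure and
/// simply returns the error, leaving cleanup to the saga's unwind path.
pub fn start_saga_step(
    fail_with: Option<PutError>,
    _log: &mut Vec<String>,
) -> Result<(), PutError> {
    put_instance(fail_with)
}

/// Non-saga caller: keeps an explicit, caller-chosen reaction to failure
/// (recorded here as a log entry) before returning the error.
pub fn other_caller(
    fail_with: Option<PutError>,
    log: &mut Vec<String>,
) -> Result<(), PutError> {
    match put_instance(fail_with) {
        Ok(()) => Ok(()),
        Err(e) => {
            // Placeholder for the caller's own error handling.
            log.push(format!("cleanup after {e:?}"));
            Err(e)
        }
    }
}
```

The point of the split is that the saga step stays free of hidden side
effects, so its undo actions are the only thing that runs when it fails.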

Improve some tests:

- Add a test variation to reproduce #4662. To support this, teach the
simulated sled agent to let callers inject failure into calls to ensure
an instance's state.
- Fix up a bit of simulated sled agent logic that was unfaithful to the
real sled agent's behavior and that caused the new test to pass when it
should have failed.
- Make sure that start saga tests that unwind explicitly verify that
unwinding the saga doesn't leak provisioning counters.
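
The test changes above can be pictured with a small sketch, assuming a much
simplified model: a simulated agent that lets a test inject one failure into
its "ensure" call, and a start routine that must restore provisioning
counters when that call fails. `SimSledAgent`, `inject_ensure_failure`,
`ProvisioningCounters`, and `start_instance` are hypothetical names for
illustration, not the simulated sled agent's real interface.

```rust
/// Hypothetical simulated sled agent with one-shot failure injection.
#[derive(Default)]
pub struct SimSledAgent {
    fail_next_ensure: Option<String>,
    pub running: Vec<u32>,
}

impl SimSledAgent {
    /// Make the next `ensure_instance` call fail with `msg`.
    pub fn inject_ensure_failure(&mut self, msg: &str) {
        self.fail_next_ensure = Some(msg.to_string());
    }

    /// Ensure the instance is running, unless a failure was injected.
    pub fn ensure_instance(&mut self, id: u32) -> Result<(), String> {
        if let Some(msg) = self.fail_next_ensure.take() {
            return Err(msg);
        }
        if !self.running.contains(&id) {
            self.running.push(id);
        }
        Ok(())
    }
}

/// Hypothetical provisioning counters charged while an instance runs.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ProvisioningCounters {
    pub cpus: u64,
    pub ram_gib: u64,
}

/// Charge the counters, then ask the agent to start the instance. On
/// failure, undo the charge so that nothing leaks.
pub fn start_instance(
    agent: &mut SimSledAgent,
    counters: &mut ProvisioningCounters,
    id: u32,
    cpus: u64,
    ram_gib: u64,
) -> Result<(), String> {
    counters.cpus += cpus;
    counters.ram_gib += ram_gib;
    if let Err(e) = agent.ensure_instance(id) {
        // Unwind: release what was charged above.
        counters.cpus -= cpus;
        counters.ram_gib -= ram_gib;
        return Err(e);
    }
    Ok(())
}
```

A test in this style injects the failure, calls `start_instance`, and then
asserts that the counters equal their initial value.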

Tests: Cargo tests including the new start saga variation; smoke tested
instance start/stop/reboot on a dev cluster.

Fixes #4662.
gjcolombo authored Dec 13, 2023
1 parent 2d95aac commit 180616e
Showing 7 changed files with 569 additions and 277 deletions.