
Let start saga handle unwinding from sled agent instance PUT errors #4682

Merged
8 commits merged into main from gjcolombo/provisioning-underflow on Dec 13, 2023

Conversation

gjcolombo
Contributor

Remove `Nexus::handle_instance_put_result`. In its place, make the Nexus instance routines that invoke sled agent instance PUT endpoints decide how to handle their own errors, and be more explicit about the specific kinds of errors these operations can produce. Use this flexibility to allow the instance start and migrate sagas to handle a failure to start a new instance (or to start a migration target) by unwinding, instead of having to reckon with callee-defined side effects of a failed call to sled agent. Other callers continue to do what `handle_instance_put_result` did.
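To make the shape of this concrete, here is a minimal sketch (illustrative types only, not the actual Omicron definitions) of a caller-visible error whose `SledAgent` variant lets each call site decide for itself how to react, along with the kind of decision the start saga's "ensure running" step can now make:

```rust
// Illustrative sketch only -- the real InstanceStateChangeError wraps the
// progenitor-generated sled agent client error; a plain String stands in here.
use thiserror::Error;

/// Error from Nexus routines that PUT a new instance state to sled agent.
#[derive(Debug, Error)]
pub enum InstanceStateChangeError {
    /// The call to sled agent itself failed. (As merged here the message is
    /// interpolated; #4711 later switches this variant to `#[source]`.)
    #[error("sled agent error: {0}")]
    SledAgent(String),

    /// Some other Nexus-side failure (lookup, authz, serialization, ...).
    #[error("{0}")]
    Other(String),
}

/// The per-caller decision: in the start saga's "ensure running" node, a sled
/// agent failure just fails the node, so the saga unwinds (unregistering the
/// instance and releasing provisioning counters) instead of unconditionally
/// marking the instance Failed.
fn ensure_running_outcome(
    result: Result<(), InstanceStateChangeError>,
) -> Result<(), String> {
    match result {
        Ok(()) => Ok(()),
        Err(e @ InstanceStateChangeError::SledAgent(_)) => {
            // Returning an error from the node triggers the saga's undo actions.
            Err(format!("failed to start VMM: {e}"))
        }
        Err(other) => Err(other.to_string()),
    }
}
```

Other call sites (for example, instance stop and reboot) can match the same variant and instead mark the instance unhealthy, which is what #4711 below restores for those paths.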

Improve some tests:

  • Add a test variation to reproduce #4662 (resource accounting error after a failed instance start). To support this, teach the simulated sled agent to let callers inject failure into calls that ensure an instance's state; a sketch of the injection pattern follows this list.
  • Fix up a bit of simulated sled agent logic that was unfaithful to the real sled agent's behavior and that caused the new test to pass when it should have failed.
  • Make sure that start saga tests that unwind explicitly verify that unwinding the saga doesn't leak provisioning counters.
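The injection hook itself can be small. Here is a self-contained sketch of the pattern (names are illustrative, not the simulated sled agent's actual API): tests stash an error that the next instance "ensure" call returns instead of succeeding.

```rust
use std::sync::{Arc, Mutex};

/// Sketch of a fault-injection hook like the one added to the simulated sled
/// agent (illustrative names only). Tests stash an error that the next
/// instance "ensure" call returns instead of succeeding.
#[derive(Clone, Default)]
pub struct EnsureFaultInjector {
    next_error: Arc<Mutex<Option<String>>>,
}

impl EnsureFaultInjector {
    /// Arrange for the next ensure call to fail with `msg`.
    pub fn inject(&self, msg: &str) {
        *self.next_error.lock().unwrap() = Some(msg.to_string());
    }

    /// Called from the simulated ensure path; consumes any pending error.
    pub fn take(&self) -> Option<String> {
        self.next_error.lock().unwrap().take()
    }
}

fn main() {
    let injector = EnsureFaultInjector::default();
    injector.inject("injected ensure failure");

    // The simulated ensure handler checks the hook before doing any work,
    // so the test sees the start attempt fail exactly where it wants it to.
    match injector.take() {
        Some(msg) => eprintln!("instance ensure failed: {msg}"),
        None => println!("instance ensure succeeded"),
    }
}
```

The test then starts the instance, lets the injected failure unwind the start saga, and asserts that the provisioning counters return to their pre-start values.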

Tests: Cargo tests including the new start saga variation; smoke tested instance start/stop/reboot on a dev cluster.

Fixes #4662.

The tests still pass (the provisioning problems in this saga have to do with
side effects from a failure in the ensure_running node) but these are still good
checks to have.
Add a start saga integration test that shows provisioning counters being leaked
when an instance fails to start in the ensure_running step. This requires some
simulated sled agent changes to (a) provide a means to inject this sort of
failure and (b) keep an "unregister instance" operation from sending instance state
updates through the asynchronous notification channel (which happens to clean
up the provisioning counters on its own in a way that makes the test pass).
Remove `Nexus::handle_instance_put_result` in favor of making individual callers
deal with the results of trying to change an instance's state by issuing a PUT
to a sled agent. This allows the instance start and migrate sagas to handle
failure to start a VMM by unwinding the saga instead of unconditionally marking
an instance as Failed.
@gjcolombo gjcolombo changed the title Let start saga handle unwinding from sled agent error codes Let start saga handle unwinding from sled agent instance PUT errors Dec 12, 2023
@gjcolombo
Contributor Author

Dev cluster smoke testing came out clean, so I think this is ready for review.

@gjcolombo gjcolombo marked this pull request as ready for review December 12, 2023 22:30
@smklein smklein self-requested a review December 12, 2023 23:03
Collaborator

@smklein smklein left a comment


Thanks for fixing this so quickly!

Resolved review comment threads on nexus/src/app/instance.rs and nexus/src/app/sagas/instance_start.rs.
Comment on lines +639 to +643
info!(osagactx.log(),
"start saga: instance already unregistered from sled";
"instance_id" => %instance_id);

Ok(())
Collaborator


Oof, yeah, I get this, but it feels a little sketchy to "do nothing" on this pathway. The realm of `!inner.instance_unhealth()` isn't super narrow, right?

We don't need to fix this now (I think this is at parity with main), but it jumps out to me as a little odd.

Contributor Author


One of the things I like about this change is that it brings out the importance of #3238 and #4226 :) You're right that we could do a lot more to reason carefully about what the different possible error codes mean here.

I added a comment to this match in 5608481.
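For context, the decision being discussed has roughly the following shape (a sketch with illustrative names, not the actual saga code): the step treats "instance already gone" responses from sled agent as a successful no-op and only propagates other errors.

```rust
/// Illustrative error type for the sketch below; the real code matches on the
/// sled agent client error returned when unregistering the instance.
#[derive(Debug)]
enum UnregisterError {
    /// Sled agent reports it has no record of the instance.
    InstanceNotFound,
    /// Anything else.
    Other(String),
}

/// Sketch of the decision discussed above: if the instance was never
/// registered (or was already cleaned up), there is nothing left to do, so
/// the step logs and returns Ok. Other errors still fail the step.
/// The real match is broader than this single variant, which is what the
/// review comment above calls out.
fn handle_unregister_result(result: Result<(), UnregisterError>) -> Result<(), String> {
    match result {
        Ok(()) => Ok(()),
        Err(UnregisterError::InstanceNotFound) => {
            // This is where the "instance already unregistered from sled"
            // log line in the quoted snippet fires.
            Ok(())
        }
        Err(other) => Err(format!("failed to unregister instance: {other:?}")),
    }
}
```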

@gjcolombo gjcolombo enabled auto-merge (squash) December 13, 2023 00:59
@gjcolombo gjcolombo merged commit 180616e into main Dec 13, 2023
20 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/provisioning-underflow branch December 13, 2023 02:45
gjcolombo added a commit that referenced this pull request Dec 18, 2023
…ype from calls to stop/reboot (#4711)

Restore `instance_reboot` and `instance_stop` to their prior behavior:
if these routines try to contact sled agent and get back a server error,
mark the instance as unhealthy and move it to the Failed state.

Also use `#[source]` instead of message interpolation in
`InstanceStateChangeError::SledAgent`.

This restores the status quo ante from #4682 in anticipation of reaching
a better overall mechanism for dealing with failures to communicate
about instances with sled agents. See #3206, #3238, and #4226 for more
discussion.

Tests: new integration test; stood up a dev cluster, started an
instance, killed the zone with `zoneadm halt`, and verified that calls
to reboot/stop the instance eventually marked it as Failed (due to a
timeout attempting to contact the Propolis zone).

Fixes #4709.
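The `#[source]` change mentioned above is a standard thiserror pattern; a minimal before/after illustration with simplified stand-in types (not the actual Omicron definitions):

```rust
use thiserror::Error;

/// Stand-in for the sled agent client error the real variant wraps.
#[derive(Debug, Error)]
#[error("sled agent request failed: {0}")]
struct SledAgentClientError(String);

/// Before: interpolating the inner error's message into the variant's own
/// message flattens it into a string and drops the error chain.
#[derive(Debug, Error)]
enum BeforeStateChangeError {
    #[error("sled agent error: {0}")]
    SledAgent(SledAgentClientError),
}

/// After: `#[source]` keeps the inner error attached, so callers and loggers
/// can walk the chain via `std::error::Error::source()`.
#[derive(Debug, Error)]
enum AfterStateChangeError {
    #[error("sled agent client error")]
    SledAgent(#[source] SledAgentClientError),
}
```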