Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus] Allow stopping Failed instances #6652

Merged
merged 2 commits into from
Sep 24, 2024
Merged

[nexus] Allow stopping Failed instances #6652

merged 2 commits into from
Sep 24, 2024

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Sep 24, 2024

PR #6503 changed Nexus to attempt to automatically restart instances which are in the Failed state. Now that we do this, we should probably change the allowable instance state transitions to permit a user to stop an instance that is Failed, as a way to say "stop trying to restart this instance" (as Stopped instances are not restarted). This branch changes Nexus::instance_request_state and
select_instance_change_action to permit stopping a Failed instance.

Fixes #6640
Fixes #2825, along with #6455 (which allowed restarting Failed instances).

PR #6503 changed Nexus to attempt to automatically restart instances
which are in the `Failed` state. Now that we do this, we should probably
change the allowable instance state transitions to permit a user to stop
an instance that is `Failed`, as a way to say "stop trying to restart
this instance" (as `Stopped` instances are not restarted). This branch
changes `Nexus::instance_request_state` and
`select_instance_change_action` to permit stopping a `Failed` instance.

Fixes #6640

I believe this also fixes #2825, along with #6455 (which allowed
restarting `Failed` instances).
@hawkw hawkw requested a review from gjcolombo September 24, 2024 17:27
@hawkw hawkw self-assigned this Sep 24, 2024
nexus/src/app/instance.rs Outdated Show resolved Hide resolved
@hawkw
Copy link
Member Author

hawkw commented Sep 24, 2024

@david-crespo I was just about to leave a comment letting you know about this change, but it looks like you were way ahead of me with oxidecomputer/console#2468. Nice :)

@david-crespo
Copy link
Contributor

My email inbox is a nightmare but I am up to date

@hawkw
Copy link
Member Author

hawkw commented Sep 24, 2024

The CI failure is due to an unexpected Tokio task cancellation in test_omdb_success_cases, which is a known test flake --- see #6505. I'm going to restart that run.

hawkw added a commit that referenced this pull request Sep 24, 2024
Currently, instances whose active VMM is `SagaUnwound` appear externally
as `Stopped`. We decided to report them as `Stopped` because start sagas
are permitted to run for instances with `SagaUnwound` active VMMs, and
--- at the time when the `SagaUnwound` VMM state was introduced,
`Failed` instances could not be started. However, #6455 added the
ability to restart `Failed` instances, and #6652 will permit them to be
stopped. Therefore, we should recast instances with `SagaUnwound` active
VMMs as `Failed`: they weren't asked politely to stop; instead, we
attempted to start them and something went wrong...which sounds like
`Failed` to me.

This becomes more important in light of #6638: if we will attempt
automatically restart such instances, they should definitely appear
to be `Failed`. The distinction between `Failed` and `Stopped` becomes
that `Failed` means "this thing isn't running, but it's supposed to be;
we may try to fix that for you if permitted to do so", while `Stopped`
means "this thing isn't running and that's fine, because you asked for
it to no longer be running". Thus, this commit changes `SagaUnwound`
VMMs to appear `Failed` externally.
nexus/src/app/instance.rs Show resolved Hide resolved
nexus/src/app/instance.rs Show resolved Hide resolved
nexus/src/app/instance.rs Outdated Show resolved Hide resolved
@hawkw hawkw enabled auto-merge (squash) September 24, 2024 22:22
@hawkw hawkw merged commit 0c7fb27 into main Sep 24, 2024
16 checks passed
@hawkw hawkw deleted the eliza/stop-failing branch September 24, 2024 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[nexus] want to allow stopping Failed instances Failed instances should be allowed to stop and restart
3 participants