Nexus should restart Failed instances when boot_on_fault says to #6491

Closed
hawkw opened this issue Aug 30, 2024 · 3 comments
Labels
nexus Related to nexus

Comments

@hawkw
Member

hawkw commented Aug 30, 2024

Depends on #6455 (and probably also #6490).

Per RFD 486:

An instance’s boot_on_fault discipline tells Nexus whether to try to recover after retiring a failed VMM. The options are to do nothing (the default) or to try to restart the instance automatically.

We should implement that.

Potentially, we could attempt to schedule a new start saga for an instance as part of the update saga that transitions it to Failed. However, regardless of whether we do that, there should definitely be an RPW that's responsible for periodically listing instances which are in the Failed state and have boot_on_fault disciplines indicating that they should be restarted, and for ensuring that a start saga is started for those instances. Update sagas which have transitioned an instance to Failed could then just activate that background task.
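
The periodic pass described above could look roughly like the following. This is only a minimal sketch of the selection logic, not Nexus code: `Instance`, `InstanceState`, `restart_candidates`, `activate`, and the `in_flight` set are hypothetical names invented for illustration, and `boot_on_fault` is modeled as the plain `bool` it was when this issue was opened. The real task would query the database and launch start sagas rather than return IDs.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified instance states (the real model has more).
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InstanceState {
    Running,
    Stopped,
    Failed,
}

/// Hypothetical stand-in for a row in the `instance` table.
#[allow(dead_code)]
#[derive(Clone, Debug)]
struct Instance {
    id: u64,
    state: InstanceState,
    boot_on_fault: bool,
}

/// List the instances that are `Failed` and whose `boot_on_fault`
/// setting says they should be restarted.
#[allow(dead_code)]
fn restart_candidates(instances: &[Instance]) -> Vec<u64> {
    instances
        .iter()
        .filter(|i| i.state == InstanceState::Failed && i.boot_on_fault)
        .map(|i| i.id)
        .collect()
}

/// One activation of the background task. `in_flight` is an assumed
/// stand-in for "a start saga already exists for this instance", so
/// that repeated activations do not start duplicate sagas.
#[allow(dead_code)]
fn activate(instances: &[Instance], in_flight: &mut HashSet<u64>) -> Vec<u64> {
    let mut started = Vec::new();
    for id in restart_candidates(instances) {
        // `insert` returns false if the ID was already present.
        if in_flight.insert(id) {
            started.push(id);
        }
    }
    started
}
```

Because the pass only looks at current state, it should not matter whether it runs on a timer or because an update saga activated it early; a second activation over the same instances simply starts nothing new.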

@hawkw hawkw added the nexus Related to nexus label Aug 30, 2024
@hawkw hawkw self-assigned this Aug 30, 2024
@gjcolombo
Contributor

Related: #4872

hawkw added a commit that referenced this issue Sep 4, 2024
Currently, the `instance` table has a `boot_on_fault` column, which is a
`bool`. This is intended to indicate whether an instance should be
automatically restarted in response to failures, although currently, it
isn't actually consumed by anything. However, as determined in RFD 486,
Nexus will eventually grow functionality for automatically restarting
instances that have transitioned to the `Failed` state, if
`boot_on_fault` is set.

@gjcolombo suggests that this field should probably be replaced with an
enum, rather than a `bool`. This way, we could represent more
boot-on-fault disciplines than "always reboot on faults" and "never do
that". Since nothing is currently consuming this field, it's probably a
good idea to go ahead and do the requisite SQL surgery to turn it into
an enum _now_, before we write code that actually consumes
it...especially since I'm planning on actually doing that next (see
#6491).

This commit replaces the `boot_on_fault` column with a new
`auto_restart_policy` column. The type for this column is a newly-added
`InstanceAutoRestart` enum, which currently has variants for
`AllFailures` (the instance should be automatically restarted any time
it fails), `SledFailuresOnly` (the instance should be restarted if the
sled it's on reboots or is expunged, but it should *not* be restarted if
its individual VMM process crashes), and `Never` (the instance should
never be automatically restarted). The database migration adding
`auto_restart_policy` backfills any existing `boot_on_fault: true`
instances with `InstanceAutoRestart::AllFailures` (although, because
users can't currently set a value for `boot_on_fault`, there should
usually be no such instances). For instances with `boot_on_fault:
false`, the auto-restart policy is left `NULL`; in the future, we can
determine what the default policy should be (perhaps at the
project-level) and backfill these `NULL`s as well. For now, instances
with an unset (`NULL`) policy will not be restarted.

Closes #6490
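
To make the policy semantics in the commit message concrete, here is a rough sketch. Only the `InstanceAutoRestart` variant names and the backfill rule come from the text above; `FailureCause`, `backfill_policy`, and `should_restart` are hypothetical helpers made up for illustration, and the actual migration is SQL rather than Rust.

```rust
/// Variants as named in the commit message.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InstanceAutoRestart {
    AllFailures,
    SledFailuresOnly,
    Never,
}

/// Hypothetical classification of why an instance failed.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum FailureCause {
    /// The sled rebooted or was expunged.
    Sled,
    /// The instance's individual VMM process crashed.
    Vmm,
}

/// Mirrors the described backfill: `boot_on_fault: true` becomes
/// `AllFailures`; `boot_on_fault: false` is left unset (`NULL`).
#[allow(dead_code)]
fn backfill_policy(boot_on_fault: bool) -> Option<InstanceAutoRestart> {
    if boot_on_fault {
        Some(InstanceAutoRestart::AllFailures)
    } else {
        None
    }
}

/// Decide whether a failed instance should be restarted. An unset
/// policy is treated like `Never`, per "instances with an unset
/// (`NULL`) policy will not be restarted".
#[allow(dead_code)]
fn should_restart(policy: Option<InstanceAutoRestart>, cause: FailureCause) -> bool {
    match policy {
        Some(InstanceAutoRestart::AllFailures) => true,
        Some(InstanceAutoRestart::SledFailuresOnly) => cause == FailureCause::Sled,
        Some(InstanceAutoRestart::Never) | None => false,
    }
}
```

Keeping the unset case as `None` rather than mapping it to `Never` preserves the distinction the commit relies on: a later migration can still backfill a default for instances that never had a policy chosen.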
@askfongjojo

@hawkw - Is this considered done? Or are we using this issue to track the future work of making boot_on_fault configurable by users? (There may already be a ticket for that, but I haven't located it yet.)

@hawkw
Member Author

hawkw commented Sep 27, 2024

This is done --- can't believe I opened this issue and forgot to close it. Whoops!

@hawkw hawkw closed this as completed Sep 27, 2024