Nexus should restart `Failed` instances when `boot_on_fault` says to #6491

hawkw · 2024-08-30T16:05:41Z

Depends on #6455 (and probably also #6490).

Per RFD 486:

An instance’s boot_on_fault discipline tells Nexus whether to try to recover after retiring a failed VMM. The options are to do nothing (the default) or to try to restart the instance automatically.

We should implement that.

Potentially, we could attempt to schedule a new start saga for an instance as part of the update saga that transitions it to Failed. However, regardless of whether we do that, there should definitely be an RPW that's responsible for periodically listing instances which are in the Failed state and have boot_on_fault disciplines indicating that they should be restarted, and for ensuring that a start saga is started for those instances. Update sagas which have transitioned an instance to Failed could just activate that background task.
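To make the selection step of that background task concrete, here is a minimal sketch of what one activation might compute. Every name in it (`Instance`, `InstanceState`, `instances_to_restart`, the `start_in_progress` set) is hypothetical and stands in for whatever Nexus actually uses; it models the database query as a filter over an in-memory slice and does not touch sagas at all.

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-ins for the real Nexus types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Running,
    Stopped,
    Failed,
}

#[derive(Debug, Clone)]
struct Instance {
    id: u32,
    state: InstanceState,
    // Modeled as a plain bool, matching the current `instance` table column.
    boot_on_fault: bool,
}

/// One activation of the (hypothetical) restart task: returns the IDs of
/// instances that are `Failed`, have `boot_on_fault` set, and do not already
/// have a start saga in flight. The caller would start a saga for each ID.
fn instances_to_restart(
    instances: &[Instance],
    start_in_progress: &HashSet<u32>,
) -> Vec<u32> {
    instances
        .iter()
        .filter(|i| i.state == InstanceState::Failed && i.boot_on_fault)
        // Skipping instances with a saga in flight keeps repeated
        // activations from starting duplicate sagas.
        .filter(|i| !start_in_progress.contains(&i.id))
        .map(|i| i.id)
        .collect()
}
```

Because the function only reads state and returns a list, it gives the same answer whether the task runs on its periodic timer or is activated early by an update saga, which is the property the proposal relies on.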


gjcolombo · 2024-08-30T16:07:52Z

Related: #4872


Currently, the `instance` table has a `boot_on_fault` column, which is a `bool`. This is intended to indicate whether an instance should be automatically restarted in response to failures, although currently, it isn't actually consumed by anything. However, as determined in RFD 486, Nexus will eventually grow functionality for automatically restarting instances that have transitioned to the `Failed` state, if `boot_on_fault` is set.

@gjcolombo suggests that this field should probably be replaced with an enum, rather than a `bool`. This way, we could represent more boot-on-fault disciplines than "always reboot on faults" and "never do that". Since nothing is currently consuming this field, it's probably a good idea to go ahead and do the requisite SQL surgery to turn it into an enum _now_, before we write code that actually consumes it... especially since I'm planning on actually doing that next (see #6491).

This commit replaces the `boot_on_fault` column with a new `auto_restart_policy` column. The type for this column is a newly-added `InstanceAutoRestart` enum, which currently has variants for `AllFailures` (the instance should be automatically restarted any time it fails), `SledFailuresOnly` (the instance should be restarted if the sled it's on reboots or is expunged, but it should *not* be restarted if its individual VMM process crashes), and `Never` (the instance should never be automatically restarted).

The database migration adding `auto_restart_policy` backfills any existing `boot_on_fault: true` instances with `InstanceAutoRestart::AllFailures` (although, because users can't currently set a value for `boot_on_fault`, there should usually be no such instances). For instances with `boot_on_fault: false`, the auto-restart policy is left `NULL`; in the future, we can determine what the default policy should be (perhaps at the project-level) and backfill these `NULL`s as well. For now, instances with an unset (`NULL`) policy will not be restarted.

Closes #6490
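The policy semantics and the backfill rule described in that commit message can be sketched in a few lines of Rust. The variant names come from the commit message; everything else (`FailureKind`, `backfill_policy`, `should_restart`, and modeling the nullable column as an `Option`) is an assumption made for illustration and is not taken from the actual Omicron code.

```rust
// Variant names are from the commit message above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceAutoRestart {
    AllFailures,
    SledFailuresOnly,
    Never,
}

// Hypothetical classification of why an instance ended up `Failed`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailureKind {
    // The sled the instance was on rebooted or was expunged.
    Sled,
    // The instance's individual VMM process crashed.
    Vmm,
}

/// Mirrors the migration's backfill rule: `boot_on_fault: true` becomes
/// `AllFailures`; `boot_on_fault: false` leaves the policy unset (NULL).
fn backfill_policy(boot_on_fault: bool) -> Option<InstanceAutoRestart> {
    if boot_on_fault {
        Some(InstanceAutoRestart::AllFailures)
    } else {
        None
    }
}

/// Decides whether a failed instance should be restarted. An unset (`None`)
/// policy is treated like `Never`, per "For now, instances with an unset
/// (`NULL`) policy will not be restarted."
fn should_restart(
    policy: Option<InstanceAutoRestart>,
    failure: FailureKind,
) -> bool {
    match policy {
        Some(InstanceAutoRestart::AllFailures) => true,
        Some(InstanceAutoRestart::SledFailuresOnly) => {
            failure == FailureKind::Sled
        }
        Some(InstanceAutoRestart::Never) | None => false,
    }
}
```

Keeping `None` distinct from `Some(Never)` preserves the difference between "no policy chosen yet" and "explicitly never restart", so a later default (for example, one set at the project level) could be backfilled without overriding explicit choices.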

askfongjojo · 2024-09-27T19:55:07Z

@hawkw - Is this considered done? Or are we using this issue to track the future work of making boot_on_fault configurable by users? (There may already be a ticket for that, but I haven't located it yet.)

hawkw · 2024-09-27T20:53:45Z

This is done --- can't believe I opened this issue and forgot to close it. Whoops!

hawkw added the nexus label Aug 30, 2024

hawkw self-assigned this Aug 30, 2024

This was referenced Aug 30, 2024

[nexus] handle sled-agent errors as described in RFD 486 #6455

Merged

[nexus] Turn instance.boot_on_fault into an enum #6499

Merged

hawkw closed this as completed Sep 27, 2024
