Nexus should restart `Failed` instances when `boot_on_fault` says to
#6491
Labels
nexus
Related to nexus
Comments
Related: #4872
This was referenced Aug 30, 2024
hawkw added a commit that referenced this issue Sep 4, 2024
Currently, the `instance` table has a `boot_on_fault` column, which is a `bool`. This is intended to indicate whether an instance should be automatically restarted in response to failures, although currently, it isn't actually consumed by anything. However, as determined in RFD 486, Nexus will eventually grow functionality for automatically restarting instances that have transitioned to the `Failed` state, if `boot_on_fault` is set.

@gjcolombo suggests that this field should probably be replaced with an enum, rather than a `bool`. This way, we could represent more boot-on-fault disciplines than "always reboot on faults" and "never do that". Since nothing is currently consuming this field, it's probably a good idea to go ahead and do the requisite SQL surgery to turn it into an enum _now_, before we write code that actually consumes it...especially since I'm planning on actually doing that next (see #6491).

This commit replaces the `boot_on_fault` column with a new `auto_restart_policy` column. The type for this column is a newly-added `InstanceAutoRestart` enum, which currently has variants for `AllFailures` (the instance should be automatically restarted any time it fails), `SledFailuresOnly` (the instance should be restarted if the sled it's on reboots or is expunged, but it should *not* be restarted if its individual VMM process crashes), and `Never` (the instance should never be automatically restarted).

The database migration adding `auto_restart_policy` backfills any existing `boot_on_fault: true` instances with `InstanceAutoRestart::AllFailures` (although, because users can't currently set a value for `boot_on_fault`, there should usually be no such instances). For instances with `boot_on_fault: false`, the auto-restart policy is left `NULL`; in the future, we can determine what the default policy should be (perhaps at the project-level) and backfill these `NULL`s as well. For now, instances with an unset (`NULL`) policy will not be restarted.

Closes #6490
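As a rough illustration of the policy semantics described in that commit message, here is a minimal Rust sketch. The variant names come from the commit message; everything else (the `FailureCause` type, `backfill_policy`, and `should_restart`) is hypothetical and is not taken from the Omicron codebase.

```rust
/// Mirrors the variants named in the commit message. The real
/// definition in Nexus may differ; this is only an illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstanceAutoRestart {
    /// Restart any time the instance fails.
    AllFailures,
    /// Restart only if the sled rebooted or was expunged.
    SledFailuresOnly,
    /// Never restart automatically.
    Never,
}

/// Hypothetical classification of why an instance failed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureCause {
    /// The sled hosting the instance rebooted or was expunged.
    SledFailure,
    /// The instance's individual VMM process crashed.
    VmmCrash,
}

/// Models the migration's backfill rule: `boot_on_fault: true` becomes
/// `AllFailures`, and `false` is left unset (`NULL` in the database).
pub fn backfill_policy(boot_on_fault: bool) -> Option<InstanceAutoRestart> {
    if boot_on_fault {
        Some(InstanceAutoRestart::AllFailures)
    } else {
        None
    }
}

/// Hypothetical decision helper: an unset policy is treated like
/// `Never`, matching "instances with an unset policy will not be
/// restarted".
pub fn should_restart(
    policy: Option<InstanceAutoRestart>,
    cause: FailureCause,
) -> bool {
    match policy {
        Some(InstanceAutoRestart::AllFailures) => true,
        Some(InstanceAutoRestart::SledFailuresOnly) => {
            cause == FailureCause::SledFailure
        }
        Some(InstanceAutoRestart::Never) | None => false,
    }
}
```

Modeling the unset policy as `None` rather than folding it into `Never` keeps room for a later default (for example, one set at the project level) without changing the meaning of rows that explicitly opted out.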
@hawkw - Is this considered done? Or we're using this issue to track the future work of making |
This is done --- can't believe I opened this issue and forgot to close it. Whoops!
Depends on #6455 (and probably also #6490).
Per RFD 486:
We should implement that.
Potentially, we could attempt to schedule a new start saga for an instance as part of the update saga that transitions it to `Failed`. However, regardless of whether or not we do that, there should definitely be an RPW that's responsible for periodically listing instances which are in the `Failed` state and have `boot_on_fault` disciplines indicating that they should be restarted, and ensuring that a start saga is started for those instances. Update sagas which have transitioned an instance to `Failed` could just activate that background task.
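A single activation of such a background task might look roughly like the sketch below. This is an in-memory model only: `InstanceRecord`, `restart_on_failure`, `restart_pass`, and the `in_flight` set are hypothetical stand-ins for the database query and saga machinery, not Nexus APIs.

```rust
use std::collections::BTreeSet;

/// Simplified instance states; only `Failed` matters here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstanceState {
    Running,
    Stopped,
    Failed,
}

/// Hypothetical stand-in for a row in the `instance` table.
#[derive(Debug, Clone)]
pub struct InstanceRecord {
    pub id: u64,
    pub state: InstanceState,
    /// Whether the instance's boot-on-fault discipline asks for a restart.
    pub restart_on_failure: bool,
}

/// One activation of the task: list `Failed` instances whose discipline
/// asks for a restart and make sure each has a start saga.
///
/// `in_flight` models the set of instances that already have a start
/// saga, so repeated activations do not start duplicates. Returns the
/// IDs for which a new start saga would be started by this pass.
pub fn restart_pass(
    instances: &[InstanceRecord],
    in_flight: &mut BTreeSet<u64>,
) -> Vec<u64> {
    let mut started = Vec::new();
    for inst in instances {
        let eligible =
            inst.state == InstanceState::Failed && inst.restart_on_failure;
        // `insert` returns false if the ID was already present, i.e. a
        // start saga already exists for this instance.
        if eligible && in_flight.insert(inst.id) {
            started.push(inst.id);
        }
    }
    started
}
```

Because the pass is idempotent, it does not matter whether it runs on a periodic timer or because an update saga explicitly activated it after moving an instance to `Failed`; a second activation over the same state starts nothing new.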