[nexus] Turn instance.boot_on_fault into an enum #6499
Conversation
Most of this change is pretty straightforward; while I welcome input from reviewers on what to name the new fields, most of the complexity is just the database migration to replace the existing `boot_on_fault` column.
Note that this PR is currently stacked on top of #6455, because I wanted to start working on a change that depends on both of them.
Hmm, I'm not sure whether this CI failure is my fault... will have to investigate deeper: https://buildomat.eng.oxide.computer/wg/0/details/01J6Q8748MZKW18KSEF4540CQZ/iy5XpuAznwyLAetcf2Mfjh1o9XO11hzknMn8Mr0yhl9xfXK3/01J6Q87HTCRRS09RECVKRHBBSR#S5113
Rebased the branch and it went away, presumably this was a flake: #6506
Reading over #4872, it occurs to me that we may not actually want the default to be `Never` --- perhaps something like:

```rust
enum InstanceAutoRestart {
    Never,
    #[default]
    SledFailuresOnly,
    AllFailures,
}
```

Unfortunately, we don't currently have a way for Nexus to differentiate between instances that have disappeared because their whole sled rebooted and instances that disappeared because their individual VMM crashed, so actually implementing the "sled failures only" policy will require a bit more work. But, maybe we should still make it the default now so that existing instances get that behavior once we can actually implement it?

cc @gjcolombo @davepacheco what do you think?
I went ahead and did this in 2db6eff, because I think that if we want to implement auto-restart on sled reboots, and we want it to be the default behavior, we ought to make sure it's the default now for backfilling the auto-restart policy on current instance records. If others disagree that this is the right default, I'm happy to undo that change.
2db6eff added a `SledFailuresOnly` auto-restart policy in addition to `Never` and `AllFailures`. I discussed the rationale for that in [this comment][1]. Currently, there isn't a mechanism to detect whether an instance is `Failed` because the individual instance crashed or because the whole sled was restarted, so for now, we assume all failures are instance-level. But, we still need to handle the new variant.

[1]: #6499 (comment)
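A minimal sketch of what "handle the new variant" could look like while sled failures can't yet be distinguished from individual VMM crashes; `may_auto_restart` and the `sled_failure` flag are illustrative names, not code from the branch:

```rust
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum InstanceAutoRestart {
    Never,
    #[default]
    SledFailuresOnly,
    AllFailures,
}

/// Whether a `Failed` instance may be restarted automatically under `policy`.
/// `sled_failure` would eventually come from real fault information; until
/// then, callers can only pass a pessimistic `false`.
fn may_auto_restart(policy: InstanceAutoRestart, sled_failure: bool) -> bool {
    match policy {
        InstanceAutoRestart::Never => false,
        InstanceAutoRestart::AllFailures => true,
        // For now every failure is treated as an instance-level failure, so
        // this arm effectively never permits a restart until sled-failure
        // detection exists.
        InstanceAutoRestart::SledFailuresOnly => sled_failure,
    }
}
```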
I think the right default is whatever users will find least surprising :) Going into this, I would've said that "never" is the right default, because my mental model is roughly "I should have to command the control plane to start a stopped instance," and if I don't enable auto-restart for a failed instance, then I have issued no such command. But I suspect I'm in the minority here and that most users will be less surprised by a model that says, "once you've asked for an instance to be running, the control plane tries to keep it running until you ask it not to be running anymore." (FWIW this appears to be the default model in the big three public clouds.)

So, all told: I think I'm the weird one here and am OK with making "reboot on host failure" the default option.

One tangential note about rebooting on VMM failure: I'm fine with having the enum variant here, but before we wire it up and expose it, I'd like to understand how we recover if some instance gets stuck in a panic loop (i.e. where something is wrong with a specific instance that makes Propolis deterministically panic when trying to start it). Such an issue would be a Propolis bug, of course. We don't have to (and probably shouldn't) try to solve that in this thread, but I'd like to put a pin in it.
FWIW, my initial thinking was also that "never" should be the default, because I don't love the idea of implicitly going and doing stuff the user hasn't explicitly asked for. But, yeah, thinking about it, isn't the whole point of "the cloud" that the user should be insulated from host failures? You're absolutely correct that the major public clouds don't generally require you to manually restart your VMMs if one of their hosts reboots --- in many cases, they don't even expose host reboots to the user at all! I also think that, if…
Yeah, in the long term, I think we probably want our restart policies to include a limit on the number of consecutive automatic restarts due to the same class of failure. But, depending on how granular we want those limits to be, we probably want some notion of "failure reasons" in the control plane...that could just be separating "single VMM failures" from "sled rebooted", which will be necessary in order to actually implement the "sled failures only" policy anyway...
i'd agree with this in a different way: rebooting on sled failure feels morally similar to configuring a machine to power on after power loss is detected. that, in turn, is a pretty common intentional choice IME?
This raises another question, I think. The existing implicit setting here is "never." Currently we have no instance-update API and so have no way for a user to change the setting. Should we wait to change the behavior for existing instances until users have a way to change it back? (There are other good reasons to have an instance update API, e.g. #3769, so there are lots of additional rewards to be gained from that particular side quest.)
That is fair, perhaps we should default to "never", allow selecting an alternative policy for creating new instances, and then add an API to change the setting for existing instances.

Leaving the default as "never" does seem like the safest choice, but until we add an instance-update API, this will force users to delete existing instances and recreate them just to get the "restart automatically if the sled reboots" behavior, which I do still think is probably what a substantial majority of users will want for a substantial majority of instances...
@gjcolombo okay, as we discussed, I've gone back to making…
For what it's worth, my expectations match @iximeow's: this is analogous to power-on-after-power-loss to me. I'd be surprised if a VM I had provisioned was just not running one day because of a server reboot. Also, I think the auto-restart behavior can help smooth over other problems -- e.g., one day we will have some problem where sleds panic occasionally, and it will be much worse if end users have to take action whenever that happens to keep their stuff running.

I do think it'll be important to deal with the consecutive-failed-start problem. That could easily become a DoS (intentional or otherwise) even if it's just a propolis crash. Imagine if a bug causes the VM to trigger a host OS kernel panic. We had more than one issue like this at Joyent, where some workload triggered an OS panic, so the system moved it to another host, which triggered another panic, .... It was ugly. I'd consider a very simple "no more than X restarts in Y minutes" policy, or even "no more than one restart within X minutes", which would only require storing one…

What do you think about having a database value called…
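As a rough sketch (not code from this PR), the "no more than one restart within X minutes" variant really does only need one stored timestamp; `time_last_auto_restarted` and `cooldown` are assumed names here:

```rust
use chrono::{DateTime, Duration, Utc};

/// True if an automatic restart is allowed now, given the single stored
/// timestamp of the previous automatic restart (if any).
fn cooldown_elapsed(
    time_last_auto_restarted: Option<DateTime<Utc>>,
    cooldown: Duration,
    now: DateTime<Utc>,
) -> bool {
    match time_last_auto_restarted {
        // Never auto-restarted before, so nothing to rate-limit against.
        None => true,
        // Otherwise only allow a restart once the cooldown has passed.
        Some(last) => now - last >= cooldown,
    }
}
```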
This seems like a pretty appealing compromise, honestly --- I suppose there's no actual reason we have to backfill it except for "it would be nice to make the column non-null", but...it would also be fine to leave it NULL for now, and if we decided to commit to a default behavior later, we could then go back and backfill the NULLs with that default...or not do that. You've sold me!
Okay, 86b0ae6 makes the auto-restart policy nullable, and leaves it NULL for all instances where…
Commit 86b0ae6 changed the default value for `instance.auto_restart_policy` to `NULL` when it hasn't been explicitly specified, so this branch has to be updated to track that. See #6499 (comment)
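A small illustrative sketch of what "NULL unless explicitly specified" looks like on the model side, assuming the nullable column maps to an `Option`; the type and method names here are made up:

```rust
// Illustration only: a nullable column maps to an Option, so "no explicit
// policy" stays distinguishable from any concrete policy the user chose.
struct Instance<Policy> {
    auto_restart_policy: Option<Policy>,
}

impl<Policy: Copy> Instance<Policy> {
    /// The caller supplies whatever fallback the control plane settles on
    /// later; the row itself stays NULL until the user sets a policy.
    fn effective_policy(&self, fallback: Policy) -> Policy {
        self.auto_restart_policy.unwrap_or(fallback)
    }
}
```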
As of #6455, instances in the `Failed` state are allowed to be restarted. As described in [RFD 486 § 4.3], Nexus should automatically restart such instances when the instance record's `auto_restart_policy` field (née `boot_on_fault`, see #6499) indicates that it is permitted to do so. This branch implements this behavior by adding a new `instance_reincarnation` RPW. This RPW consists of a background task which queries CRDB for instances in the `Failed` state which are eligible to be automatically restarted, and starts a new `instance-start` saga for any instances it discovers.

Instances are considered eligible for reincarnation if all of the following conditions are true:

- **The instance is in the `InstanceState::Failed` state.** Meaning, it must have *no active VMM* --- if the instance is in the `InstanceState::Vmm` state with an active VMM which is in `VmmState::Failed`, that indicates no instance-update saga has run to clean up the old VMM's resources. In this case, we must wait until the instance record itself has transitioned to `Failed` before attempting to restart it.
- **The instance's `auto_restart_policy` permits it to be restarted.** The `auto_restart_policy` enum was changed from representing the classes of failures on which an instance may be restarted (`never`, `sled_failures_only`, and `all_failures`) to a more general quality-of-service for automatic restarts. Presently, this can either be `never`, meaning that the instance can never be restarted automatically; or `best_effort`, meaning that the control plane should try to keep the instance running but is permitted to not restart it if necessary to preserve the stability of the whole system. The default policy for instances which don't provide one is `best_effort`.
- **A cooldown period has elapsed since the instance's last automatic restart.** In order to prevent instances which fail frequently from compromising the reliability of the system as a whole, we enforce a cooldown period between automatic restarts. This is tracked by setting a last-auto-restart timestamp in the `instance` record. At present, the cooldown period is always one hour unless the instance record overrides it, which currently can only be done using a test-only Nexus method. In the future, we may consider allowing the cooldown period to be configured by users, but because it presents a denial-of-service risk, setting it should probably be a more privileged operation than creating an instance.

The implementation of the actual reincarnation background task is relatively straightforward: it runs a new `DataStore::find_reincarnatable_instances` query, which returns instances in the `Failed` state, and it spawns `instance-start` sagas for any instances which satisfy the above conditions. The restart cooldown is implemented by adding a new `time_last_auto_restarted` field to the `instance` table and to the `InstanceRuntimeState` type. We add a `Reason` enum to the `instance-start` saga's `Params` that indicates why the instance is being started, and if it is `Reason::AutoRestart`, then the start saga will set the `time_last_auto_restarted` timestamp when moving the instance record to the `Starting` state.

Additionally, I've added the `auto_restart_policy` field to the `params::InstanceCreate` type in the Nexus API so that instances can be created with auto-restart policies. Reviewers should note that many of the files changed in this diff were only modified by adding that field to a `params::InstanceCreate` literal.
This branch adds a minimal, initial implementation of instance auto-restarting.

[RFD 486 § 4.3]: https://rfd.shared.oxide.computer/rfd/486#_instances

Fixes #5553
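To make the eligibility conditions above concrete, here is a rough, hedged sketch of them as a single in-memory predicate. The actual check is a CRDB query (`DataStore::find_reincarnatable_instances`) plus background-task logic, and none of these type or field names are the real omicron definitions:

```rust
use chrono::{DateTime, Duration, Utc};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum AutoRestartPolicy {
    Never,
    BestEffort,
}

/// Simplified stand-in for the relevant bits of an `instance` record.
struct InstanceRecord {
    /// True only when the *instance record* is `Failed` (no active VMM left
    /// for an instance-update saga to clean up).
    instance_is_failed: bool,
    auto_restart_policy: Option<AutoRestartPolicy>,
    time_last_auto_restarted: Option<DateTime<Utc>>,
}

fn eligible_for_reincarnation(
    instance: &InstanceRecord,
    now: DateTime<Utc>,
    cooldown: Duration,
) -> bool {
    // 1. The instance itself must be Failed, not just its VMM.
    if !instance.instance_is_failed {
        return false;
    }
    // 2. The policy must permit restarts; an unset policy defaults to
    //    best-effort, per the description above.
    let policy = instance
        .auto_restart_policy
        .unwrap_or(AutoRestartPolicy::BestEffort);
    if policy == AutoRestartPolicy::Never {
        return false;
    }
    // 3. The cooldown since the last automatic restart must have elapsed.
    match instance.time_last_auto_restarted {
        None => true,
        Some(last) => now - last >= cooldown,
    }
}
```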
Currently, the `instance` table has a `boot_on_fault` column, which is a `bool`. This is intended to indicate whether an instance should be automatically restarted in response to failures, although currently, it isn't actually consumed by anything. However, as determined in RFD 486, Nexus will eventually grow functionality for automatically restarting instances that have transitioned to the `Failed` state, if `boot_on_fault` is set.

@gjcolombo suggests that this field should probably be replaced with an enum, rather than a `bool`. This way, we could represent more boot-on-fault disciplines than "always reboot on faults" and "never do that". Since nothing is currently consuming this field, it's probably a good idea to go ahead and do the requisite SQL surgery to turn it into an enum now, before we write code that actually consumes it...especially since I'm planning on actually doing that next (see #6491).

This commit replaces the `boot_on_fault` column with a new `auto_restart_policy` column. The type for this column is a newly-added `InstanceAutoRestart` enum, which currently has variants for `AllFailures` (the instance should be automatically restarted any time it fails) and `Never` (the instance should never be automatically restarted). The database migration adding `auto_restart_policy` backfills any existing `boot_on_fault: true` instances with `InstanceAutoRestart::AllFailures`, and `boot_on_fault: false` with `InstanceAutoRestart::Never`.

Closes #6490
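The backfill itself is a SQL migration in the schema change, but the mapping it encodes is simple; written out in Rust purely as an illustration of that mapping, not as code from the change:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InstanceAutoRestart {
    Never,
    AllFailures,
}

impl From<bool> for InstanceAutoRestart {
    /// How the migration maps the old `boot_on_fault` bool onto the new enum.
    fn from(boot_on_fault: bool) -> Self {
        if boot_on_fault {
            // boot_on_fault = true previously meant "restart on any fault".
            InstanceAutoRestart::AllFailures
        } else {
            InstanceAutoRestart::Never
        }
    }
}
```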