
[nexus] Turn instance.boot_on_fault into an enum #6499

Merged
merged 8 commits into main from eliza/boot-on-fault-isnt-just-a-bool
Sep 4, 2024

Conversation

hawkw
Member

@hawkw hawkw commented Aug 31, 2024

Currently, the instance table has a boot_on_fault column, which is a
bool. This is intended to indicate whether an instance should be
automatically restarted in response to failures, although it isn't
actually consumed by anything yet. However, as determined in RFD 486,
Nexus will eventually grow functionality for automatically restarting
instances that have transitioned to the Failed state, if
boot_on_fault is set.

@gjcolombo suggests that this field should probably be replaced with an
enum, rather than a bool. This way, we could represent more
boot-on-fault disciplines than "always reboot on faults" and "never do
that". Since nothing is currently consuming this field, it's probably a
good idea to go ahead and do the requisite SQL surgery to turn it into
an enum now, before we write code that actually consumes
it...especially since I'm planning on actually doing that next (see
#6491).

This commit replaces the boot_on_fault column with a new
auto_restart_policy column. The type for this column is a newly-added
InstanceAutoRestart enum, which currently has variants for
AllFailures (the instance should be automatically restarted any time
it fails) and Never (the instance should never be automatically
restarted). The database migration adding auto_restart_policy
backfills any existing boot_on_fault: true instances with
InstanceAutoRestart::AllFailures, and boot_on_fault: false with
InstanceAutoRestart::Never.
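
To make that mapping concrete, here is a minimal sketch of the new enum and the backfill described above. It uses plain Rust types for illustration only; the real definitions in nexus/db-model use project-specific macros for the database enum mapping, which aren't shown here.

```rust
/// Illustrative sketch of the new policy enum (initial shape from this PR).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum InstanceAutoRestart {
    /// Never restart the instance automatically.
    Never,
    /// Restart the instance any time it fails.
    AllFailures,
}

/// Backfill mapping applied by the migration, per the description above.
fn backfill(boot_on_fault: bool) -> InstanceAutoRestart {
    if boot_on_fault {
        InstanceAutoRestart::AllFailures
    } else {
        InstanceAutoRestart::Never
    }
}
```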

Closes #6490

@hawkw hawkw requested review from smklein and gjcolombo August 31, 2024 16:05
@hawkw
Member Author

hawkw commented Aug 31, 2024

Most of this change is pretty straightforward; I welcome input from reviewers on what to name the new fields, but the bulk of the complexity is the database migration that replaces boot_on_fault with auto_restart_policy. I'm pretty sure I've done that correctly, but I'd love extra attention on that part.

@hawkw hawkw force-pushed the eliza/boot-on-fault-isnt-just-a-bool branch from 7b9f541 to a89df2f Compare August 31, 2024 16:42
@hawkw
Member Author

hawkw commented Aug 31, 2024

Note that this PR is currently stacked on top of #6455, because I wanted to start working on a change that depends on both of them.

@hawkw hawkw changed the base branch from main to eliza/486-sa-errors August 31, 2024 16:43
@hawkw
Member Author

hawkw commented Sep 1, 2024

@hawkw
Member Author

hawkw commented Sep 2, 2024

Hmm, I'm not sure whether this CI failure is my fault...will have to investigate further: https://buildomat.eng.oxide.computer/wg/0/details/01J6Q8748MZKW18KSEF4540CQZ/iy5XpuAznwyLAetcf2Mfjh1o9XO11hzknMn8Mr0yhl9xfXK3/01J6Q87HTCRRS09RECVKRHBBSR#S5113

Rebased the branch and it went away; presumably this was a flake: #6506

@hawkw
Member Author

hawkw commented Sep 2, 2024

Reading over #4872, it occurs to me that we may not actually want the default InstanceAutoRestart policy to be Never --- maybe the default that we backfill existing boot_on_fault: false instances with ought to be "restart the instance if the sled it was on rebooted, but not if the propolis-server process crashed"? So, maybe we should have variants like:

enum InstanceAutoRestart {
   Never,
   #[default]
   SledFailuresOnly,
   AllFailures,
}

Unfortunately, we don't currently have a way for Nexus to differentiate between instances that have disappeared because their whole sled rebooted and instances that disappeared because their individual VMM crashed, so actually implementing the "sled failures only" policy will require a bit more work. But, maybe we should still make it the default now so that existing instances get that behavior once we can actually implement it?

cc @gjcolombo @davepacheco what do you think?

Base automatically changed from eliza/486-sa-errors to main September 2, 2024 18:41
@hawkw hawkw force-pushed the eliza/boot-on-fault-isnt-just-a-bool branch from 24b49e5 to 9ca4c5e Compare September 2, 2024 18:43
@hawkw hawkw force-pushed the eliza/boot-on-fault-isnt-just-a-bool branch from 637b780 to 2db6eff Compare September 2, 2024 18:52
@hawkw
Member Author

hawkw commented Sep 2, 2024

Reading over #4872, it occurs to me that we may not actually want the default InstanceAutoRestart policy to be Never --- maybe the default that we backfill existing boot_on_fault: false instances with ought to be "restart the instance if the sled it was on rebooted, but not if the propolis-server process crashed"? So, maybe we should have variants like:

enum InstanceAutoRestart {
   Never,
   #[default]
   SledFailuresOnly,
   AllFailures,
}

...

I went ahead and did this in 2db6eff, because I think that if we want to implement auto-restart on sled reboots, and we want it to be the default behavior, we ought to make sure it's the default now for backfilling the auto-restart policy on current instance records. If others disagree that this is the right default, I'm happy to undo that change.

@hawkw hawkw requested a review from davepacheco September 2, 2024 18:58
hawkw added a commit that referenced this pull request Sep 2, 2024
2db6eff added a `SledFailuresOnly`
auto-restart policy in addition to `Never` and `AllFailures`. I
discussed the rationale for that in [this comment][1]. Currently, there
isn't a mechanism to detect whether an instance is `Failed` because the
individual instance crashed or because the whole sled was restarted, so
for now, we assume all failures are instance-level. But, we still need
to handle the new variant.

[1]: #6499 (comment)
@gjcolombo
Contributor

I think the right default is whatever users will find least surprising :)

Going into this, I would've said that "never" is the right default, because my mental model is roughly "I should have to command the control plane to start a stopped instance," and if I don't enable auto-restart for a failed instance, then I have issued no such command. But I suspect I'm in the minority here and that most users will be less surprised by a model that says, "once you've asked for an instance to be running, the control plane tries to keep it running until you ask it not to be running anymore." (FWIW this appears to be the default model in the big three public clouds.)

So, all told: I think I'm the weird one here and am OK with making "reboot on host failure" the default option.


One tangential note about rebooting on VMM failure: I'm fine with having the enum variant here, but before we wire it up and expose it, I'd like to understand how we recover if some instance gets stuck in a panic loop (i.e. where something is wrong with a specific instance that makes Propolis deterministically panic when trying to start it). Such an issue would be a Propolis bug, of course. We don't have to (and probably shouldn't) try to solve that in this thread, but I'd like to put a pin in it.

@hawkw
Member Author

hawkw commented Sep 3, 2024

I think the right default is whatever users will find least surprising :)

Going into this, I would've said that "never" is the right default, because my mental model is roughly "I should have to command the control plane to start a stopped instance," and if I don't enable auto-restart for a failed instance, then I have issued no such command. But I suspect I'm in the minority here and that most users will be less surprised by a model that says, "once you've asked for an instance to be running, the control plane tries to keep it running until you ask it not to be running anymore." (FWIW this appears to be the default model in the big three public clouds.)

So, all told: I think I'm the weird one here and am OK with making "reboot on host failure" the default option.

FWIW, my initial thinking was also that "never" should be the default, because I don't love the idea of implicitly going and doing stuff the user hasn't explicitly asked for. But, yeah, thinking about it, isn't the whole point of "the cloud" that the user should be insulated from host failures? You're absolutely correct that the major public clouds don't generally require you to manually restart your VMMs if one of their hosts reboots --- in many cases, they don't even expose host reboots to the user at all!

I also think that, if instance.boot_on_fault was actually exposed to the user when creating an instance, I would be more comfortable with making "never" the default --- if instances with boot_on_fault: false were that way because the user had not checked a "reboot on fault" box when creating the instance, maybe backfilling auto-restart policies should only enable boot on fault for instances where the user did request it. But all instances currently running on user racks have boot_on_fault: false because (AFAICT) there was no way for users to request otherwise...

One tangential note about rebooting on VMM failure: I'm fine with having the enum variant here, but before we wire it up and expose it, I'd like to understand how we recover if some instance gets stuck in a panic loop (i.e. where something is wrong with a specific instance that makes Propolis deterministically panic when trying to start it). Such an issue would be a Propolis bug, of course. We don't have to (and probably shouldn't) try to solve that in this thread, but I'd like to put a pin in it.

Yeah, in the long term, I think we probably want our restart policies to include a limit on the number of consecutive automatic restarts due to the same class of failure. But, depending on how granular we want those limits to be, we probably want some notion of "failure reasons" in the control plane...that could just be separating "single VMM failures" from "sled rebooted", which will be necessary in order to actually implement the "sled failures only" policy anyway...

@iximeow
Member

iximeow commented Sep 3, 2024

most users will be less surprised by a model that says, "once you've asked for an instance to be running, the control plane tries to keep it running until you ask it not to be running anymore."

i'd agree with this in a different way: rebooting on sled failure feels morally similar to configuring a machine to power on after power loss is detected. that, in turn, is a pretty common intentional choice IME?

@gjcolombo
Contributor

I also think that, if instance.boot_on_fault was actually exposed to the user when creating an instance, I would be more comfortable with making "never" the default --- if instances with boot_on_fault: false were that way because the user had not checked a "reboot on fault" box when creating the instance, maybe backfilling auto-restart policies should only enable boot on fault for instances where the user did request it. But all instances currently running on user racks have boot_on_fault: false because (AFAICT) there was no way for users to request otherwise...

This raises another question, I think. The existing implicit setting here is "never." Currently we have no instance-update API and so have no way for a user to change the setting. Should we wait to change the behavior for existing instances until users have a way to change it back? (There are other good reasons to have an instance update API, e.g. #3769, so there are lots of additional rewards to be gained from that particular side quest.)

@hawkw
Member Author

hawkw commented Sep 3, 2024

I also think that, if instance.boot_on_fault was actually exposed to the user when creating an instance, I would be more comfortable with making "never" the default --- if instances with boot_on_fault: false were that way because the user had not checked a "reboot on fault" box when creating the instance, maybe backfilling auto-restart policies should only enable boot on fault for instances where the user did request it. But all instances currently running on user racks have boot_on_fault: false because (AFAICT) there was no way for users to request otherwise...

This raises another question, I think. The existing implicit setting here is "never." Currently we have no instance-update API and so have no way for a user to change the setting. Should we wait to change the behavior for existing instances until users have a way to change it back? (There are other good reasons to have an instance update API, e.g. #3769, so there are lots of additional rewards to be gained from that particular side quest.)

That is fair; perhaps we should default to "never", allow selecting an alternative policy when creating new instances, and then add an API to change the setting for existing instances. Leaving the default as "never" does seem like the safest choice, but until we add an instance-update API¹, this will force users to delete existing instances and recreate them just to get the "restart automatically if the sled reboots" behavior, which I do still think is probably what a substantial majority of users will want for a substantial majority of instances...

Footnotes

  1. Which...we probably should call something other than that, given that "instance update" now means a very different thing. Maybe "instance modify" or something?

@hawkw
Member Author

hawkw commented Sep 3, 2024

@gjcolombo okay, as we discussed, I've gone back to making Never the default for now. We can figure out how to specify reboot policies for new instances later.

@davepacheco
Collaborator

For what it's worth, my expectations match @iximeow's: this is analogous to power-on-after-power-loss to me. I'd be surprised if a VM I had provisioned was just not running one day because of a server reboot. Also, I think the auto-restart behavior can help smooth over other problems -- e.g., one day we will have some problem where sleds panic occasionally, and it will be much worse if end users have to take action whenever that happens to keep their stuff running.

I do think it'll be important to deal with the consecutive-failed-start problem. That could easily become a DoS (intentional or otherwise) even if it's just a propolis crash. Imagine if a bug causes the VM to trigger a host OS kernel panic. We had more than one issue like this at Joyent, where some workload triggered an OS panic, so the system moved it to another host, which triggered another panic, .... It was ugly. I'd consider a very simple "no more than X restarts in Y minutes" policy, or even "no more than one restart within X minutes", which would only require storing one time_last_autostart_attempted timestamp.

What do you think about having a database value called Unset (or allowing it to be NULL)? That would allow future-us to tell the difference between instances whose policy was explicitly set by the user (which we'd never want to change) vs. ones that we set based on what we thought would make a good default. More to the point: that would allow us to say that right now the behavior is that if you haven't set this, we don't restart anything (for the reasons mentioned about not having a way to change it and not wanting to break existing behavior) but in a future release (when we've fixed that) we could change the default policy while still honoring anything that somebody had set explicitly. Is that overdoing it?

@hawkw
Member Author

hawkw commented Sep 3, 2024

What do you think about having a database value called Unset (or allowing it to be NULL)? That would allow future-us to tell the difference between instances whose policy was explicitly set by the user (which we'd never want to change) vs. ones that we set based on what we thought would make a good default. More to the point: that would allow us to say that right now the behavior is that if you haven't set this, we don't restart anything (for the reasons mentioned about not having a way to change it and not wanting to break existing behavior) but in a future release (when we've fixed that) we could change the default policy while still honoring anything that somebody had set explicitly. Is that overdoing it?

This seems like a pretty appealing compromise, honestly --- I suppose there's no actual reason we have to backfill it except for "it would be nice to make the column non-null", but...it would also be fine to leave it NULL for now, and if we decided to commit to a default behavior later, we could then go back and backfill the NULLs with that default...or not do that. You've sold me!

@hawkw
Member Author

hawkw commented Sep 4, 2024

Okay, 86b0ae6 makes the auto-restart policy nullable, and leaves it NULL for all instances where boot_on_fault is not true (which, in practice, should be all of 'em). Let me know what y'all think!
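
In other words, reusing the `InstanceAutoRestart` enum sketched earlier in this thread, the backfill now looks roughly like this (an illustrative sketch, not the actual migration code):

```rust
/// Revised backfill after 86b0ae6: only instances that explicitly had
/// `boot_on_fault: true` get a policy; everything else stays unset (NULL).
fn backfill(boot_on_fault: bool) -> Option<InstanceAutoRestart> {
    boot_on_fault.then_some(InstanceAutoRestart::AllFailures)
}
```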

hawkw added a commit that referenced this pull request Sep 4, 2024
Commit 86b0ae6 changed the default
value for `instance.auto_restart_policy` to `NULL` when it hasn't been
explicitly specified, so this branch has to be updated to track that.
See #6499 (comment)
@hawkw hawkw merged commit aca08df into main Sep 4, 2024
22 checks passed
@hawkw hawkw deleted the eliza/boot-on-fault-isnt-just-a-bool branch September 4, 2024 20:30
hawkw added a commit that referenced this pull request Sep 23, 2024
As of #6455, instances in the `Failed` state are allowed to be
restarted. As described in [RFD 486 § 4.3], Nexus should automatically
restart such instances when the instance record's `auto_restart_policy`
field (née `boot_on_fault`, see #6499) indicates that it is permitted to
do so. This branch implements this behavior, by adding a new
`instance_reincarnation` RPW. This RPW consists of a background task
which queries CRDB for instances in the `Failed` state which are
eligible to be automatically restarted, and starts a new
`instance-start` saga for any instances it discovers.

Instances are considered eligible for reincarnation if all of the
following conditions are true:

- **The instance is in the `InstanceState::Failed` state.** Meaning, it
  must have *no active VMM* --- if the instance is in the
  `InstanceState::Vmm` state with an active VMM which is in
  `VmmState::Failed`, that indicates no instance-update saga has run to
  clean up the old VMM's resources. In this case, we must wait until the
  instance record itself has transitioned to `Failed` before attempting
  to restart it.
- **The instance's `auto_restart_policy` permits it to be restarted.**
  The `auto_restart_policy` enum was changed from representing the
  classes of failures on which an instance may be restarted (`never`,
  `sled_failures_only`, and `all_failures`) to a more general
  quality-of-service setting for automatic restarts. Presently, this can
  either be `never`, meaning that the instance can never be restarted
  automatically; or `best_effort`, meaning that the control plane should
  try to keep the instance running but is permitted to not restart it if
  necessary to preserve the stability of the whole system. The default
  policy for instances which don't provide one is `best_effort`.
- **A cooldown period has elapsed since the instance's last automatic
  restart.** In order to prevent instances which fail frequently from
  compromising the reliability of the system as a whole, we enforce a
  cooldown period between automatic restarts. This is tracked by setting
  a last-auto-restart timestamp in the `instance` record. At present,
  the cooldown period is always one hour unless the instance record
  overrides it, which currently can only be done using a test-only Nexus
  method. In the future, we may consider allowing the cooldown period to
  be configured by users, but because it presents a denial-of-service
  risk, setting it should probably be a more privileged operation than
  creating an instance. (A minimal sketch of this eligibility check
  follows this list.)
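
Putting the second and third conditions together, here is a minimal sketch of the eligibility check, assuming the policy shape described above (`never`/`best_effort`, defaulting to `best_effort`) and the one-hour cooldown; it is illustrative only, not the actual Nexus query logic:

```rust
use chrono::{DateTime, Duration, Utc};

/// Illustrative sketch of the policy as described above; the real definition
/// lives in the db-model crate and uses project-specific macros.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum InstanceAutoRestart {
    Never,
    #[default]
    BestEffort,
}

/// Returns whether an already-`Failed` instance may be auto-restarted: its
/// policy must permit it, and the cooldown must have elapsed since the last
/// automatic restart (if there was one).
fn eligible_for_reincarnation(
    policy: Option<InstanceAutoRestart>,
    time_last_auto_restarted: Option<DateTime<Utc>>,
    now: DateTime<Utc>,
) -> bool {
    if policy.unwrap_or_default() == InstanceAutoRestart::Never {
        return false;
    }
    // One hour unless the instance record overrides it (not modeled here).
    let cooldown = Duration::hours(1);
    match time_last_auto_restarted {
        Some(last) => now.signed_duration_since(last) >= cooldown,
        None => true,
    }
}
```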

The implementation of the actual reincarnation background task is
relatively straightforward: it runs a new
`DataStore::find_reincarnatable_instances` query, which returns
instances in the `Failed` state, and it spawns `instance-start` sagas
for any instances which satisfy the above conditions. The restart
cooldown is implemented by adding a new `time_last_auto_restarted` field
to the `instance` table and to the `InstanceRuntimeState` type. We add a
`Reason` enum to the `instance-start` saga's `Params` that indicates why
the instance is being started, and if it is `Reason::AutoRestart`, then
the start saga will set the `time_last_auto_restarted` timestamp when
moving the instance record to the `Starting` state. Additionally, I've
added the `auto_restart_policy` field to the `params::InstanceCreate`
type in the Nexus API so that instances can be created with auto-restart
policies. Reviewers should note that many of the files changed in this
diff were only modified by adding that field to a
`params::InstanceCreate` literal.
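
For orientation, here is a very rough sketch of that control flow. Every type and method name below (`InstanceId`, `InstanceRecord`, `Datastore`, `StartSaga`) is an illustrative stand-in rather than the real Nexus API; only `Reason::AutoRestart` and `find_reincarnatable_instances` are named in the description above:

```rust
/// Stand-in identifier type for the sketch.
#[derive(Clone, Copy, Debug)]
struct InstanceId(u128);

/// Stand-in for an eligible instance row returned by the query.
struct InstanceRecord {
    id: InstanceId,
}

/// Why an instance-start saga is running; `AutoRestart` tells the saga to
/// stamp `time_last_auto_restarted` when moving the instance to `Starting`.
#[derive(Clone, Copy, Debug)]
enum Reason {
    AutoRestart,
}

/// Stand-in for the datastore query described above.
trait Datastore {
    fn find_reincarnatable_instances(&self) -> Vec<InstanceRecord>;
}

/// Stand-in for kicking off an `instance-start` saga.
trait StartSaga {
    fn start_instance(&self, id: InstanceId, reason: Reason);
}

/// One activation of the background task: find eligible Failed instances
/// and start an instance-start saga for each of them.
fn activate(datastore: &dyn Datastore, sagas: &dyn StartSaga) {
    for instance in datastore.find_reincarnatable_instances() {
        sagas.start_instance(instance.id, Reason::AutoRestart);
    }
}
```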

This branch adds a minimal, initial implementation of instance
auto-restarting.

[RFD 486 § 4.3]: https://rfd.shared.oxide.computer/rfd/486#_instances

Fixes #5553