
[nexus] Turn instance.boot_on_fault into an enum #6499

Merged
merged 8 commits into main from eliza/boot-on-fault-isnt-just-a-bool
Sep 4, 2024

Conversation

hawkw
Member

@hawkw hawkw commented Aug 31, 2024

Currently, the instance table has a boot_on_fault column, which is a
bool. This is intended to indicate whether an instance should be
automatically restarted in response to failures, although it isn't
actually consumed by anything yet. However, as determined in RFD 486,
Nexus will eventually grow functionality for automatically restarting
instances that have transitioned to the Failed state, if
boot_on_fault is set.

@gjcolombo suggests that this field should probably be replaced with an
enum, rather than a bool. This way, we could represent more
boot-on-fault disciplines than "always reboot on faults" and "never do
that". Since nothing is currently consuming this field, it's probably a
good idea to go ahead and do the requisite SQL surgery to turn it into
an enum now, before we write code that actually consumes
it...especially since I'm planning on actually doing that next (see
#6491).

This commit replaces the boot_on_fault column with a new
auto_restart_policy column. The type for this column is a newly-added
InstanceAutoRestart enum, which currently has variants for
AllFailures (the instance should be automatically restarted any time
it fails) and Never (the instance should never be automatically
restarted). The database migration adding auto_restart_policy
backfills any existing boot_on_fault: true instances with
InstanceAutoRestart::AllFailures, and boot_on_fault: false with
InstanceAutoRestart::Never.
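
To make that mapping concrete, here is a minimal sketch of the new enum and the backfill described above. It uses plain Rust types for illustration only; the real definitions in nexus/db-model use project-specific macros for the database enum mapping, which aren't shown here.

```rust
/// Illustrative sketch of the new policy enum (initial shape from this PR).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum InstanceAutoRestart {
    /// Never restart the instance automatically.
    Never,
    /// Restart the instance any time it fails.
    AllFailures,
}

/// Backfill mapping applied by the migration, per the description above.
fn backfill(boot_on_fault: bool) -> InstanceAutoRestart {
    if boot_on_fault {
        InstanceAutoRestart::AllFailures
    } else {
        InstanceAutoRestart::Never
    }
}
```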

Closes #6490

@hawkw hawkw requested review from smklein and gjcolombo August 31, 2024 16:05
@hawkw
Member Author

hawkw commented Aug 31, 2024

Most of this change is pretty straightforward; I welcome input from reviewers on what to name the new fields, but the bulk of the complexity is the database migration that replaces boot_on_fault with auto_restart_policy. I'm pretty sure I've done that correctly, but I'd love extra attention on that part.

@hawkw hawkw force-pushed the eliza/boot-on-fault-isnt-just-a-bool branch from 7b9f541 to a89df2f Compare August 31, 2024 16:42
@hawkw
Member Author

hawkw commented Aug 31, 2024

Note that this PR is currently stacked on top of #6455, because I wanted to start working on a change that depends on both of them.

@hawkw hawkw changed the base branch from main to eliza/486-sa-errors August 31, 2024 16:43
@hawkw
Member Author

hawkw commented Sep 1, 2024

@hawkw
Member Author

hawkw commented Sep 2, 2024

Hmm, I'm not sure whether this CI failure is my fault...will have to investigate further: https://buildomat.eng.oxide.computer/wg/0/details/01J6Q8748MZKW18KSEF4540CQZ/iy5XpuAznwyLAetcf2Mfjh1o9XO11hzknMn8Mr0yhl9xfXK3/01J6Q87HTCRRS09RECVKRHBBSR#S5113

Rebased the branch and it went away; presumably this was a flake: #6506

@hawkw
Member Author

hawkw commented Sep 2, 2024

Reading over #4872, it occurs to me that we may not actually want the default InstanceAutoRestart policy to be Never --- maybe the default that we backfill existing boot_on_fault: false instances with ought to be "restart the instance if the sled it was on rebooted, but not if the propolis-server process crashed"? So, maybe we should have variants like:

enum InstanceAutoRestart {
   Never,
   #[default]
   SledFailuresOnly,
   AllFailures,
}

Unfortunately, we don't currently have a way for Nexus to differentiate between instances that have disappeared because their whole sled rebooted and instances that disappeared because their individual VMM crashed, so actually implementing the "sled failures only" policy will require a bit more work. But, maybe we should still make it the default now so that existing instances get that behavior once we can actually implement it?

cc @gjcolombo @davepacheco what do you think?

Base automatically changed from eliza/486-sa-errors to main September 2, 2024 18:41
@hawkw hawkw force-pushed the eliza/boot-on-fault-isnt-just-a-bool branch from 24b49e5 to 9ca4c5e Compare September 2, 2024 18:43
@hawkw hawkw force-pushed the eliza/boot-on-fault-isnt-just-a-bool branch from 637b780 to 2db6eff Compare September 2, 2024 18:52
@hawkw
Member Author

hawkw commented Sep 2, 2024

Reading over #4872, it occurs to me that we may not actually want the default InstanceAutoRestart policy to be Never --- maybe the default that we backfill existing boot_on_fault: false instances with ought to be "restart the instance if the sled it was on rebooted, but not if the propolis-server process crashed"? So, maybe we should have variants like:

enum InstanceAutoRestart {
   Never,
   #[default]
   SledFailuresOnly,
   AllFailures,
}

...

I went ahead and did this in 2db6eff, because I think that if we want to implement auto-restart on sled reboots, and we want it to be the default behavior, we ought to make sure it's the default now for backfilling the auto-restart policy on current instance records. If others disagree that this is the right default, I'm happy to undo that change.

@hawkw hawkw requested a review from davepacheco September 2, 2024 18:58
hawkw added a commit that referenced this pull request Sep 2, 2024
2db6eff added a `SledFailuresOnly`
auto-restart policy in addition to `Never` and `AllFailures`. I
discussed the rationale for that in [this comment][1]. Currently, there
isn't a mechanism to detect whether an instance is `Failed` because the
individual instance crashed or because the whole sled was restarted, so
for now, we assume all failures are instance-level. But, we still need
to handle the new variant.

[1]: #6499 (comment)
@gjcolombo
Contributor

I think the right default is whatever users will find least surprising :)

Going into this, I would've said that "never" is the right default, because my mental model is roughly "I should have to command the control plane to start a stopped instance," and if I don't enable auto-restart for a failed instance, then I have issued no such command. But I suspect I'm in the minority here and that most users will be less surprised by a model that says, "once you've asked for an instance to be running, the control plane tries to keep it running until you ask it not to be running anymore." (FWIW this appears to be the default model in the big three public clouds.)

So, all told: I think I'm the weird one here and am OK with making "reboot on host failure" the default option.


One tangential note about rebooting on VMM failure: I'm fine with having the enum variant here, but before we wire it up and expose it, I'd like to understand how we recover if some instance gets stuck in a panic loop (i.e. where something is wrong with a specific instance that makes Propolis deterministically panic when trying to start it). Such an issue would be a Propolis bug, of course. We don't have to (and probably shouldn't) try to solve that in this thread, but I'd like to put a pin in it.

@hawkw
Member Author

hawkw commented Sep 3, 2024

I think the right default is whatever users will find least surprising :)

Going into this, I would've said that "never" is the right default, because my mental model is roughly "I should have to command the control plane to start a stopped instance," and if I don't enable auto-restart for a failed instance, then I have issued no such command. But I suspect I'm in the minority here and that most users will be less surprised by a model that says, "once you've asked for an instance to be running, the control plane tries to keep it running until you ask it not to be running anymore." (FWIW this appears to be the default model in the big three public clouds.)

So, all told: I think I'm the weird one here and am OK with making "reboot on host failure" the default option.

FWIW, my initial thinking was also that "never" should be the default, because I don't love the idea of implicitly going and doing stuff the user hasn't explicitly asked for. But, yeah, thinking about it, isn't the whole point of "the cloud" that the user should be insulated from host failures? You're absolutely correct that the major public clouds don't generally require you to manually restart your VMMs if one of their hosts reboots --- in many cases, they don't even expose host reboots to the user at all!

I also think that, if instance.boot_on_fault was actually exposed to the user when creating an instance, I would be more comfortable with making "never" the default --- if instances with boot_on_fault: false were that way because the user had not checked a "reboot on fault" box when creating the instance, maybe backfilling auto-restart policies should only enable boot on fault for instances where the user did request it. But all instances currently running on user racks have boot_on_fault: false because (AFAICT) there was no way for users to request otherwise...

One tangential note about rebooting on VMM failure: I'm fine with having the enum variant here, but before we wire it up and expose it, I'd like to understand how we recover if some instance gets stuck in a panic loop (i.e. where something is wrong with a specific instance that makes Propolis deterministically panic when trying to start it). Such an issue would be a Propolis bug, of course. We don't have to (and probably shouldn't) try to solve that in this thread, but I'd like to put a pin in it.

Yeah, in the long term, I think we probably want our restart policies to include a limit on the number of consecutive automatic restarts due to the same class of failure. But, depending on how granular we want those limits to be, we probably want some notion of "failure reasons" in the control plane...that could just be separating "single VMM failures" from "sled rebooted", which will be necessary in order to actually implement the "sled failures only" policy anyway...

@iximeow
Member

iximeow commented Sep 3, 2024

most users will be less surprised by a model that says, "once you've asked for an instance to be running, the control plane tries to keep it running until you ask it not to be running anymore."

i'd agree with this in a different way: rebooting on sled failure feels morally similar to configuring a machine to power on after power loss is detected. that, in turn, is a pretty common intentional choice IME?

@gjcolombo
Contributor

I also think that, if instance.boot_on_fault was actually exposed to the user when creating an instance, I would be more comfortable with making "never" the default --- if instances with boot_on_fault: false were that way because the user had not checked a "reboot on fault" box when creating the instance, maybe backfilling auto-restart policies should only enable boot on fault for instances where the user did request it. But all instances currently running on user racks have boot_on_fault: false because (AFAICT) there was no way for users to request otherwise...

This raises another question, I think. The existing implicit setting here is "never." Currently we have no instance-update API and so have no way for a user to change the setting. Should we wait to change the behavior for existing instances until users have a way to change it back? (There are other good reasons to have an instance update API, e.g. #3769, so there are lots of additional rewards to be gained from that particular side quest.)

@hawkw
Member Author

hawkw commented Sep 3, 2024

I also think that, if instance.boot_on_fault was actually exposed to the user when creating an instance, I would be more comfortable with making "never" the default --- if instances with boot_on_fault: false were that way because the user had not checked a "reboot on fault" box when creating the instance, maybe backfilling auto-restart policies should only enable boot on fault for instances where the user did request it. But all instances currently running on user racks have boot_on_fault: false because (AFAICT) there was no way for users to request otherwise...

This raises another question, I think. The existing implicit setting here is "never." Currently we have no instance-update API and so have no way for a user to change the setting. Should we wait to change the behavior for existing instances until users have a way to change it back? (There are other good reasons to have an instance update API, e.g. #3769, so there are lots of additional rewards to be gained from that particular side quest.)

That is fair; perhaps we should default to "never", allow selecting an alternative policy when creating new instances, and then add an API to change the setting for existing instances. Leaving the default as "never" does seem like the safest choice, but until we add an instance-update API¹, this will force users to delete existing instances and recreate them just to get the "restart automatically if the sled reboots" behavior, which I do still think is probably what a substantial majority of users will want for a substantial majority of instances...

Footnotes

  1. Which...we probably should call something other than that, given that "instance update" now means a very different thing. Maybe "instance modify" or something?

@hawkw
Member Author

hawkw commented Sep 3, 2024

@gjcolombo okay, as we discussed, I've gone back to making Never the default for now. We can figure out how to specify reboot policies for new instances later.

@davepacheco
Collaborator

For what it's worth, my expectations match @iximeow's: this is analogous to power-on-after-power-loss to me. I'd be surprised if a VM I had provisioned was just not running one day because of a server reboot. Also, I think the auto-restart behavior can help smooth over other problems -- e.g., one day we will have some problem where sleds panic occasionally, and it will be much worse if end users have to take action whenever that happens to keep their stuff running.

I do think it'll be important to deal with the consecutive-failed-start problem. That could easily become a DoS (intentional or otherwise) even if it's just a propolis crash. Imagine if a bug causes the VM to trigger a host OS kernel panic. We had more than one issue like this at Joyent, where some workload triggered an OS panic, so the system moved it to another host, which triggered another panic, .... It was ugly. I'd consider a very simple "no more than X restarts in Y minutes" policy, or even "no more than one restart within X minutes", which would only require storing one time_last_autostart_attempted timestamp.

What do you think about having a database value called Unset (or allowing it to be NULL)? That would allow future-us to tell the difference between instances whose policy was explicitly set by the user (which we'd never want to change) vs. ones that we set based on what we thought would make a good default. More to the point: that would allow us to say that right now the behavior is that if you haven't set this, we don't restart anything (for the reasons mentioned about not having a way to change it and not wanting to break existing behavior) but in a future release (when we've fixed that) we could change the default policy while still honoring anything that somebody had set explicitly. Is that overdoing it?

@hawkw
Member Author

hawkw commented Sep 3, 2024

What do you think about having a database value called Unset (or allowing it to be NULL)? That would allow future-us to tell the difference between instances whose policy was explicitly set by the user (which we'd never want to change) vs. ones that we set based on what we thought would make a good default. More to the point: that would allow us to say that right now the behavior is that if you haven't set this, we don't restart anything (for the reasons mentioned about not having a way to change it and not wanting to break existing behavior) but in a future release (when we've fixed that) we could change the default policy while still honoring anything that somebody had set explicitly. Is that overdoing it?

This seems like a pretty appealing compromise, honestly --- I suppose there's no actual reason we have to backfill it except for "it would be nice to make the column non-null", but...it would also be fine to leave it NULL for now, and if we decided to commit to a default behavior later, we could then go back and backfill the NULLs with that default...or not do that. You've sold me!

@hawkw
Member Author

hawkw commented Sep 4, 2024

Okay, 86b0ae6 makes the auto-restart policy nullable, and leaves it NULL for all instances where boot_on_fault is not true (which, in practice, should be all of 'em). Let me know what y'all think!
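
In other words, reusing the `InstanceAutoRestart` enum sketched earlier in this thread, the backfill now looks roughly like this (an illustrative sketch, not the actual migration code):

```rust
/// Revised backfill after 86b0ae6: only instances that explicitly had
/// `boot_on_fault: true` get a policy; everything else stays unset (NULL).
fn backfill(boot_on_fault: bool) -> Option<InstanceAutoRestart> {
    boot_on_fault.then_some(InstanceAutoRestart::AllFailures)
}
```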

hawkw added a commit that referenced this pull request Sep 4, 2024
Commit 86b0ae6 changed the default
value for `instance.auto_restart_policy` to `NULL` when it hasn't been
explicitly specified, so this branch has to be updated to track that.
See #6499 (comment)
@hawkw hawkw merged commit aca08df into main Sep 4, 2024
22 checks passed
@hawkw hawkw deleted the eliza/boot-on-fault-isnt-just-a-bool branch September 4, 2024 20:30
hawkw added a commit that referenced this pull request Sep 23, 2024
As of #6455, instances in the `Failed` state are allowed to be
restarted. As described in [RFD 486 § 4.3], Nexus should automatically
restart such instances when the instance record's `auto_restart_policy`
field (née `boot_on_fault`, see #6499) indicates that it is permitted to
do so. This branch implements this behavior, by adding a new
`instance_reincarnation` RPW. This RPW consists of a background task
which queries CRDB for instances in the `Failed` state which are
eligible to be automatically restarted, and starts a new
`instance-start` saga for any instances it discovers.

Instances are considered eligible for reincarnation if all of the
following conditions are true:

- **The instance is in the `InstanceState::Failed` state.** Meaning, it
  must have *no active VMM* --- if the instance is in the
  `InstanceState::Vmm` state with an active VMM which is in
  `VmmState::Failed`, that indicates no instance-update saga has run to
  clean up the old VMM's resources. In this case, we must wait until the
  instance record itself has transitioned to `Failed` before attempting
  to restart it.
- **The instance's `auto_restart_policy` permits it to be restarted.**
  The `auto_restart_policy` enum was changed from representing the
  classes of failures on which an instance may be restarted (`never`,
  `sled_failures_only`, and `all_failures`) to a more general
  quality-of-service setting for automatic restarts. Presently, this can
  either be `never`, meaning that the instance can never be restarted
  automatically; or `best_effort`, meaning that the control plane should
  try to keep the instance running but is permitted to not restart it if
  necessary to preserve the stability of the whole system. The default
  policy for instances which don't provide one is `best_effort`.
- **A cooldown period has elapsed since the instance's last automatic
  restart.** In order to prevent instances which fail frequently from
  compromising the reliability of the system as a whole, we enforce a
  cooldown period between automatic restarts. This is tracked by setting
  a last-auto-restart timestamp in the `instance` record. At present,
  the cooldown period is always one hour unless the instance record
  overrides it, which currently can only be done using a test-only Nexus
  method. In the future, we may consider allowing the cooldown period to
  be configured by users, but because it presents a denial-of-service
  risk, setting it should probably be a more privileged operation than
  creating an instance. (A minimal sketch of this eligibility check
  follows this list.)
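
Putting the second and third conditions together, here is a minimal sketch of the eligibility check, assuming the policy shape described above (`never`/`best_effort`, defaulting to `best_effort`) and the one-hour cooldown; it is illustrative only, not the actual Nexus query logic:

```rust
use chrono::{DateTime, Duration, Utc};

/// Illustrative sketch of the policy as described above; the real definition
/// lives in the db-model crate and uses project-specific macros.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum InstanceAutoRestart {
    Never,
    #[default]
    BestEffort,
}

/// Returns whether an already-`Failed` instance may be auto-restarted: its
/// policy must permit it, and the cooldown must have elapsed since the last
/// automatic restart (if there was one).
fn eligible_for_reincarnation(
    policy: Option<InstanceAutoRestart>,
    time_last_auto_restarted: Option<DateTime<Utc>>,
    now: DateTime<Utc>,
) -> bool {
    if policy.unwrap_or_default() == InstanceAutoRestart::Never {
        return false;
    }
    // One hour unless the instance record overrides it (not modeled here).
    let cooldown = Duration::hours(1);
    match time_last_auto_restarted {
        Some(last) => now.signed_duration_since(last) >= cooldown,
        None => true,
    }
}
```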

The implementation of the actual reincarnation background task is
relatively straightforward: it runs a new
`DataStore::find_reincarnatable_instances` query, which returns
instances in the `Failed` state, and it spawns `instance-start` sagas
for any instances which satisfy the above conditions. The restart
cooldown is implemented by adding a new `time_last_auto_restarted` field
to the `instance` table and to the `InstanceRuntimeState` type. We add a
`Reason` enum to the `instance-start` saga's `Params` that indicates why
the instance is being started, and if it is `Reason::AutoRestart`, then
the start saga will set the `time_last_auto_restarted` timestamp when
moving the instance record to the `Starting` state. Additionally, I've
added the `auto_restart_policy` field to the `params::InstanceCreate`
type in the Nexus API so that instances can be created with auto-restart
policies. Reviewers should note that many of the files changed in this
diff were only modified by adding that field to a
`params::InstanceCreate` literal.
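
For orientation, here is a very rough sketch of that control flow. Every type and method name below (`InstanceId`, `InstanceRecord`, `Datastore`, `StartSaga`) is an illustrative stand-in rather than the real Nexus API; only `Reason::AutoRestart` and `find_reincarnatable_instances` are named in the description above:

```rust
/// Stand-in identifier type for the sketch.
#[derive(Clone, Copy, Debug)]
struct InstanceId(u128);

/// Stand-in for an eligible instance row returned by the query.
struct InstanceRecord {
    id: InstanceId,
}

/// Why an instance-start saga is running; `AutoRestart` tells the saga to
/// stamp `time_last_auto_restarted` when moving the instance to `Starting`.
#[derive(Clone, Copy, Debug)]
enum Reason {
    AutoRestart,
}

/// Stand-in for the datastore query described above.
trait Datastore {
    fn find_reincarnatable_instances(&self) -> Vec<InstanceRecord>;
}

/// Stand-in for kicking off an `instance-start` saga.
trait StartSaga {
    fn start_instance(&self, id: InstanceId, reason: Reason);
}

/// One activation of the background task: find eligible Failed instances
/// and start an instance-start saga for each of them.
fn activate(datastore: &dyn Datastore, sagas: &dyn StartSaga) {
    for instance in datastore.find_reincarnatable_instances() {
        sagas.start_instance(instance.id, Reason::AutoRestart);
    }
}
```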

This branch adds a minimal, initial implementation of instance
auto-restarting.

[RFD 486 § 4.3]: https://rfd.shared.oxide.computer/rfd/486#_instances

Fixes #5553