
restart customer Instances after sled reboot #3633

Closed
davepacheco opened this issue Jul 14, 2023 · 6 comments
Labels
known issue To include in customer documentation and training

Comments

@davepacheco
Collaborator

I haven't verified this but after talking with @smklein we believe that if a sled reboots, any customer Instances that were running on that system will no longer be running (not there, nor anywhere). But the API state will probably reflect that they are still running. It's not clear if there'd be any way to get them running again.

Part of the design here was that the sled_agent_put() call from the Sled Agent to Nexus would be an opportunity for Nexus to verify that the expected Instances were still running. In practice, this probably needs to trigger an RFD 373-style RPW that determines what's supposed to be on each sled, what's actually running there, and fixes things appropriately. It might be cleanest to factor that into two RPWs (a rough sketch follows the list below):

  • one RPW: checks what's supposed to be running on each Sled, checks what's running there, and for any discrepancies, marks the Instance failed (or something like that)
  • second RPW: for each failed Instance, try to start it elsewhere
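To make the two-RPW split above concrete, here's a minimal sketch. All of the types and function names are hypothetical (nothing here is actual Omicron code); the point is just the shape: one pass reconciles intended vs. observed state and marks discrepancies failed, and a second pass re-places failed instances.

```rust
// Hypothetical sketch of the two-RPW split; all names here are illustrative.
use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq)]
enum InstanceState {
    Running,
    Failed,
}

struct InstanceRecord {
    state: InstanceState,
    sled: Option<u32>,
}

/// RPW 1: compare what the control plane thinks should be running on a sled
/// against what the sled actually reports, and mark anything missing Failed.
fn reconcile_sled(
    expected: &mut BTreeMap<String, InstanceRecord>,
    actually_running: &[String],
) {
    for (name, rec) in expected.iter_mut() {
        if rec.state == InstanceState::Running && !actually_running.contains(name) {
            rec.state = InstanceState::Failed;
        }
    }
}

/// RPW 2: for each Failed instance, try to start it elsewhere by going back
/// through a placement step (represented here by `pick_sled`).
fn restart_failed(
    expected: &mut BTreeMap<String, InstanceRecord>,
    pick_sled: impl Fn() -> Option<u32>,
) {
    for rec in expected.values_mut() {
        if rec.state == InstanceState::Failed {
            if let Some(sled) = pick_sled() {
                rec.sled = Some(sled);
                rec.state = InstanceState::Running;
            }
        }
    }
}
```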

There's a related issue here around sleds that have failed more permanently. I'd suggest we treat this as a different kind of thing and not try to automatically detect this using a heartbeat mechanism or something like that. That kind of automation can make things worse. For this case (which really should be rare), we could require that an operator mark the sled as "permanently gone -- remove it from the cluster", after which we mark its Instances failed.

@davepacheco davepacheco added this to the MVP milestone Jul 14, 2023
@askfongjojo

askfongjojo commented Jul 21, 2023

One thing that came up in my conversation with our customer onsite is the ability for them to specify the autoboot behavior. If we can expose that as a user-configurable option, we don't have to make the decision for them on whether to bring up an instance when a sled-agent comes back up. Ideally (a rough sketch follows the list below):

  1. If autoboot is set to true, automatically recreate the propolis zone and start the instance.
  2. If autoboot is set to false, mark the instance as stopped in CRDB (plus any other necessary cleanup as if the instance was stopped by the user).
  3. If a zone is stuck in starting or stopping state, user or sled-agent has the ability to move it to failed state. This desired behavior is something not strictly related to sled/sled-agent reboot - it's useful even for an instance that doesn't get into running state for whatever reason. The biggest pain in this situation is that the disks attached to the instance are stuck and cannot be used any more (i.e. it's considered data loss from a user perspective).
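A rough sketch of what items 1 and 2 could look like as a per-instance setting. The `autoboot` name and these types are purely illustrative assumptions, not an existing Omicron API:

```rust
// Hypothetical per-instance knob covering items 1 and 2 above; illustrative only.
#[derive(Clone, Copy, Debug)]
enum SledRecoveryAction {
    /// autoboot = true: recreate the propolis zone and start the instance.
    RestartInstance,
    /// autoboot = false: record the instance as stopped in CRDB and do the
    /// same cleanup as a user-initiated stop.
    MarkStopped,
}

fn action_after_sled_reboot(autoboot: bool) -> SledRecoveryAction {
    if autoboot {
        SledRecoveryAction::RestartInstance
    } else {
        SledRecoveryAction::MarkStopped
    }
}
```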

@askfongjojo

askfongjojo commented Jul 21, 2023

we could require that an operator mark the sled as "permanently gone -- remove it from the cluster", after which we mark its Instances failed.

A thought related to this different scenario: perhaps we can also mark the instance stopped and reset the active_sled_id and active_propolis_id in CRDB? This assumes we wire in a mechanism to pick a sled when starting an instance that has a NULL value in these attributes. Or we'd always blank out the sled/propolis ids as part of the process of stopping instances, as discussed in #2315.
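As a sketch of that idea: the column names come from this comment, but the struct and function below are hypothetical, not the real instance model.

```rust
// Hypothetical sketch; assumes the `uuid` crate for the id type.
use uuid::Uuid;

struct InstanceRuntime {
    active_sled_id: Option<Uuid>,
    active_propolis_id: Option<Uuid>,
    stopped: bool,
}

/// Mark the instance stopped and blank out its sled/propolis bindings so that
/// the next start request goes back through normal sled placement.
fn mark_stopped_and_unbind(rt: &mut InstanceRuntime) {
    rt.stopped = true;
    rt.active_sled_id = None;
    rt.active_propolis_id = None;
}
```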

To put things in perspective, I'd suggest that you substitute "customer instance" with "buildomat" and imagine how you'd want sled-agent to handle it in the scenarios we've discussed so far in this ticket (i.e. sled reboot, sled gone, instance staying in failed/starting/stopping status). 😅

@gjcolombo
Contributor

Some drive-by commentary:

I haven't verified this but after talking with @smklein we believe that if a sled reboots, any customer Instances that were running on that system will no longer be running (not there, nor anywhere). But the API state will probably reflect that they are still running. It's not clear if there'd be any way to get them running again.

This is right AFAIK--the zones are gone, the instances aren't running, and there's no way to restore them to exactly the state they had when the sled rebooted. They can be cold-booted onto the same sled, but sled agent will need to be told to do this (it won't come back up and realize "oh hey I was running such-and-such instances here" and automatically restart them).

One thing that came up in my conversation with our customer onsite is the ability for them to specify the autoboot behavior.

I strongly agree with this--this should be configurable, if not now then in the (relatively) near future. With the caveat that I know basically nothing about Buildomat's internals, I can easily see it being an example of a sort of system where you wouldn't necessarily want a VM to come back up automatically if its sled reboots: Buildomat scheduler creates a VM; scheduler sends agent on the VM a set of commands; VM's sled reboots; scheduler decides the job is unresponsive and gives up on it; if the agent is then just sitting there waiting for commands, you've got a zombie VM. I can imagine enough workloads of this kind (i.e. where I don't want the VM to start unless I'm there to tell it to do something) that I feel pretty strongly that this behavior should be configurable.

If a zone is stuck in starting or stopping state, user or sled-agent has the ability to move it to failed state. This desired behavior is something not strictly related to sled/sled-agent reboot - it's useful even for an instance that doesn't get into running state for whatever reason. The biggest pain in this situation is that the disks attached to the instance are stuck and cannot be used any more (i.e. it's considered data loss from a user perspective).

FWIW there's now a sled agent API (instance_unregister) that forcibly terminates and unregisters an instance's zone irrespective of its prior state. There's currently no way to trigger it explicitly from Nexus (it's only invoked in the undo steps for sagas that rely on registering an instance), so we'd have to add that (and properly explain the semantics of that operation).

@jclulow
Collaborator

jclulow commented Jul 21, 2023

I think it's probably better not to think of this as "autoboot" per se, as we're asking the user to think of the whole rack (and eventually a fleet of racks) as "the computer". From that perspective an individual sled rebooting is more like an individual disk or DIMM failing: that the sled "boots up" is an implementation detail.

Rather, if we expose a per-instance property, I think it should be explicitly about what to do after a fault that interrupts the instance. Something like "on_fault", which could have an initial choice of "restart" or "none" or something like that.
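A rough sketch of what such a property might look like; the variant names mirror this comment ("restart" or "none") and none of this is an actual Omicron type:

```rust
// Hypothetical per-instance fault policy; names are illustrative only.
#[derive(Clone, Copy, Debug, Default)]
enum OnFault {
    /// Restart the instance after a fault interrupts it.
    Restart,
    /// Leave the instance stopped until the user starts it again.
    #[default]
    None,
}

fn should_restart_after_fault(policy: OnFault) -> bool {
    matches!(policy, OnFault::Restart)
}
```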

Additionally, I think it's important that we not consider customer instances as living on a particular sled. If they're restarted after a fault, they should go back through the regular instance placement process (in Nexus) that occurred when they were initially started, potentially ending up on a different sled.

@jclulow
Collaborator

jclulow commented Jul 21, 2023

As an aside, buildomat uses a lot of AWS machines today, and we maintain a local catalogue of instances that we intend to create and have successfully been able to create. On instance boot we also register with the buildomat central API, which, if it occurs again later, we can use as a signal that something has gone terribly wrong with a particular instance.

We're also able to detect through listing instances any which are surplus to requirements and clean them up. I think you pretty much have to do all this if you're managing infrastructure in an EC2-like cloud environment and intending not to leak things.
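For illustration, the cleanup half of that can be as simple as a set difference between a local catalogue of intended instances and what the cloud reports. This is a hypothetical sketch, not buildomat's actual code:

```rust
// Hypothetical reconciliation helper; names are illustrative only.
use std::collections::BTreeSet;

/// Instances that exist in the cloud but are not in our catalogue of intended
/// instances -- candidates for cleanup so we don't leak anything.
fn surplus_instances(
    catalogue: &BTreeSet<String>,
    cloud_listing: &BTreeSet<String>,
) -> Vec<String> {
    cloud_listing.difference(catalogue).cloned().collect()
}
```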

@askfongjojo askfongjojo modified the milestones: MVP, 1.0.2 Aug 2, 2023
@iliana iliana added the known issue To include in customer documentation and training label Aug 11, 2023
@askfongjojo askfongjojo modified the milestones: 1.0.2, 1.0.3 Aug 22, 2023
@askfongjojo askfongjojo modified the milestones: 1.0.3, 3 Sep 1, 2023
@askfongjojo askfongjojo modified the milestones: 3, 4 Oct 21, 2023
@morlandi7 morlandi7 modified the milestones: 4, 5 Nov 14, 2023
@morlandi7 morlandi7 modified the milestones: 5, 6 Dec 4, 2023
@morlandi7 morlandi7 modified the milestones: 6, 7 Jan 25, 2024
@askfongjojo askfongjojo modified the milestones: 7, 9 Mar 9, 2024
@morlandi7 morlandi7 modified the milestones: 9, 10 Jul 1, 2024
@morlandi7 morlandi7 modified the milestones: 10, 11 Aug 14, 2024
@hawkw
Member

hawkw commented Sep 24, 2024

Implemented in #6503.

@hawkw hawkw closed this as completed Sep 24, 2024