
[nexus] Reincarnate Failed instances #6503

Merged: hawkw merged 111 commits into main from eliza/instance-resurrection on Sep 23, 2024

Conversation

@hawkw (Member) commented Sep 1, 2024

As of #6455, instances in the Failed state are allowed to be restarted. As described in RFD 486 § 4.3, Nexus should automatically restart such instances when the instance record's auto_restart_policy field (née boot_on_fault, see #6499) indicates that it is permitted to do so. This branch implements this behavior by adding a new instance_reincarnation RPW. This RPW consists of a background task which queries CRDB for instances in the Failed state that are eligible to be automatically restarted, and starts a new instance-start saga for each instance it discovers.

Instances are considered eligible for reincarnation if all of the following conditions are true:

  • The instance is in the InstanceState::Failed state. That is, it must have no active VMM --- if the instance is in the InstanceState::Vmm state with an active VMM which is in VmmState::Failed, that indicates no instance-update saga has yet run to clean up the old VMM's resources. In this case, we must wait until the instance record itself has transitioned to Failed before attempting to restart it.
  • The instance's auto_restart_policy permits it to be restarted. Presently, this means that only instances with the InstanceAutoRestart::AllFailures policy will be automatically restarted. The InstanceAutoRestart::SledFailuresOnly policy permits the instance to be restarted if and only if Nexus is able to definitively determine the instance failed due to a fault that affected the entire sled, and, at present, we don't record this information when transitioning an instance to Failed. Therefore, we will not currently reincarnate such instances. Future work will provide mechanisms for Nexus to determine whether an instance has transitioned to Failed due to a sled-wide failure, permitting instances with the SledFailuresOnly policy to be reincarnated depending on why they were marked as Failed.
  • A cooldown period has elapsed since the last time this instance was automatically restarted. To avoid hot loops when an instance crashes as soon as it boots, we require that a cooldown period has elapsed since the last automatic restart attempt. Currently, this period is always one minute, configured in the Nexus config file. In the future, we may want to consider making this cooldown period a project-level configuration, and we may additionally want to consider increasing the backoff duration on consecutive failures.
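
Putting the three conditions above together, here is a minimal sketch of the eligibility check in Rust. The types and field names are simplified stand-ins (the real check is the DataStore::find_reincarnatable_instances database query plus logic in the background task), so treat it as an illustration rather than the actual implementation:

```rust
use chrono::{DateTime, Duration, Utc};

// Simplified, hypothetical mirrors of the relevant instance-record fields.
enum InstanceState { Failed, Vmm }
enum InstanceAutoRestart { SledFailuresOnly, AllFailures }

struct Instance {
    state: InstanceState,
    auto_restart_policy: InstanceAutoRestart,
    time_last_auto_restarted: Option<DateTime<Utc>>,
}

/// Sketch of the three eligibility conditions described above.
fn eligible_for_reincarnation(
    instance: &Instance,
    cooldown: Duration,
    now: DateTime<Utc>,
) -> bool {
    // 1. The instance record itself must be Failed (no active VMM awaiting cleanup).
    let failed = matches!(instance.state, InstanceState::Failed);
    // 2. Only AllFailures is honored today; SledFailuresOnly needs failure-reason
    //    tracking that doesn't exist yet.
    let policy_permits =
        matches!(instance.auto_restart_policy, InstanceAutoRestart::AllFailures);
    // 3. The cooldown since the last automatic restart attempt must have elapsed.
    let cooled_down = match instance.time_last_auto_restarted {
        Some(last) => now - last >= cooldown,
        None => true,
    };
    failed && policy_permits && cooled_down
}
```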

The implementation of the actual reincarnation background task is relatively straightforward: it runs a new DataStore::find_reincarnatable_instances query, which returns instances in the Failed state, and it spawns instance-start sagas for any instances which satisfy the above conditions. The restart cooldown is implemented by adding a new time_last_auto_restarted field to the instance table and to the InstanceRuntimeState type. We add a Reason enum to the instance-start saga's Params that indicates why the instance is being started, and if it is Reason::AutoRestart, then the start saga will set the time_last_auto_restarted timestamp when moving the instance record to the Starting state. Additionally, I've added the auto_restart_policy field to the params::InstanceCreate type in the Nexus API so that instances can be created with auto-restart policies. Reviewers should note that many of the files changed in this diff were only modified by adding that field to a params::InstanceCreate literal.
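
To illustrate the shape of that change (the names and derives below are hypothetical simplifications, not the exact omicron types), the start saga's parameters might look roughly like this, with the timestamp only stamped for automatic restarts:

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Sketch of the `Reason` added to the instance-start saga's `Params`.
#[derive(Clone, Copy, Debug, Deserialize, Serialize)]
enum Reason {
    /// The start was requested explicitly (e.g. by a user).
    User,
    /// The start was triggered by the instance_reincarnation background task.
    AutoRestart,
}

#[derive(Clone, Debug, Deserialize, Serialize)]
struct Params {
    instance_id: Uuid,
    reason: Reason,
    // ...the rest of the start-saga parameters elided...
}

/// Sketch: when the saga moves the instance record to Starting, only an
/// auto-restart updates time_last_auto_restarted.
fn new_time_last_auto_restarted(
    reason: Reason,
    now: DateTime<Utc>,
) -> Option<DateTime<Utc>> {
    match reason {
        Reason::AutoRestart => Some(now),
        Reason::User => None,
    }
}
```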

This branch adds a minimal, initial implementation of instance auto-restarting. There's plenty of follow-up work to do when this lands:

  • Project-level configuration for auto-restart behavior. It would be nice to be able to set a default auto-restart policy at the project level, which would determine the behavior for instances which don't have explicitly set auto-restart policies. Furthermore, we may want to allow setting different cooldown periods on a per-project basis. Issue Implement Project-level defaults #1015 describes a proposal for project-level defaults, so implementing these defaults depends on that work --- which I'm happy to work on if it seems like a good next step!
  • Tracking why an instance/VMM was marked as Failed. In order to actually implement the SledFailuresOnly instance auto-restart policy, we need to...actually know when an instance was marked as Failed due to a sled-wide failure rather than an instance-specific failure. In some cases, such as when a VMM is marked as Failed due to a sled-agent returning a 404 NO_SUCH_INSTANCE error indicating that it is no longer aware of the VMM, determining this is fairly complex: was the instance forgotten because the sled-agent (or the entire sled) restarted, or was it forgotten because its individual VMM was restarted? But, in other cases, like when a VMM was marked Failed because it was on an Expunged sled (see [nexus] Mark VMMs on Expunged sleds as Failed #6519), the failure is clearly sled-wide. We should record why a VMM/instance was marked Failed, at least when it's possible to do so, so that the failure reason could be considered by the auto-restart policy.
  • Exposing auto-restart behavior in the UI. Presumably, we want to allow users to select an auto-restart policy when creating a new instance. We may also want to differentiate between instances that are in the Starting state because they were explicitly started by the user, and instances that are automatically restarting due to a failure in the UI, and we may also want to show the cooldown time when an instance is in the Failed state and is waiting to be restarted again. If @david-crespo or @charliepark want to chat about this, I'm happy to help out!

Fixes #5553

@hawkw hawkw force-pushed the eliza/boot-on-fault-isnt-just-a-bool branch from f42879c to 24b49e5 on September 1, 2024 20:17
@hawkw hawkw force-pushed the eliza/instance-resurrection branch from ad60b37 to d520558 on September 1, 2024 20:17
@hawkw hawkw force-pushed the eliza/instance-resurrection branch 2 times, most recently from 1e94b6b to 15d272f on September 2, 2024 18:25
hawkw added a commit that referenced this pull request Sep 2, 2024
[RFD 486] describes a principled design for how to use the Failed state
for instances and VMMs. In particular, it proposes that:

- Instances and VMMs should move to the `Failed` state independently.
- A VMM goes to `Failed` if either:
  + Nexus observes that a VMM it thought was resident on a sled is no
    longer resident there, or,
  + it observes through some other means (sled expungement, sled reboot)
    that all VMMs that were resident on that sled must now be gone.
- Instances go to `Failed` when an `instance-update` saga observes that
  an instance's active VMM has been marked `Failed`. An update saga
  transitioning an instance to failed cleans up the instance's resource
  allocations, similarly to how update sagas handle transitioning an
  instance to `Destroyed` when its active VMM has been destroyed.
- For purposes of the external API:
  + An instance that is itself in the `Failed` state appears externally
    to be `Failed`.
  + If an instance in the "has-VMM" internal state and points to a
    `Failed` VMM, the instance is `Stopping` and then transitions to
    `Failed`.
  + Once the actual instance (not just its active VMM) is `Failed`, it
    can be started or destroyed.

This branch implements the behavior described above.
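
For illustration, the external-state mapping above boils down to something like the following (simplified, hypothetical types; the real conversion lives in Nexus's DB-to-external view code and covers many more combinations):

```rust
// Simplified states for illustration only.
enum DbInstanceState { NoVmm, Vmm, Failed, Destroyed }
enum DbVmmState { Starting, Running, Stopping, Destroyed, Failed }
enum ExternalState { Stopped, Running, Stopping, Failed, Destroyed }

/// What the external API reports for an instance, given its record and its
/// active VMM's state (if it has one).
fn external_state(
    instance: &DbInstanceState,
    active_vmm: Option<&DbVmmState>,
) -> ExternalState {
    match (instance, active_vmm) {
        // The instance record itself is Failed: report Failed. At this point
        // it may be started again or destroyed.
        (DbInstanceState::Failed, _) => ExternalState::Failed,
        // The instance still points at a Failed VMM: report Stopping until an
        // update saga cleans up the VMM and moves the instance to Failed.
        (DbInstanceState::Vmm, Some(DbVmmState::Failed)) => ExternalState::Stopping,
        // ...all other combinations elided for brevity...
        _ => ExternalState::Stopped,
    }
}
```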

In particular, I've added code to the `instance-update` saga to handle
instances whose active or migration target VMMs have transitioned to
`Failed`, similarly to how it handles the VMMs transitioning to
`Destroyed`. I've changed the code that detects whether a sled-agent
error indicates that an instance has failed to only do so when the error
is a 404 with the specific error code that indicates that the sled-agent
has forgotten about a VMM that it previously knew about, and changed the
Nexus functions that ask sled-agents to update an instance's state, and
the `instance_watcher` background task, to mark instances as `Failed`
only when they encounter the appropriate sled-agent error. And, I've
changed the `mark_instance_failed` function in Nexus to instead mark the
VMM record as failed and trigger an instance-update saga (and it's now
named `mark_vmm_failed`), and made similar changes to the
`instance_watcher` background task. Finally, I've also made sure that
`Failed` instances can be destroyed and restarted, and that instances
with `Failed` active VMMs appear to be `Stopping` until the VMM is
cleaned up by an update saga.

In addition to this, it was necessary to (re)introduce a
`VmmState::Creating` variant. The purpose of this variant is to
differentiate VMMs which have been created in the database but do not
yet exist on a sled from those which are known to sled-agent and are
actually starting/migrating.

This is necessary because presently, the `instance_watcher` background
task will attempt to perform health checks for `Starting` and
`Migrating` VMMs, and --- if the sled-agent does not actually know about
the VMM yet --- will move it to `Failed`. But a newly created VMM is
not known to the sled-agent until it has been registered, so if the
`instance_watcher` task
activates while an `instance-start` or `instance-migrate` saga is in
flight, the `instance-watcher` task may incorrectly move its VMM to
`Failed` if the VMM record has been created but not yet registered. Now,
after adding the `Creating` variant, we can avoid performing health
checks for those VMMs until the sled-agent actually acknowledges a
request to register the VMM.
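
As a rough illustration of that gating (a hypothetical helper, not the actual `instance_watcher` code), the idea is simply that VMMs the sled-agent has not yet acknowledged are excluded from the checks that can mark them `Failed`:

```rust
// Simplified VMM states for illustration; the real enum has more variants.
enum VmmState {
    Creating,
    Starting,
    Running,
    Migrating,
    Destroyed,
    Failed,
}

/// Sketch: should a "sled-agent doesn't know this VMM" response be treated
/// as a failure for this VMM?
fn eligible_for_health_check(state: &VmmState) -> bool {
    match state {
        // The record exists in CRDB but the VMM hasn't been registered with a
        // sled-agent yet, so the sled-agent not knowing about it is expected.
        VmmState::Creating => false,
        // Once registration has been acknowledged, a 404 for the VMM really
        // does indicate something went wrong.
        VmmState::Starting | VmmState::Running | VmmState::Migrating => true,
        _ => false,
    }
}
```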

Besides the behavior implemented here, RFD 486 also proposes that the
`boot_on_fault` field in the instance record be used by Nexus to
determine whether to automatically attempt to restart a `Failed`
instance. This is described in issues #6491 and #4872. I'm planning
to implement this in a separate branch, PR #6503.

[RFD 486]: https://rfd.shared.oxide.computer/rfd/0486#_determinations
@hawkw hawkw force-pushed the eliza/boot-on-fault-isnt-just-a-bool branch from 24b49e5 to 9ca4c5e on September 2, 2024 18:43
@hawkw hawkw force-pushed the eliza/instance-resurrection branch from d3ae274 to e061959 on September 2, 2024 19:03
@hawkw hawkw force-pushed the eliza/instance-resurrection branch from e061959 to 3215e3f on September 4, 2024 16:30
Base automatically changed from eliza/boot-on-fault-isnt-just-a-bool to main September 4, 2024 20:30
@hawkw hawkw force-pushed the eliza/instance-resurrection branch 2 times, most recently from 6e11398 to ba41dda on September 5, 2024 17:20
hawkw added a commit that referenced this pull request Sep 7, 2024
While I was working on #6503, I became quite taken with the way @jmpesp
had done the OMDB status for his region-replacement background tasks ---
the approach of defining a struct in `nexus-types` representing the
status JSON that both the background task implementation and OMDB could
depend on seems much nicer than constructing an untyped JSON message and
then deserializing it to a type that only exists in OMDB, because this
way, OMDB and the background task implementation are kept in sync
automatically. So, I thought it would be nice to also rework the
`abandoned_vmm_reaper` task to use this style.

This commit does that. While I was making this change, I also changed
the status JSON to include a list of all errors that occurred during the
task's activation, instead of just the last one, which will probably
make it easier to debug any issues that come up. I also added some code
for aligning the numeric columns in the OMDB output.
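
The pattern being described (a shared, typed status struct rather than ad-hoc JSON) looks roughly like this; the struct and field names here are illustrative guesses, not the exact ones in `nexus-types`:

```rust
use serde::{Deserialize, Serialize};

/// Sketch of a background-task status type that lives in a shared crate, so
/// the task (which serializes it) and OMDB (which deserializes it) cannot
/// drift out of sync.
#[derive(Debug, Default, Deserialize, Serialize)]
pub struct AbandonedVmmReaperStatus {
    pub vmms_found: usize,
    pub vmms_deleted: usize,
    /// Every error seen during this activation, not just the most recent one.
    pub errors: Vec<String>,
}
```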
@hawkw hawkw changed the title [WIP][nexus] Reincarnate failed instances [nexus] Reincarnate Failed instances Sep 10, 2024
@hawkw (Member, Author) commented Sep 10, 2024

I intend to do some manual testing on london to make sure this works end-to-end before I'm comfortable merging it. But I'm going to go ahead and mark the PR as "ready for review" to start getting eyes on it in the meantime.

@hawkw hawkw marked this pull request as ready for review September 10, 2024 17:27
@hawkw hawkw self-assigned this Sep 10, 2024
@hawkw (Member, Author) commented Sep 10, 2024

I screwed up the test here a bit and am in the process of redoing it, please hold...

@hawkw (Member, Author) commented Sep 10, 2024

OH YEAH IT WORKS!

Creating an instance (using the manual api command so that I can add an auto-restart policy, which the CLI does not know about yet):

eliza@theseus ~/Code/oxide/oxide.rs $ ./cli.sh api '/v1/instances?project=test' --method POST --input - <<EOF
{
    "name": "restartme",
    "description": "my cool instance",
    "hostname": "restartme",
    "memory": 34359738368,
    "ncpus": 2,
    "disks": [
        {
            "type": "attach",
            "name": "restartme-boot"
        }
    ],
    "network_interfaces": {
        "type": "default"
    },
    "external_ips": [
        {
            "type": "ephemeral",
            "pool_name": "default"
        }
    ],
    "user_data": "",
    "start": true,
    "auto_restart_policy": { "type": "all_failures" }
}
EOF
{
  "description": "my cool instance",
  "hostname": "restartme",
  "id": "76d6b13a-58f7-4b80-adcd-96ff3510485e",
  "memory": 34359738368,
  "name": "restartme",
  "ncpus": 2,
  "project_id": "0610d1e0-36f6-4530-b44d-b283189676eb",
  "run_state": "starting",
  "time_created": "2024-09-10T21:53:31.913311Z",
  "time_modified": "2024-09-10T21:53:31.913311Z",
  "time_run_state_updated": "2024-09-10T21:53:34.885879Z"
}

The instance has started, and we can see it in omdb db instances:

eliza@theseus ~/Code/oxide/oxide.rs $ ssh londonswitch
The illumos Project     helios-2.0.22860        September 2024
root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (97.0.0)
ID                                   NAME      STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
1c769a92-6769-4c9e-bb62-3fadee30de01 normal    running a994a91a-a0b0-4159-ad57-7131f1853671 d5199739-9853-4fdf-a3a5-aefef4b5e5d2 BRM44220007
76d6b13a-58f7-4b80-adcd-96ff3510485e restartme running f6d80dd9-1e9c-4ef7-a5ae-9b805e8e1aab cf8ee87e-0858-48e9-964a-44a705113b1c BRM42220036

I pull the (metaphorical) plug on the sled that instance is running on:

root@oxz_switch1:~# pilot sp cycle BRM42220036
14  BRM42220036        ok

I was hoping to catch all the state transitions here, but unfortunately, it turns out that omdb db instances times out a lot when you kill a sled that happens to have a CRDB replica on it?

root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable


WARN: failed to query schema version: Service Unavailable: Failed to access DB connection: Request timed out
It's possible the database is running a version that's different from what this
tool understands.  This may result in errors or incorrect output.
^X^C

But, we can still see it come back on a different sled (and I caught the RUNNING -> STOPPING -> STARTING -> RUNNING transition in the web UI! :D):

root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (97.0.0)
ID                                   NAME      STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
76d6b13a-58f7-4b80-adcd-96ff3510485e restartme running 876279ed-496c-4793-9cf4-7cbe9a9658fd d5199739-9853-4fdf-a3a5-aefef4b5e5d2 BRM44220007
1c769a92-6769-4c9e-bb62-3fadee30de01 normal    running a994a91a-a0b0-4159-ad57-7131f1853671 d5199739-9853-4fdf-a3a5-aefef4b5e5d2 BRM44220007
root@oxz_switch1:~#

labbott and others added 7 commits September 21, 2024 12:31
Connect over the bootstrap network with sprockets tls
Change the process of soft-deleting a volume from a CTE into a
transaction. This is done to make changes easier to apply and review.

One part of the CTE that remains is the query fragment that
conditionally reduces volume references for region_snapshot rows, then
returns only the rows updated. It was easier to keep this than figure
out how to get it to work in diesel!
Create new enum types and return those to give more information to
callers:

- `create_region_snapshot_replacement_step` and
`insert_region_snapshot_replacement_step` now return
`InsertRegionSnapshotReplacementStepResult`

- `volume_replace_region` and `volume_replace_snapshot` now return
`VolumeReplaceResult`

Notably, `VolumeReplaceResult::ExistingVolumeDeleted` replaces the
previous error type `TargetVolumeDeleted`, and is not an error, allowing
the caller to take action if the existing volume was deleted.

This commit was peeled off work in progress to address #6353.
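
The shape of one of those result enums might look roughly like this (a sketch based only on the names mentioned above; the real variant set is richer):

```rust
/// Sketch: a deleted target volume is reported as an ordinary outcome the
/// caller can act on, rather than as the old `TargetVolumeDeleted` error.
#[derive(Debug, PartialEq)]
pub enum VolumeReplaceResult {
    /// The region or snapshot reference was replaced in the volume.
    Done,
    /// The volume to modify no longer exists; the caller decides what to do.
    ExistingVolumeDeleted,
}
```
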
Presently, `omdb` can be used to query the database for instance records
using the `omdb instances` command. However, this command does not
output all the database state associated with an instance. Instead, it
displays the instance's name, UUID, effective state, active Propolis
UUID, sled UUID, and sled serial (the last three only if the instance is
running).

This is some of the most useful information about a customer instance,
but it's not *all* of the information we have about it. Debugging issues
related to instance state management may require other information from
the instance record, as well as from the instance's active VMM record,
and its target VMM and active migration records (if it is migrating).
Adding all of these additional details to the `omdb db instances`
command is a bit of a non-starter, though. Since that command outputs a
table of all instances, with one line per instance, the output is
already *much* wider than 80 columns, and requires a pretty wide
terminal to be readable --- once the table has line wrapped, it's much
harder to interpret. Therefore, I've added an `omdb db instance info`
command that outputs a complete dump of the state of a *single*
instance. Since this command is no longer constrained to putting all
salient details about the instance on a single line in a table, we can
show every field in the instance's DB representation without having to
worry about wrapping.

Output from `omdb db instance info` looks like this (yes, I spent
entirely too long tweaking the formatting 😅):

```console
root@oxz_switch1:~# /var/tmp/omdb-eliza-test-15 db instance show fff916c4-814a-48a0-a04a-f887e6f81a9e
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (98.0.0)

INSTANCE
                     ID: fff916c4-814a-48a0-a04a-f887e6f81a9e
             project ID: fe0da422-5c48-4b52-8010-f2fc401f090f
                   name: app-13
            description: application instance
             created at: 2024-06-22 00:31:43.889412 UTC
       last modified at: 2024-06-22 00:31:43.889412 UTC
/!\          deleted at: 2024-06-22 00:43:59.156149 UTC

CONFIGURATION
                  vCPUs: 1
                 memory: 2 GiB
               hostname: node13
    auto-restart policy: None

RUNTIME STATE
            nexus state: Destroyed
(i)  external API state: Destroyed
        last updated at: 2024-06-22T00:43:49.440274Z (generation 5)
          active VMM ID: None
          target VMM ID: None
           migration ID: None
           updater lock: UNLOCKED at generation: 1
root@oxz_switch1:~# /var/tmp/omdb-eliza-test-15 db instance show 480c6167-dac2-4df7-808d-c9e776c1c312
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (98.0.0)

INSTANCE
                     ID: 480c6167-dac2-4df7-808d-c9e776c1c312
             project ID: fe0da422-5c48-4b52-8010-f2fc401f090f
                   name: mysql-3
            description: MySQL Instance
             created at: 2024-09-16 00:43:10.264333 UTC
       last modified at: 2024-09-16 00:43:10.264333 UTC

CONFIGURATION
                  vCPUs: 4
                 memory: 16 GiB
               hostname: mysql-3
    auto-restart policy: None

RUNTIME STATE
            nexus state: Vmm
(i)  external API state: Stopping
        last updated at: 2024-09-19T23:15:50.698207Z (generation 5)
          active VMM ID: Some(51babee3-0ee5-4f67-bcbb-2189a39479a3)
          target VMM ID: None
           migration ID: None
           updater lock: UNLOCKED at generation: 4

      active VMM record:
          Vmm {
              id: 51babee3-0ee5-4f67-bcbb-2189a39479a3,
              time_created: 2024-09-19T23:15:50.647454Z,
              time_deleted: None,
              instance_id: 480c6167-dac2-4df7-808d-c9e776c1c312,
              sled_id: 71def415-55ad-46b4-ba88-3ca55d7fb287,
              propolis_ip: V6(
                  Ipv6Network {
                      addr: fd00:1122:3344:10b::1:4be,
                      prefix: 128,
                  },
              ),
              propolis_port: SqlU16(
                  12400,
              ),
              runtime: VmmRuntimeState {
                  time_state_updated: 2024-09-19T23:18:03.015643Z,
                  gen: Generation(
                      Generation(
                          4,
                      ),
                  ),
                  state: Stopping,
              },
          }
root@oxz_switch1:~#
```
Adds a module in `oximeter_db` that speaks the native ClickHouse
protocol, a message-oriented protocol directly over TCP. This includes
the core messaging and serde needed for making a handshake with the
server, and then running most standard SQL queries.

This uses the codecs from `tokio_util` to turn the TCP stream into a
sink / stream of messages, which is much more convenient to operate on.
The client and server exchange a handshake, after which the client may
run SQL queries to select data.

There are a number of tests in the module, including low-level
serialization checks and asserting the results of actual SQL queries. In
addition, this adds a barebones SQL shell -- this is not intended to
replace the feature-rich official CLI, but is very useful for testing
the protocol implementation by running arbitrary queries.

This module is not consumed anywhere beyond the shell itself. A
follow-up commit will integrate it into the existing
`oximeter_db::Client` object, with the primary goal of making OxQL
queries much more flexible and efficient.
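
The tokio_util codec pattern mentioned above looks roughly like the following sketch. The packet types and codec here are placeholders standing in for the real ClickHouse native framing (this assumes the tokio, tokio-util, and bytes crates), and the decoder does no real parsing:

```rust
use bytes::BytesMut;
use tokio::net::TcpStream;
use tokio_util::codec::{Decoder, Encoder, Framed};

// Placeholder message types standing in for the ClickHouse client/server packets.
struct ClientPacket(Vec<u8>);
struct ServerPacket(Vec<u8>);

// Placeholder codec; the real one implements the native protocol's framing
// (packet type prefixes, headers, column blocks, and so on).
struct NativeCodec;

impl Encoder<ClientPacket> for NativeCodec {
    type Error = std::io::Error;
    fn encode(&mut self, item: ClientPacket, dst: &mut BytesMut) -> Result<(), Self::Error> {
        dst.extend_from_slice(&item.0);
        Ok(())
    }
}

impl Decoder for NativeCodec {
    type Item = ServerPacket;
    type Error = std::io::Error;
    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Self::Item>, Self::Error> {
        if src.is_empty() {
            return Ok(None); // wait for more bytes from the server
        }
        Ok(Some(ServerPacket(src.split().to_vec())))
    }
}

/// Wrap a TCP connection so the rest of the client can treat it as a
/// `Sink` of client packets and a `Stream` of server packets.
async fn connect(addr: &str) -> std::io::Result<Framed<TcpStream, NativeCodec>> {
    let stream = TcpStream::connect(addr).await?;
    Ok(Framed::new(stream, NativeCodec))
}
```
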
A follow up PR will use this for clickhouse keeper allocation.
@hawkw hawkw merged commit c2e718f into main Sep 23, 2024
17 checks passed
@hawkw hawkw deleted the eliza/instance-resurrection branch September 23, 2024 18:13
hawkw added a commit that referenced this pull request Sep 23, 2024
Currently, the `instance-update` saga's `siu_commit_instance_updates`
action attempts to check if an additional update saga is needed and start
one if so. This is done in the same action that writes back the updated
instance record. In [this comment][1] on PR #6503, @davepacheco pointed
out that conflating these two operations makes the idempotency tests for
the update saga less effective, since the action performs multiple
operations (even if some of them are permitted to fail). This commit
refactors the instance-update saga so that the chaining operation is
performed in a separate action from the rest of the saga.

There should be no functional change to the saga's behavior as a result
of this, beyond making the idempotency tests more correct.

[1]:
#6503 (comment)
@davepacheco davepacheco added this to the 11 milestone Sep 24, 2024
hawkw added a commit that referenced this pull request Sep 24, 2024
PR #6503 changed Nexus to attempt to automatically restart instances
which are in the `Failed` state. Now that we do this, we should probably
change the allowable instance state transitions to permit a user to stop
an instance that is `Failed`, as a way to say "stop trying to restart
this instance" (as `Stopped` instances are not restarted). This branch
changes `Nexus::instance_request_state` and
`select_instance_change_action` to permit stopping a `Failed` instance.

Fixes #6640

I believe this also fixes #2825, along with #6455 (which allowed
restarting `Failed` instances).
hawkw added a commit that referenced this pull request Sep 27, 2024
When an `instance-start` saga unwinds, any VMM it created transitions to
the `SagaUnwound` state. This causes the instance's effective state to
appear as `Failed` in the external API. PR #6503 added functionality to
Nexus to automatically restart instances that are in the `Failed` state
("instance reincarnation"). However, the current instance-reincarnation
task will _not_ automatically restart instances whose instance-start
sagas have unwound, because such instances are not actually in the
`Failed` state from Nexus' perspective.

This PR implements reincarnation for instances whose `instance-start`
sagas have failed. This is done by changing the `instance_reincarnation`
background task to query the database for instances which have
`SagaUnwound` active VMMs, and then run `instance-start` sagas for them
identically to how it runs start sagas for `Failed` instances.

I decided to perform two separate queries to list `Failed` instances and
to list instances with `SagaUnwound` VMMs, because the `SagaUnwound`
query requires a join with the `vmm` table, and I thought it was a bit
nicer to be able to find `Failed` instances without having to do the
join, and only do it when looking for `SagaUnwound` ones. Also, having
two queries makes it easier to distinguish between `Failed` and
`SagaUnwound` instances in logging and the OMDB status output. This
ended up being implemented by adding a parameter to the
`DataStore::find_reincarnatable_instances` method that indicates which
category of instances to select; I had previously considered making the
method on the `InstanceReincarnation` struct that finds instances and
reincarnates them take the query as a `Fn` taking the datastore and 
`DataPageParams` and returning an `impl Future` outputting
`Result<Vec<Instance>, ...>`, but figuring out generic lifetimes for the
pagination stuff was annoying enough that this felt like the simpler
choice.

Fixes #6638
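
The extra parameter described above might be shaped roughly like this (the names are hypothetical; only the two categories come from the description):

```rust
/// Sketch: which flavor of restartable instance a query pass is looking for,
/// so the two cases stay distinguishable in logs and OMDB output.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ReincarnatableInstanceKind {
    /// Instances whose own record is in the `Failed` state.
    Failed,
    /// Instances whose active VMM is `SagaUnwound` (a failed start saga).
    SagaUnwound,
}
```
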
hawkw added a commit that referenced this pull request Oct 1, 2024
This commit extends the `instance-reconfigure` API endpoint added in
#6585 to also allow setting instance auto-restart policies (as added in
#6503).

I've also added the actual auto-restart policy to the external API
instance view, along with the boolean `auto_restart_enabled` added in
#6503. This way, it's possible to change just the boot disk by providing
the current auto-restart policy in an instance POST.