[nexus] Reincarnate Failed instances #6503
[RFD 486] describes a principled design for how to use the Failed state for instances and VMMs. In particular, it proposes that:

- Instances and VMMs should move to the `Failed` state independently.
- A VMM goes to `Failed` if either:
  + Nexus observes that a VMM it thought was resident on a sled is no longer resident there, or,
  + it observes through some other means (sled expungement, sled reboot) that all VMMs that were resident on that sled must now be gone.
- Instances go to `Failed` when an `instance-update` saga observes that an instance's active VMM has been marked `Failed`. An update saga transitioning an instance to `Failed` cleans up the instance's resource allocations, similarly to how update sagas handle transitioning an instance to `Destroyed` when its active VMM has been destroyed.
- For purposes of the external API (see the sketch below):
  + An instance that is itself in the `Failed` state appears externally to be `Failed`.
  + If an instance is in the "has-VMM" internal state and points to a `Failed` VMM, the instance appears to be `Stopping` and then transitions to `Failed`.
  + Once the actual instance (not just its active VMM) is `Failed`, it can be started or destroyed.

This branch implements the behavior described above. In particular, I've added code to the `instance-update` saga to handle instances whose active or migration target VMMs have transitioned to `Failed`, similarly to how it handles VMMs transitioning to `Destroyed`. I've changed the code that detects whether a sled-agent error indicates that an instance has failed to only do so when the error is a 404 with the specific error code indicating that the sled-agent has forgotten about a VMM it previously knew about, and changed the Nexus functions that ask sled-agents to update an instance's state, as well as the `instance_watcher` background task, to mark instances as `Failed` only when they encounter that specific sled-agent error. And, I've changed the `mark_instance_failed` function in Nexus to instead mark the VMM record as failed and trigger an instance-update saga (it's now named `mark_vmm_failed`), and made similar changes to the `instance_watcher` background task. Finally, I've also made sure that `Failed` instances can be destroyed and restarted, and that instances with `Failed` active VMMs appear to be `Stopping` until the VMM is cleaned up by an update saga.

In addition to this, it was necessary to (re)introduce a `VmmState::Creating` variant. The purpose of this variant is to differentiate VMMs which have been created in the database but do not yet exist on a sled from those which are known to sled-agent and are actually starting or migrating. This is necessary because, presently, the `instance_watcher` background task will attempt to perform health checks for `Starting` and `Migrating` VMMs, and --- if the sled-agent does not actually know about the VMM yet --- will move them to `Failed`. But a newly created VMM is not known to the sled-agent until it has been registered, so if the `instance_watcher` task activates while an `instance-start` or `instance-migrate` saga is in flight, it may incorrectly move that saga's VMM to `Failed` if the VMM record has been created but not yet registered. With the `Creating` variant, we can avoid performing health checks for those VMMs until the sled-agent actually acknowledges a request to register the VMM.
Besides the behavior implemented here, RFD 486 also proposes that the `boot_on_fault` field in the instance record be used by Nexus to determine whether to automatically attempt to restart a `Failed` instance. This is described in issues #6491 and #4872. I'm planning to implement this in a separate branch, PR #6503.

[RFD 486]: https://rfd.shared.oxide.computer/rfd/0486#_determinations
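To make the external-API mapping above concrete, here's a rough sketch; the enums and the mapping function are simplified stand-ins for the real Nexus types, not the actual code:

```rust
// Simplified, hypothetical stand-ins for the real Nexus state types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InstanceState { Creating, Vmm, NoVmm, Failed, Destroyed }

#[derive(Debug, Clone, Copy, PartialEq)]
enum VmmState { Creating, Starting, Running, Stopping, Stopped, Migrating, Failed, Destroyed }

#[derive(Debug, Clone, Copy, PartialEq)]
enum ExternalState { Creating, Starting, Running, Stopping, Stopped, Migrating, Failed, Destroyed }

/// Sketch of the mapping described above: an instance that points at a
/// `Failed` VMM appears to be `Stopping` until an update saga has cleaned the
/// VMM up and moved the instance record itself to `Failed`. The non-Failed
/// arms are illustrative only.
fn external_state(instance: InstanceState, active_vmm: Option<VmmState>) -> ExternalState {
    match (instance, active_vmm) {
        // An instance that is itself Failed appears Failed externally.
        (InstanceState::Failed, _) => ExternalState::Failed,
        (InstanceState::Destroyed, _) => ExternalState::Destroyed,
        // "Has-VMM" state pointing at a Failed VMM: show Stopping until the
        // instance-update saga runs and marks the instance itself Failed.
        (InstanceState::Vmm, Some(VmmState::Failed)) => ExternalState::Stopping,
        // A Creating VMM is not yet registered with a sled-agent; externally
        // the instance still looks like it is starting (illustrative).
        (InstanceState::Vmm, Some(VmmState::Creating | VmmState::Starting)) => ExternalState::Starting,
        (InstanceState::Vmm, Some(VmmState::Running)) => ExternalState::Running,
        (InstanceState::Vmm, Some(VmmState::Stopping)) => ExternalState::Stopping,
        (InstanceState::Vmm, Some(VmmState::Migrating)) => ExternalState::Migrating,
        (InstanceState::Vmm, Some(VmmState::Stopped | VmmState::Destroyed))
        | (InstanceState::Vmm, None)
        | (InstanceState::NoVmm, _) => ExternalState::Stopped,
        (InstanceState::Creating, _) => ExternalState::Creating,
    }
}
```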
While I was working on #6503, I became quite taken with the way @jmpesp had done the OMDB status for his region-replacement background tasks --- the approach of defining a struct in `nexus-types` representing the status JSON that both the background task implementation and OMDB could depend on seems much nicer than constructing an untyped JSON message and then deserializing it to a type that only exists in OMDB, because this way, OMDB and the background task implementation are kept in sync automatically. So, I thought it would be nice to also rework the `abandoned_vmm_reaper` task to use this style. This commit does that. While I was making this change, I also changed the status JSON to include a list of all errors that occurred during the task's activation, instead of just the last one, which will probably make it easier to debug any issues that come up. I also added some code for aligning the numeric columns in the OMDB output.
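For anyone skimming, the pattern looks roughly like this (the struct and field names are hypothetical, not the exact type in `nexus-types`):

```rust
// Hypothetical status type, shared via `nexus-types`, that both the
// background task (which serializes it) and OMDB (which deserializes it)
// depend on, so the two cannot drift apart.
use serde::{Deserialize, Serialize};

#[derive(Debug, Default, Serialize, Deserialize)]
pub struct AbandonedVmmReaperStatus {
    /// Total number of abandoned VMMs found during this activation.
    pub vmms_found: usize,
    /// Number of sled resource reservations deleted.
    pub sled_reservations_deleted: usize,
    /// Number of VMM records marked as deleted.
    pub vmms_deleted: usize,
    /// All errors that occurred during this activation, not just the last one.
    pub errors: Vec<String>,
}
```

The task serializes this struct into its activation status JSON, and `omdb` deserializes the same type, so both sides always agree on the shape of the data.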
I intend to do some manual testing on this.
I screwed up the test here a bit and am in the process of redoing it, please hold...
OH YEAH IT WORKS! Creating an instance (using a manual API call):

```console
eliza@theseus ~/Code/oxide/oxide.rs $ ./cli.sh api '/v1/instances?project=test' --method POST --input - <<EOF
{
  "name": "restartme",
  "description": "my cool instance",
  "hostname": "restartme",
  "memory": 34359738368,
  "ncpus": 2,
  "disks": [
    {
      "type": "attach",
      "name": "restartme-boot"
    }
  ],
  "network_interfaces": {
    "type": "default"
  },
  "external_ips": [
    {
      "type": "ephemeral",
      "pool_name": "default"
    }
  ],
  "user_data": "",
  "start": true,
  "auto_restart_policy": { "type": "all_failures" }
}
EOF
{
  "description": "my cool instance",
  "hostname": "restartme",
  "id": "76d6b13a-58f7-4b80-adcd-96ff3510485e",
  "memory": 34359738368,
  "name": "restartme",
  "ncpus": 2,
  "project_id": "0610d1e0-36f6-4530-b44d-b283189676eb",
  "run_state": "starting",
  "time_created": "2024-09-10T21:53:31.913311Z",
  "time_modified": "2024-09-10T21:53:31.913311Z",
  "time_run_state_updated": "2024-09-10T21:53:34.885879Z"
}
```

The instance has started, and we can see it in `omdb`:

```console
eliza@theseus ~/Code/oxide/oxide.rs $ ssh londonswitch
The illumos Project     helios-2.0.22860        September 2024
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (97.0.0)
ID                                   NAME      STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
1c769a92-6769-4c9e-bb62-3fadee30de01 normal    running a994a91a-a0b0-4159-ad57-7131f1853671 d5199739-9853-4fdf-a3a5-aefef4b5e5d2 BRM44220007
76d6b13a-58f7-4b80-adcd-96ff3510485e restartme running f6d80dd9-1e9c-4ef7-a5ae-9b805e8e1aab cf8ee87e-0858-48e9-964a-44a705113b1c BRM42220036
```

I pull the (metaphorical) plug on the sled that instance is running on:

```console
root@oxz_switch1:~# pilot sp cycle BRM42220036
14  BRM42220036        ok
```

Was hoping to actually catch all the state transitions here, but unfortunately, it turns out that `omdb` couldn't reach the database for a little while right after the power cycle:

```console
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
WARN: failed to query schema version: Service Unavailable: Failed to access DB connection: Request timed out
It's possible the database is running a version that's different from what this
tool understands. This may result in errors or incorrect output.
^X^C
```

But, we can still see it come back, on a different sled (and, I caught the ...):

```console
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (97.0.0)
ID                                   NAME      STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
76d6b13a-58f7-4b80-adcd-96ff3510485e restartme running 876279ed-496c-4793-9cf4-7cbe9a9658fd d5199739-9853-4fdf-a3a5-aefef4b5e5d2 BRM44220007
1c769a92-6769-4c9e-bb62-3fadee30de01 normal    running a994a91a-a0b0-4159-ad57-7131f1853671 d5199739-9853-4fdf-a3a5-aefef4b5e5d2 BRM44220007
root@oxz_switch1:~#
```
Connect over the bootstrap network with sprockets tls
Change the process of soft-deleting a volume from a CTE into a transaction. This is done to make changes easier to apply and review. One part of the CTE that remains is the query fragment that conditionally reduces volume references for region_snapshot rows, then returns only the rows updated. It was easier to keep this than figure out how to get it to work in diesel!
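For context, a minimal sketch of what moving from the CTE to a transaction looks like, using plain synchronous Diesel and a made-up `volume` table; Omicron's real datastore code uses its async connection helpers and different columns, so treat this purely as an illustration:

```rust
use chrono::Utc;
use diesel::prelude::*;

// Hypothetical table definition for illustration only.
diesel::table! {
    volume (id) {
        id -> Uuid,
        time_deleted -> Nullable<Timestamptz>,
        data -> Text,
    }
}

/// Sketch: every step of the soft-delete runs in one transaction, so the
/// whole thing either commits or rolls back, which is the property the CTE
/// previously provided.
pub fn soft_delete_volume(
    conn: &mut PgConnection,
    volume_id: uuid::Uuid,
) -> QueryResult<()> {
    conn.transaction(|conn| {
        // Mark the volume as deleted if it isn't already.
        diesel::update(volume::table)
            .filter(volume::id.eq(volume_id))
            .filter(volume::time_deleted.is_null())
            .set(volume::time_deleted.eq(Some(Utc::now())))
            .execute(conn)?;

        // ...conditionally reduce region_snapshot references, collect the
        // resources to clean up, etc., as further statements in the same
        // transaction (elided here).
        Ok(())
    })
}
```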
Create new enum types and return those to give more information to callers:

- `create_region_snapshot_replacement_step` and `insert_region_snapshot_replacement_step` now return `InsertRegionSnapshotReplacementStepResult`
- `volume_replace_region` and `volume_replace_snapshot` now return `VolumeReplaceResult`

Notably, `VolumeReplaceResult::ExistingVolumeDeleted` replaces the previous error type `TargetVolumeDeleted`, and is not an error, allowing the caller to take action if the existing volume was deleted. This commit was peeled off work in progress to address #6353.
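Roughly, the new result type has this shape (the variant set here is illustrative; only `ExistingVolumeDeleted` is named above):

```rust
/// Illustrative sketch of a result type like `VolumeReplaceResult`: the
/// "existing volume was deleted" case is a normal outcome the caller can
/// branch on, rather than an error like the old `TargetVolumeDeleted`.
#[derive(Debug, Clone, PartialEq)]
pub enum VolumeReplaceResult {
    /// The replacement was performed.
    Done,
    /// The volume already contained the requested replacement.
    AlreadyHappened,
    /// The existing volume was deleted, so there is nothing to replace;
    /// the caller decides what to do next.
    ExistingVolumeDeleted,
}

// Hypothetical caller, matching on the result instead of handling an error:
fn handle(result: VolumeReplaceResult) {
    match result {
        VolumeReplaceResult::Done | VolumeReplaceResult::AlreadyHappened => {
            // proceed with the replacement workflow
        }
        VolumeReplaceResult::ExistingVolumeDeleted => {
            // the volume is gone; skip this step and clean up instead
        }
    }
}
```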
Presently, `omdb` can be used to query the database for instance records using the `omdb instances` command. However, this command does not output all the database state associated with an instance. Instead, it displays the instance's name, UUID, effective state, active Propolis UUID, sled UUID, and sled serial (the last three only if the instance is running). This is some of the most useful information about a customer instance, but it's not *all* of the information we have about it. Debugging issues related to instance state management may require other information from the instance record, as well as from the instance's active VMM record, and its target VMM and active migration records (if it is migrating).

Adding all of these additional details to the `omdb db instances` command is a bit of a non-starter, though. Since that command outputs a table of all instances, with one line per instance, the output is already *much* wider than 80 columns, and requires a pretty wide terminal to be readable --- once the table has line wrapped, it's much harder to interpret. Therefore, I've added an `omdb db instance info` command that outputs a complete dump of the state of a *single* instance. Since this command is no longer constrained to putting all salient details about the instance on a single line in a table, we can show every field in the instance's DB representation without having to worry about wrapping.

Output from `omdb db instance info` looks like this (yes, I spent entirely too long tweaking the formatting 😅):

```console
root@oxz_switch1:~# /var/tmp/omdb-eliza-test-15 db instance show fff916c4-814a-48a0-a04a-f887e6f81a9e
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (98.0.0)

INSTANCE
                     ID: fff916c4-814a-48a0-a04a-f887e6f81a9e
             project ID: fe0da422-5c48-4b52-8010-f2fc401f090f
                   name: app-13
            description: application instance
             created at: 2024-06-22 00:31:43.889412 UTC
       last modified at: 2024-06-22 00:31:43.889412 UTC
        /!\ deleted at: 2024-06-22 00:43:59.156149 UTC

CONFIGURATION
                  vCPUs: 1
                 memory: 2 GiB
               hostname: node13
    auto-restart policy: None

RUNTIME STATE
            nexus state: Destroyed
(i) external API state: Destroyed
        last updated at: 2024-06-22T00:43:49.440274Z (generation 5)
          active VMM ID: None
          target VMM ID: None
           migration ID: None
           updater lock: UNLOCKED at generation: 1
root@oxz_switch1:~# /var/tmp/omdb-eliza-test-15 db instance show 480c6167-dac2-4df7-808d-c9e776c1c312
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (98.0.0)

INSTANCE
                     ID: 480c6167-dac2-4df7-808d-c9e776c1c312
             project ID: fe0da422-5c48-4b52-8010-f2fc401f090f
                   name: mysql-3
            description: MySQL Instance
             created at: 2024-09-16 00:43:10.264333 UTC
       last modified at: 2024-09-16 00:43:10.264333 UTC

CONFIGURATION
                  vCPUs: 4
                 memory: 16 GiB
               hostname: mysql-3
    auto-restart policy: None

RUNTIME STATE
            nexus state: Vmm
(i) external API state: Stopping
        last updated at: 2024-09-19T23:15:50.698207Z (generation 5)
          active VMM ID: Some(51babee3-0ee5-4f67-bcbb-2189a39479a3)
          target VMM ID: None
           migration ID: None
           updater lock: UNLOCKED at generation: 4
      active VMM record: Vmm {
    id: 51babee3-0ee5-4f67-bcbb-2189a39479a3,
    time_created: 2024-09-19T23:15:50.647454Z,
    time_deleted: None,
    instance_id: 480c6167-dac2-4df7-808d-c9e776c1c312,
    sled_id: 71def415-55ad-46b4-ba88-3ca55d7fb287,
    propolis_ip: V6(
        Ipv6Network {
            addr: fd00:1122:3344:10b::1:4be,
            prefix: 128,
        },
    ),
    propolis_port: SqlU16(
        12400,
    ),
    runtime: VmmRuntimeState {
        time_state_updated: 2024-09-19T23:18:03.015643Z,
        gen: Generation(
            Generation(
                4,
            ),
        ),
        state: Stopping,
    },
}
root@oxz_switch1:~#
```
Adds a module in `oximeter_db` that speaks the native ClickHouse protocol, a message-oriented protocol directly over TCP. This includes the core messaging and serde needed for making a handshake with the server, and then running most standard SQL queries. This uses the codecs from `tokio_util` to turn the TCP stream into a sink / stream of messages, which is much more convenient to operate on. The client and server exchange a handshake, after which the client may run SQL queries to select data. There are a number of tests in the module, including low-level serialization checks and asserting the results of actual SQL queries.

In addition, this adds a barebones SQL shell -- this is not intended to replace the feature-rich official CLI, but is very useful for testing the protocol implementation by running arbitrary queries. This module is not consumed anywhere beyond the shell itself. A follow-up commit will integrate it into the existing `oximeter_db::Client` object, with the primary goal of making OxQL queries much more flexible and efficient.
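As a rough illustration of the `tokio_util` codec approach described here (the packet types and the parsing/serialization functions below are placeholders, not the actual ClickHouse native wire format):

```rust
// Sketch: turning a TCP stream into a Sink/Stream of protocol messages with
// tokio_util's codec machinery. `ServerPacket`, `ClientPacket`, and the
// parse/serialize helpers are stand-ins for the real native-protocol types.
use bytes::BytesMut;
use tokio::net::TcpStream;
use tokio_util::codec::{Decoder, Encoder, Framed};

#[derive(Debug)]
pub enum ServerPacket { Hello, Data(Vec<u8>) }

#[derive(Debug)]
pub enum ClientPacket { Hello, Query(String) }

#[derive(Default)]
pub struct NativeCodec;

impl Decoder for NativeCodec {
    type Item = ServerPacket;
    type Error = std::io::Error;

    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<ServerPacket>, Self::Error> {
        // Return Ok(None) until a full message has been buffered, then
        // consume those bytes from `src` and return the parsed packet.
        parse_packet(src)
    }
}

impl Encoder<ClientPacket> for NativeCodec {
    type Error = std::io::Error;

    fn encode(&mut self, item: ClientPacket, dst: &mut BytesMut) -> Result<(), Self::Error> {
        serialize_packet(&item, dst);
        Ok(())
    }
}

// Placeholder (de)serialization routines.
fn parse_packet(_src: &mut BytesMut) -> Result<Option<ServerPacket>, std::io::Error> { Ok(None) }
fn serialize_packet(_item: &ClientPacket, _dst: &mut BytesMut) {}

pub async fn connect(addr: &str) -> std::io::Result<Framed<TcpStream, NativeCodec>> {
    let stream = TcpStream::connect(addr).await?;
    // `Framed` gives us a Sink<ClientPacket> + Stream<Item = ServerPacket>;
    // the handshake and queries then become `send(...)` / `next().await` calls.
    Ok(Framed::new(stream, NativeCodec))
}
```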
A follow-up PR will use this for clickhouse keeper allocation.
Currently, the `instance-update` saga's `siu_commit_instance_updates` action attempts to check whether an additional update saga is needed and to start one if so. This is done in the same action that writes back the updated instance record. In [this comment][1] on PR #6503, @davepacheco pointed out that conflating these two operations makes the idempotency tests for the update saga less effective, since the action performs multiple operations (even if some of them are permitted to fail). This commit refactors the instance-update saga so that the chaining operation is performed in a separate action from the rest of the saga. There should be no functional change to the saga's behavior as a result of this, beyond making the idempotency tests more correct.

[1]: #6503 (comment)
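Schematically, the split looks something like this (the second action's name and the context/error types are placeholders, not the literal saga code):

```rust
// Hypothetical, simplified saga-action signatures illustrating the split.
struct SagaCtx;        // stand-in for the saga's action context
#[derive(Debug)]
struct ActionError;    // stand-in for the saga error type

/// After the refactor, this action only writes back the updated instance
/// record (and releases the updater lock); it no longer decides whether to
/// chain another update saga.
async fn siu_commit_instance_updates(_ctx: &SagaCtx) -> Result<(), ActionError> {
    // write the new instance runtime state here
    Ok(())
}

/// A separate, final action checks whether another update saga is needed and
/// starts it, so replaying either action in the idempotency tests exercises
/// exactly one operation. (Name is hypothetical.)
async fn siu_chain_successor_saga(_ctx: &SagaCtx) -> Result<(), ActionError> {
    // if the instance still needs an update, trigger a new instance-update saga
    Ok(())
}
```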
PR #6503 changed Nexus to attempt to automatically restart instances which are in the `Failed` state. Now that we do this, we should probably change the allowable instance state transitions to permit a user to stop an instance that is `Failed`, as a way to say "stop trying to restart this instance" (as `Stopped` instances are not restarted). This branch changes `Nexus::instance_request_state` and `select_instance_change_action` to permit stopping a `Failed` instance. Fixes #6640. I believe this also fixes #2825, along with #6455 (which allowed restarting `Failed` instances).
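Schematically (a simplified sketch, not the actual `select_instance_change_action` logic):

```rust
// Hypothetical, simplified version of the "may this instance be stopped?"
// check; only the Failed arm reflects the change described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InstanceState { Creating, Starting, Running, Stopping, Stopped, Rebooting, Failed }

fn stop_request_allowed(state: InstanceState) -> bool {
    match state {
        // States from which a stop request was already accepted (illustrative).
        InstanceState::Running | InstanceState::Stopping | InstanceState::Stopped => true,
        // New: a Failed instance may now be stopped, which takes it out of the
        // pool of instances the reincarnation task will try to restart
        // (Stopped instances are never auto-restarted).
        InstanceState::Failed => true,
        // Other states (illustrative): not stoppable.
        _ => false,
    }
}
```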
When an `instance-start` saga unwinds, any VMM it created transitions to the `SagaUnwound` state. This causes the instance's effective state to appear as `Failed` in the external API. PR #6503 added functionality to Nexus to automatically restart instances that are in the `Failed` state ("instance reincarnation"). However, the current instance-reincarnation task will _not_ automatically restart instances whose instance-start sagas have unwound, because such instances are not actually in the `Failed` state from Nexus' perspective. This PR implements reincarnation for instances whose `instance-start` sagas have failed. This is done by changing the `instance_reincarnation` background task to query the database for instances which have `SagaUnwound` active VMMs, and then run `instance-start` sagas for them identically to how it runs start sagas for `Failed` instances. I decided to perform two separate queries to list `Failed` instances and to list instances with `SagaUnwound` VMMs, because the `SagaUnwound` query requires a join with the `vmm` table, and I thought it was a bit nicer to be able to find `Failed` instances without having to do the join, and only do it when looking for `SagaUnwound` ones. Also, having two queries makes it easier to distinguish between `Failed` and `SagaUnwound` instances in logging and the OMDB status output. This ended up being implemented by adding a parameter to the `DataStore::find_reincarnatable_instances` method that indicates which category of instances to select; I had previously considered making the method on the `InstanceReincarnation` struct that finds instances and reincarnates them take the query as a `Fn` taking the datastore and `DataPageParams` and returning an `impl Future` outputting `Result<Vec<Instance>, ...>`, but figuring out generic lifetimes for the pagination stuff was annoying enough that this felt like the simpler choice. Fixes #6638
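The shape of the change, paraphrased (the enum name and the commented-out signature are approximations, not the exact code):

```rust
/// Which category of restartable instance a single query should look for.
/// Keeping the two cases separate avoids the `vmm`-table join when listing
/// plain `Failed` instances, and lets logs and OMDB output distinguish them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReincarnatableInstance {
    /// The instance record itself is in the `Failed` state.
    Failed,
    /// The instance's active VMM is `SagaUnwound` (its `instance-start` saga
    /// created the VMM record and then unwound).
    SagaUnwound,
}

// Approximate shape of the datastore method: the background task calls it
// once per category and runs instance-start sagas over both result sets
// identically.
//
// async fn find_reincarnatable_instances(
//     &self,
//     opctx: &OpContext,
//     which: ReincarnatableInstance,
//     pagparams: &DataPageParams<'_, Uuid>,
// ) -> ListResultVec<Instance>;
```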
This commit extends the `instance-reconfigure` API endpoint added in #6585 to also allow setting instance auto-restart policies (as added in #6503). I've also added the actual auto-restart policy to the external API instance view, along with the boolean `auto_restart_enabled` added in #6503. This way, it's possible to change just the boot disk by providing the current auto-restart policy in an instance POST.
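For illustration, the reconfigure body now looks roughly like this (type and field names are approximations of the external API types, based on the `all_failures` policy shown earlier in this thread):

```rust
use serde::{Deserialize, Serialize};

// Approximate sketch of the auto-restart policy as it appears in the external
// API (serialized as e.g. `{ "type": "all_failures" }`).
#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(rename_all = "snake_case", tag = "type")]
pub enum InstanceAutoRestartPolicy {
    /// Never automatically restart this instance.
    Never,
    /// Restart the instance on any failure.
    AllFailures,
}

/// Approximate shape of the instance-reconfigure body: callers now supply the
/// auto-restart policy alongside the boot disk, so "change only the boot
/// disk" means re-sending the current policy.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct InstanceUpdate {
    pub boot_disk: Option<String>,
    pub auto_restart_policy: Option<InstanceAutoRestartPolicy>,
}
```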
As of #6455, instances in the `Failed` state are allowed to be restarted. As described in RFD 486 § 4.3, Nexus should automatically restart such instances when the instance record's `auto_restart_policy` field (née `boot_on_fault`, see #6499) indicates that it is permitted to do so. This branch implements this behavior by adding a new `instance_reincarnation` RPW. This RPW consists of a background task which queries CRDB for instances in the `Failed` state which are eligible to be automatically restarted, and starts a new `instance-start` saga for any instances it discovers.

Instances are considered eligible for reincarnation if all of the following conditions are true:

- The instance is in the `InstanceState::Failed` state. Meaning, it must have no active VMM --- if the instance is in the `InstanceState::Vmm` state with an active VMM which is in `VmmState::Failed`, that indicates no instance-update saga has run to clean up the old VMM's resources. In this case, we must wait until the instance record itself has transitioned to `Failed` before attempting to restart it.
- The instance's `auto_restart_policy` permits it to be restarted. Presently, this means that only instances with the `InstanceAutoRestart::AllFailures` policy will be automatically restarted. The `InstanceAutoRestart::SledFailuresOnly` policy permits the instance to be restarted if and only if Nexus is able to definitively determine that the instance failed due to a fault that affected the entire sled, and, at present, we don't record this information when transitioning an instance to `Failed`. Therefore, we will not currently reincarnate such instances. Future work will provide mechanisms for Nexus to determine whether an instance has transitioned to `Failed` due to a sled-wide failure, permitting instances with the `SledFailuresOnly` policy to be reincarnated depending on why they were marked as `Failed`.

The implementation of the actual reincarnation background task is relatively straightforward: it runs a new `DataStore::find_reincarnatable_instances` query, which returns instances in the `Failed` state, and it spawns `instance-start` sagas for any instances which satisfy the above conditions. The restart cooldown is implemented by adding a new `time_last_auto_restarted` field to the `instance` table and to the `InstanceRuntimeState` type. We add a `Reason` enum to the `instance-start` saga's `Params` that indicates why the instance is being started, and if it is `Reason::AutoRestart`, then the start saga will set the `time_last_auto_restarted` timestamp when moving the instance record to the `Starting` state. Additionally, I've added the `auto_restart_policy` field to the `params::InstanceCreate` type in the Nexus API so that instances can be created with auto-restart policies. Reviewers should note that many of the files changed in this diff were only modified by adding that field to a `params::InstanceCreate` literal.

This branch adds a minimal, initial implementation of instance auto-restarting. There's plenty of follow-up work to do when this lands:

- Recording why an instance was marked `Failed`. In order to actually implement the `SledFailuresOnly` instance auto-restart policy, we need to...actually know when an instance was marked as `Failed` due to a sled-wide failure rather than an instance-specific failure. In some cases, such as when a VMM is marked as `Failed` due to a sled-agent returning a 404 `NO_SUCH_INSTANCE` error indicating that it is no longer aware of the VMM, determining this is fairly complex: was the instance forgotten because the sled-agent (or the entire sled) restarted, or was it forgotten because its individual VMM was restarted? But, in other cases, like when a VMM was marked `Failed` because it was on an `Expunged` sled (see "[nexus] Mark VMMs on `Expunged` sleds as `Failed`" #6519), the failure is clearly sled-wide. We should record why a VMM/instance was marked `Failed`, at least when it's possible to do so, so that the failure reason could be considered by the auto-restart policy.
- UI work: we may want to differentiate between instances that are in the `Starting` state because they were explicitly started by the user, and instances that are automatically restarting due to a failure, and we may also want to show the cooldown time when an instance is in the `Failed` state and is waiting to be restarted again. If @david-crespo or @charliepark want to chat about this, I'm happy to help out!

Fixes #5553
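To summarize the eligibility rules in code form, here's a hedged sketch with simplified types and field names (not the actual implementation):

```rust
use chrono::{DateTime, Duration, Utc};

// Simplified stand-ins for the real types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InstanceState { Vmm, NoVmm, Failed, Destroyed }

#[derive(Debug, Clone, Copy, PartialEq)]
enum InstanceAutoRestart { Never, SledFailuresOnly, AllFailures }

struct Instance {
    state: InstanceState,
    auto_restart_policy: InstanceAutoRestart,
    time_last_auto_restarted: Option<DateTime<Utc>>,
}

/// Sketch of the eligibility check: the instance itself must be Failed (not
/// merely pointing at a Failed VMM), its policy must allow restarts, and the
/// restart cooldown must have elapsed.
fn eligible_for_reincarnation(instance: &Instance, cooldown: Duration, now: DateTime<Utc>) -> bool {
    // The instance record itself must be Failed; if it is still in the Vmm
    // state pointing at a Failed VMM, an update saga hasn't cleaned up yet.
    if instance.state != InstanceState::Failed {
        return false;
    }
    // Only AllFailures is restarted today: SledFailuresOnly would require
    // knowing *why* the instance failed, which isn't recorded yet.
    if instance.auto_restart_policy != InstanceAutoRestart::AllFailures {
        return false;
    }
    // Respect the cooldown since the last automatic restart, if any.
    match instance.time_last_auto_restarted {
        Some(last) => now >= last + cooldown,
        None => true,
    }
}
```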