[nexus] Reincarnate Failed instances #6503
[RFD 486] describes a principled design for how to use the Failed state for instances and VMMs. In particular, it proposes that:

- Instances and VMMs should move to the `Failed` state independently.
- A VMM goes to `Failed` if either:
  + Nexus observes that a VMM it thought was resident on a sled is no longer resident there, or,
  + it observes through some other means (sled expungement, sled reboot) that all VMMs that were resident on that sled must now be gone.
- Instances go to `Failed` when an `instance-update` saga observes that an instance's active VMM has been marked `Failed`. An update saga transitioning an instance to `Failed` cleans up the instance's resource allocations, similarly to how update sagas handle transitioning an instance to `Destroyed` when its active VMM has been destroyed.
- For purposes of the external API (see the sketch below):
  + An instance that is itself in the `Failed` state appears externally to be `Failed`.
  + If an instance is in the "has-VMM" internal state and points to a `Failed` VMM, the instance appears to be `Stopping` and then transitions to `Failed`.
  + Once the actual instance (not just its active VMM) is `Failed`, it can be started or destroyed.

This branch implements the behavior described above. In particular, I've added code to the `instance-update` saga to handle instances whose active or migration target VMMs have transitioned to `Failed`, similarly to how it handles VMMs transitioning to `Destroyed`. I've changed the code that detects whether a sled-agent error indicates that an instance has failed to only do so when the error is a 404 with the specific error code indicating that the sled-agent has forgotten about a VMM it previously knew about, and changed the Nexus functions that ask sled-agents to update an instance's state, as well as the `instance_watcher` background task, to mark instances as `Failed` only when they encounter that specific sled-agent error. And, I've changed the `mark_instance_failed` function in Nexus to instead mark the VMM record as failed and trigger an instance-update saga (it's now named `mark_vmm_failed`), and made similar changes to the `instance_watcher` background task. Finally, I've also made sure that `Failed` instances can be destroyed and restarted, and that instances with `Failed` active VMMs appear to be `Stopping` until the VMM is cleaned up by an update saga.

In addition to this, it was necessary to (re)introduce a `VmmState::Creating` variant. The purpose of this variant is to differentiate VMMs which have been created in the database but do not yet exist on a sled from those which are known to sled-agent and are actually starting or migrating. This is necessary because, presently, the `instance_watcher` background task will attempt to perform health checks for `Starting` and `Migrating` VMMs, and --- if the sled-agent does not actually know about the VMM yet --- will move them to `Failed`. But a newly created VMM is not known to the sled-agent until it has been registered, so if the `instance_watcher` task activates while an `instance-start` or `instance-migrate` saga is in flight, it may incorrectly move that saga's VMM to `Failed` if the VMM record has been created but not yet registered. With the `Creating` variant, we can avoid performing health checks for those VMMs until the sled-agent actually acknowledges a request to register the VMM.
Besides the behavior implemented here, RFD 486 also proposes that the `boot_on_fault` field in the instance record be used by Nexus to determine whether to automatically attempt to restart a `Failed` instance. This is described in issues #6491 and #4872. I'm planning to implement this in a separate branch, PR #6503.

[RFD 486]: https://rfd.shared.oxide.computer/rfd/0486#_determinations
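To make the external-API mapping above concrete, here's a rough sketch; the enums and the mapping function are simplified stand-ins for the real Nexus types, not the actual code:

```rust
// Simplified, hypothetical stand-ins for the real Nexus state types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InstanceState { Creating, Vmm, NoVmm, Failed, Destroyed }

#[derive(Debug, Clone, Copy, PartialEq)]
enum VmmState { Creating, Starting, Running, Stopping, Stopped, Migrating, Failed, Destroyed }

#[derive(Debug, Clone, Copy, PartialEq)]
enum ExternalState { Creating, Starting, Running, Stopping, Stopped, Migrating, Failed, Destroyed }

/// Sketch of the mapping described above: an instance that points at a
/// `Failed` VMM appears to be `Stopping` until an update saga has cleaned the
/// VMM up and moved the instance record itself to `Failed`. The non-Failed
/// arms are illustrative only.
fn external_state(instance: InstanceState, active_vmm: Option<VmmState>) -> ExternalState {
    match (instance, active_vmm) {
        // An instance that is itself Failed appears Failed externally.
        (InstanceState::Failed, _) => ExternalState::Failed,
        (InstanceState::Destroyed, _) => ExternalState::Destroyed,
        // "Has-VMM" state pointing at a Failed VMM: show Stopping until the
        // instance-update saga runs and marks the instance itself Failed.
        (InstanceState::Vmm, Some(VmmState::Failed)) => ExternalState::Stopping,
        // A Creating VMM is not yet registered with a sled-agent; externally
        // the instance still looks like it is starting (illustrative).
        (InstanceState::Vmm, Some(VmmState::Creating | VmmState::Starting)) => ExternalState::Starting,
        (InstanceState::Vmm, Some(VmmState::Running)) => ExternalState::Running,
        (InstanceState::Vmm, Some(VmmState::Stopping)) => ExternalState::Stopping,
        (InstanceState::Vmm, Some(VmmState::Migrating)) => ExternalState::Migrating,
        (InstanceState::Vmm, Some(VmmState::Stopped | VmmState::Destroyed))
        | (InstanceState::Vmm, None)
        | (InstanceState::NoVmm, _) => ExternalState::Stopped,
        (InstanceState::Creating, _) => ExternalState::Creating,
    }
}
```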
While I was working on #6503, I became quite taken with the way @jmpesp had done the OMDB status for his region-replacement background tasks --- the approach of defining a struct in `nexus-types` representing the status JSON that both the background task implementation and OMDB could depend on seems much nicer than constructing an untyped JSON message and then deserializing it to a type that only exists in OMDB, because this way, OMDB and the background task implementation are kept in sync automatically. So, I thought it would be nice to also rework the `abandoned_vmm_reaper` task to use this style. This commit does that. While I was making this change, I also changed the status JSON to include a list of all errors that occurred during the task's activation, instead of just the last one, which will probably make it easier to debug any issues that come up. I also added some code for aligning the numeric columns in the OMDB output.
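For anyone skimming, the pattern looks roughly like this (the struct and field names are hypothetical, not the exact type in `nexus-types`):

```rust
// Hypothetical status type, shared via `nexus-types`, that both the
// background task (which serializes it) and OMDB (which deserializes it)
// depend on, so the two cannot drift apart.
use serde::{Deserialize, Serialize};

#[derive(Debug, Default, Serialize, Deserialize)]
pub struct AbandonedVmmReaperStatus {
    /// Total number of abandoned VMMs found during this activation.
    pub vmms_found: usize,
    /// Number of sled resource reservations deleted.
    pub sled_reservations_deleted: usize,
    /// Number of VMM records marked as deleted.
    pub vmms_deleted: usize,
    /// All errors that occurred during this activation, not just the last one.
    pub errors: Vec<String>,
}
```

The task serializes this struct into its activation status JSON, and `omdb` deserializes the same type, so both sides always agree on the shape of the data.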
I intend to do some manual testing on this.
I screwed up the test here a bit and am in the process of redoing it, please hold...
OH YEAH IT WORKS! Creating an instance (using a manual API call):

```console
eliza@theseus ~/Code/oxide/oxide.rs $ ./cli.sh api '/v1/instances?project=test' --method POST --input - <<EOF
{
  "name": "restartme",
  "description": "my cool instance",
  "hostname": "restartme",
  "memory": 34359738368,
  "ncpus": 2,
  "disks": [
    {
      "type": "attach",
      "name": "restartme-boot"
    }
  ],
  "network_interfaces": {
    "type": "default"
  },
  "external_ips": [
    {
      "type": "ephemeral",
      "pool_name": "default"
    }
  ],
  "user_data": "",
  "start": true,
  "auto_restart_policy": { "type": "all_failures" }
}
EOF
{
  "description": "my cool instance",
  "hostname": "restartme",
  "id": "76d6b13a-58f7-4b80-adcd-96ff3510485e",
  "memory": 34359738368,
  "name": "restartme",
  "ncpus": 2,
  "project_id": "0610d1e0-36f6-4530-b44d-b283189676eb",
  "run_state": "starting",
  "time_created": "2024-09-10T21:53:31.913311Z",
  "time_modified": "2024-09-10T21:53:31.913311Z",
  "time_run_state_updated": "2024-09-10T21:53:34.885879Z"
}
```

The instance has started, and we can see it in `omdb`:

```console
eliza@theseus ~/Code/oxide/oxide.rs $ ssh londonswitch
The illumos Project     helios-2.0.22860        September 2024
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (97.0.0)
ID                                   NAME      STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
1c769a92-6769-4c9e-bb62-3fadee30de01 normal    running a994a91a-a0b0-4159-ad57-7131f1853671 d5199739-9853-4fdf-a3a5-aefef4b5e5d2 BRM44220007
76d6b13a-58f7-4b80-adcd-96ff3510485e restartme running f6d80dd9-1e9c-4ef7-a5ae-9b805e8e1aab cf8ee87e-0858-48e9-964a-44a705113b1c BRM42220036
```

I pull the (metaphorical) plug on the sled that instance is running on:

```console
root@oxz_switch1:~# pilot sp cycle BRM42220036
14  BRM42220036        ok
```

Was hoping to actually catch all the state transitions here, but unfortunately, it turns out that `omdb` couldn't reach the database for a little while right after the power cycle:

```console
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
WARN: failed to query schema version: Service Unavailable: Failed to access DB connection: Request timed out
It's possible the database is running a version that's different from what this
tool understands. This may result in errors or incorrect output.
^X^C
```

But, we can still see it come back, on a different sled (and, I caught the ...):

```console
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (97.0.0)
ID                                   NAME      STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
76d6b13a-58f7-4b80-adcd-96ff3510485e restartme running 876279ed-496c-4793-9cf4-7cbe9a9658fd d5199739-9853-4fdf-a3a5-aefef4b5e5d2 BRM44220007
1c769a92-6769-4c9e-bb62-3fadee30de01 normal    running a994a91a-a0b0-4159-ad57-7131f1853671 d5199739-9853-4fdf-a3a5-aefef4b5e5d2 BRM44220007
root@oxz_switch1:~#
```
Connect over the bootstrap network with sprockets tls
Change the process of soft-deleting a volume from a CTE into a transaction. This is done to make changes easier to apply and review. One part of the CTE that remains is the query fragment that conditionally reduces volume references for region_snapshot rows, then returns only the rows updated. It was easier to keep this than figure out how to get it to work in diesel!
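For context, a minimal sketch of what moving from the CTE to a transaction looks like, using plain synchronous Diesel and a made-up `volume` table; Omicron's real datastore code uses its async connection helpers and different columns, so treat this purely as an illustration:

```rust
use chrono::Utc;
use diesel::prelude::*;

// Hypothetical table definition for illustration only.
diesel::table! {
    volume (id) {
        id -> Uuid,
        time_deleted -> Nullable<Timestamptz>,
        data -> Text,
    }
}

/// Sketch: every step of the soft-delete runs in one transaction, so the
/// whole thing either commits or rolls back, which is the property the CTE
/// previously provided.
pub fn soft_delete_volume(
    conn: &mut PgConnection,
    volume_id: uuid::Uuid,
) -> QueryResult<()> {
    conn.transaction(|conn| {
        // Mark the volume as deleted if it isn't already.
        diesel::update(volume::table)
            .filter(volume::id.eq(volume_id))
            .filter(volume::time_deleted.is_null())
            .set(volume::time_deleted.eq(Some(Utc::now())))
            .execute(conn)?;

        // ...conditionally reduce region_snapshot references, collect the
        // resources to clean up, etc., as further statements in the same
        // transaction (elided here).
        Ok(())
    })
}
```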
Create new enum types and return those to give more information to callers:

- `create_region_snapshot_replacement_step` and `insert_region_snapshot_replacement_step` now return `InsertRegionSnapshotReplacementStepResult`
- `volume_replace_region` and `volume_replace_snapshot` now return `VolumeReplaceResult`

Notably, `VolumeReplaceResult::ExistingVolumeDeleted` replaces the previous error type `TargetVolumeDeleted`, and is not an error, allowing the caller to take action if the existing volume was deleted. This commit was peeled off work in progress to address #6353.
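Roughly, the new result type has this shape (the variant set here is illustrative; only `ExistingVolumeDeleted` is named above):

```rust
/// Illustrative sketch of a result type like `VolumeReplaceResult`: the
/// "existing volume was deleted" case is a normal outcome the caller can
/// branch on, rather than an error like the old `TargetVolumeDeleted`.
#[derive(Debug, Clone, PartialEq)]
pub enum VolumeReplaceResult {
    /// The replacement was performed.
    Done,
    /// The volume already contained the requested replacement.
    AlreadyHappened,
    /// The existing volume was deleted, so there is nothing to replace;
    /// the caller decides what to do next.
    ExistingVolumeDeleted,
}

// Hypothetical caller, matching on the result instead of handling an error:
fn handle(result: VolumeReplaceResult) {
    match result {
        VolumeReplaceResult::Done | VolumeReplaceResult::AlreadyHappened => {
            // proceed with the replacement workflow
        }
        VolumeReplaceResult::ExistingVolumeDeleted => {
            // the volume is gone; skip this step and clean up instead
        }
    }
}
```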
Presently, `omdb` can be used to query the database for instance records using the `omdb instances` command. However, this command does not output all the database state associated with an instance. Instead, it displays the instance's name, UUID, effective state, active Propolis UUID, sled UUID, and sled serial (the last three only if the instance is running). This is some of the most useful information about a customer instance, but it's not *all* of the information we have about it. Debugging issues related to instance state management may require other information from the instance record, as well as from the instance's active VMM record, and its target VMM and active migration records (if it is migrating).

Adding all of these additional details to the `omdb db instances` command is a bit of a non-starter, though. Since that command outputs a table of all instances, with one line per instance, the output is already *much* wider than 80 columns, and requires a pretty wide terminal to be readable --- once the table has line wrapped, it's much harder to interpret. Therefore, I've added an `omdb db instance info` command that outputs a complete dump of the state of a *single* instance. Since this command is no longer constrained to putting all salient details about the instance on a single line in a table, we can show every field in the instance's DB representation without having to worry about wrapping.

Output from `omdb db instance info` looks like this (yes, I spent entirely too long tweaking the formatting 😅):

```console
root@oxz_switch1:~# /var/tmp/omdb-eliza-test-15 db instance show fff916c4-814a-48a0-a04a-f887e6f81a9e
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (98.0.0)

INSTANCE
                     ID: fff916c4-814a-48a0-a04a-f887e6f81a9e
             project ID: fe0da422-5c48-4b52-8010-f2fc401f090f
                   name: app-13
            description: application instance
             created at: 2024-06-22 00:31:43.889412 UTC
       last modified at: 2024-06-22 00:31:43.889412 UTC
        /!\ deleted at: 2024-06-22 00:43:59.156149 UTC

CONFIGURATION
                  vCPUs: 1
                 memory: 2 GiB
               hostname: node13
    auto-restart policy: None

RUNTIME STATE
            nexus state: Destroyed
(i) external API state: Destroyed
        last updated at: 2024-06-22T00:43:49.440274Z (generation 5)
          active VMM ID: None
          target VMM ID: None
           migration ID: None
           updater lock: UNLOCKED at generation: 1
root@oxz_switch1:~# /var/tmp/omdb-eliza-test-15 db instance show 480c6167-dac2-4df7-808d-c9e776c1c312
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (98.0.0)

INSTANCE
                     ID: 480c6167-dac2-4df7-808d-c9e776c1c312
             project ID: fe0da422-5c48-4b52-8010-f2fc401f090f
                   name: mysql-3
            description: MySQL Instance
             created at: 2024-09-16 00:43:10.264333 UTC
       last modified at: 2024-09-16 00:43:10.264333 UTC

CONFIGURATION
                  vCPUs: 4
                 memory: 16 GiB
               hostname: mysql-3
    auto-restart policy: None

RUNTIME STATE
            nexus state: Vmm
(i) external API state: Stopping
        last updated at: 2024-09-19T23:15:50.698207Z (generation 5)
          active VMM ID: Some(51babee3-0ee5-4f67-bcbb-2189a39479a3)
          target VMM ID: None
           migration ID: None
           updater lock: UNLOCKED at generation: 4
      active VMM record: Vmm {
    id: 51babee3-0ee5-4f67-bcbb-2189a39479a3,
    time_created: 2024-09-19T23:15:50.647454Z,
    time_deleted: None,
    instance_id: 480c6167-dac2-4df7-808d-c9e776c1c312,
    sled_id: 71def415-55ad-46b4-ba88-3ca55d7fb287,
    propolis_ip: V6(
        Ipv6Network {
            addr: fd00:1122:3344:10b::1:4be,
            prefix: 128,
        },
    ),
    propolis_port: SqlU16(
        12400,
    ),
    runtime: VmmRuntimeState {
        time_state_updated: 2024-09-19T23:18:03.015643Z,
        gen: Generation(
            Generation(
                4,
            ),
        ),
        state: Stopping,
    },
}
root@oxz_switch1:~#
```
Adds a module in `oximeter_db` that speaks the native ClickHouse protocol, a message-oriented protocol directly over TCP. This includes the core messaging and serde needed for making a handshake with the server, and then running most standard SQL queries. This uses the codecs from `tokio_util` to turn the TCP stream into a sink / stream of messages, which is much more convenient to operate on. The client and server exchange a handshake, after which the client may run SQL queries to select data. There are a number of tests in the module, including low-level serialization checks and asserting the results of actual SQL queries.

In addition, this adds a barebones SQL shell -- this is not intended to replace the feature-rich official CLI, but is very useful for testing the protocol implementation by running arbitrary queries. This module is not consumed anywhere beyond the shell itself. A follow-up commit will integrate it into the existing `oximeter_db::Client` object, with the primary goal of making OxQL queries much more flexible and efficient.
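As a rough illustration of the `tokio_util` codec approach described here (the packet types and the parsing/serialization functions below are placeholders, not the actual ClickHouse native wire format):

```rust
// Sketch: turning a TCP stream into a Sink/Stream of protocol messages with
// tokio_util's codec machinery. `ServerPacket`, `ClientPacket`, and the
// parse/serialize helpers are stand-ins for the real native-protocol types.
use bytes::BytesMut;
use tokio::net::TcpStream;
use tokio_util::codec::{Decoder, Encoder, Framed};

#[derive(Debug)]
pub enum ServerPacket { Hello, Data(Vec<u8>) }

#[derive(Debug)]
pub enum ClientPacket { Hello, Query(String) }

#[derive(Default)]
pub struct NativeCodec;

impl Decoder for NativeCodec {
    type Item = ServerPacket;
    type Error = std::io::Error;

    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<ServerPacket>, Self::Error> {
        // Return Ok(None) until a full message has been buffered, then
        // consume those bytes from `src` and return the parsed packet.
        parse_packet(src)
    }
}

impl Encoder<ClientPacket> for NativeCodec {
    type Error = std::io::Error;

    fn encode(&mut self, item: ClientPacket, dst: &mut BytesMut) -> Result<(), Self::Error> {
        serialize_packet(&item, dst);
        Ok(())
    }
}

// Placeholder (de)serialization routines.
fn parse_packet(_src: &mut BytesMut) -> Result<Option<ServerPacket>, std::io::Error> { Ok(None) }
fn serialize_packet(_item: &ClientPacket, _dst: &mut BytesMut) {}

pub async fn connect(addr: &str) -> std::io::Result<Framed<TcpStream, NativeCodec>> {
    let stream = TcpStream::connect(addr).await?;
    // `Framed` gives us a Sink<ClientPacket> + Stream<Item = ServerPacket>;
    // the handshake and queries then become `send(...)` / `next().await` calls.
    Ok(Framed::new(stream, NativeCodec))
}
```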
A follow-up PR will use this for clickhouse keeper allocation.
Currently, the `instance-update` saga's `siu_commit_instance_updates` action attempts to check whether an additional update saga is needed and to start one if so. This is done in the same action that writes back the updated instance record. In [this comment][1] on PR #6503, @davepacheco pointed out that conflating these two operations makes the idempotency tests for the update saga less effective, since the action performs multiple operations (even if some of them are permitted to fail). This commit refactors the instance-update saga so that the chaining operation is performed in a separate action from the rest of the saga. There should be no functional change to the saga's behavior as a result of this, beyond making the idempotency tests more correct.

[1]: #6503 (comment)
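Schematically, the split looks something like this (the second action's name and the context/error types are placeholders, not the literal saga code):

```rust
// Hypothetical, simplified saga-action signatures illustrating the split.
struct SagaCtx;        // stand-in for the saga's action context
#[derive(Debug)]
struct ActionError;    // stand-in for the saga error type

/// After the refactor, this action only writes back the updated instance
/// record (and releases the updater lock); it no longer decides whether to
/// chain another update saga.
async fn siu_commit_instance_updates(_ctx: &SagaCtx) -> Result<(), ActionError> {
    // write the new instance runtime state here
    Ok(())
}

/// A separate, final action checks whether another update saga is needed and
/// starts it, so replaying either action in the idempotency tests exercises
/// exactly one operation. (Name is hypothetical.)
async fn siu_chain_successor_saga(_ctx: &SagaCtx) -> Result<(), ActionError> {
    // if the instance still needs an update, trigger a new instance-update saga
    Ok(())
}
```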
PR #6503 changed Nexus to attempt to automatically restart instances which are in the `Failed` state. Now that we do this, we should probably change the allowable instance state transitions to permit a user to stop an instance that is `Failed`, as a way to say "stop trying to restart this instance" (as `Stopped` instances are not restarted). This branch changes `Nexus::instance_request_state` and `select_instance_change_action` to permit stopping a `Failed` instance. Fixes #6640. I believe this also fixes #2825, along with #6455 (which allowed restarting `Failed` instances).
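Schematically (a simplified sketch, not the actual `select_instance_change_action` logic):

```rust
// Hypothetical, simplified version of the "may this instance be stopped?"
// check; only the Failed arm reflects the change described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InstanceState { Creating, Starting, Running, Stopping, Stopped, Rebooting, Failed }

fn stop_request_allowed(state: InstanceState) -> bool {
    match state {
        // States from which a stop request was already accepted (illustrative).
        InstanceState::Running | InstanceState::Stopping | InstanceState::Stopped => true,
        // New: a Failed instance may now be stopped, which takes it out of the
        // pool of instances the reincarnation task will try to restart
        // (Stopped instances are never auto-restarted).
        InstanceState::Failed => true,
        // Other states (illustrative): not stoppable.
        _ => false,
    }
}
```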
When an `instance-start` saga unwinds, any VMM it created transitions to the `SagaUnwound` state. This causes the instance's effective state to appear as `Failed` in the external API. PR #6503 added functionality to Nexus to automatically restart instances that are in the `Failed` state ("instance reincarnation"). However, the current instance-reincarnation task will _not_ automatically restart instances whose instance-start sagas have unwound, because such instances are not actually in the `Failed` state from Nexus' perspective. This PR implements reincarnation for instances whose `instance-start` sagas have failed. This is done by changing the `instance_reincarnation` background task to query the database for instances which have `SagaUnwound` active VMMs, and then run `instance-start` sagas for them identically to how it runs start sagas for `Failed` instances. I decided to perform two separate queries to list `Failed` instances and to list instances with `SagaUnwound` VMMs, because the `SagaUnwound` query requires a join with the `vmm` table, and I thought it was a bit nicer to be able to find `Failed` instances without having to do the join, and only do it when looking for `SagaUnwound` ones. Also, having two queries makes it easier to distinguish between `Failed` and `SagaUnwound` instances in logging and the OMDB status output. This ended up being implemented by adding a parameter to the `DataStore::find_reincarnatable_instances` method that indicates which category of instances to select; I had previously considered making the method on the `InstanceReincarnation` struct that finds instances and reincarnates them take the query as a `Fn` taking the datastore and `DataPageParams` and returning an `impl Future` outputting `Result<Vec<Instance>, ...>`, but figuring out generic lifetimes for the pagination stuff was annoying enough that this felt like the simpler choice. Fixes #6638
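The shape of the change, paraphrased (the enum name and the commented-out signature are approximations, not the exact code):

```rust
/// Which category of restartable instance a single query should look for.
/// Keeping the two cases separate avoids the `vmm`-table join when listing
/// plain `Failed` instances, and lets logs and OMDB output distinguish them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReincarnatableInstance {
    /// The instance record itself is in the `Failed` state.
    Failed,
    /// The instance's active VMM is `SagaUnwound` (its `instance-start` saga
    /// created the VMM record and then unwound).
    SagaUnwound,
}

// Approximate shape of the datastore method: the background task calls it
// once per category and runs instance-start sagas over both result sets
// identically.
//
// async fn find_reincarnatable_instances(
//     &self,
//     opctx: &OpContext,
//     which: ReincarnatableInstance,
//     pagparams: &DataPageParams<'_, Uuid>,
// ) -> ListResultVec<Instance>;
```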
This commit extends the `instance-reconfigure` API endpoint added in #6585 to also allow setting instance auto-restart policies (as added in #6503). I've also added the actual auto-restart policy to the external API instance view, along with the boolean `auto_restart_enabled` added in #6503. This way, it's possible to change just the boot disk by providing the current auto-restart policy in an instance POST.
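For illustration, the reconfigure body now looks roughly like this (type and field names are approximations of the external API types, based on the `all_failures` policy shown earlier in this thread):

```rust
use serde::{Deserialize, Serialize};

// Approximate sketch of the auto-restart policy as it appears in the external
// API (serialized as e.g. `{ "type": "all_failures" }`).
#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(rename_all = "snake_case", tag = "type")]
pub enum InstanceAutoRestartPolicy {
    /// Never automatically restart this instance.
    Never,
    /// Restart the instance on any failure.
    AllFailures,
}

/// Approximate shape of the instance-reconfigure body: callers now supply the
/// auto-restart policy alongside the boot disk, so "change only the boot
/// disk" means re-sending the current policy.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct InstanceUpdate {
    pub boot_disk: Option<String>,
    pub auto_restart_policy: Option<InstanceAutoRestartPolicy>,
}
```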
As of #6455, instances in the `Failed` state are allowed to be restarted. As described in RFD 486 § 4.3, Nexus should automatically restart such instances when the instance record's `auto_restart_policy` field (née `boot_on_fault`, see #6499) indicates that it is permitted to do so. This branch implements this behavior by adding a new `instance_reincarnation` RPW. This RPW consists of a background task which queries CRDB for instances in the `Failed` state which are eligible to be automatically restarted, and starts a new `instance-start` saga for any instances it discovers.

Instances are considered eligible for reincarnation if all of the following conditions are true:

- The instance is in the `InstanceState::Failed` state. Meaning, it must have no active VMM --- if the instance is in the `InstanceState::Vmm` state with an active VMM which is in `VmmState::Failed`, that indicates no instance-update saga has run to clean up the old VMM's resources. In this case, we must wait until the instance record itself has transitioned to `Failed` before attempting to restart it.
- The instance's `auto_restart_policy` permits it to be restarted. Presently, this means that only instances with the `InstanceAutoRestart::AllFailures` policy will be automatically restarted. The `InstanceAutoRestart::SledFailuresOnly` policy permits the instance to be restarted if and only if Nexus is able to definitively determine that the instance failed due to a fault that affected the entire sled, and, at present, we don't record this information when transitioning an instance to `Failed`. Therefore, we will not currently reincarnate such instances. Future work will provide mechanisms for Nexus to determine whether an instance has transitioned to `Failed` due to a sled-wide failure, permitting instances with the `SledFailuresOnly` policy to be reincarnated depending on why they were marked as `Failed`.

The implementation of the actual reincarnation background task is relatively straightforward: it runs a new `DataStore::find_reincarnatable_instances` query, which returns instances in the `Failed` state, and it spawns `instance-start` sagas for any instances which satisfy the above conditions. The restart cooldown is implemented by adding a new `time_last_auto_restarted` field to the `instance` table and to the `InstanceRuntimeState` type. We add a `Reason` enum to the `instance-start` saga's `Params` that indicates why the instance is being started, and if it is `Reason::AutoRestart`, then the start saga will set the `time_last_auto_restarted` timestamp when moving the instance record to the `Starting` state. Additionally, I've added the `auto_restart_policy` field to the `params::InstanceCreate` type in the Nexus API so that instances can be created with auto-restart policies. Reviewers should note that many of the files changed in this diff were only modified by adding that field to a `params::InstanceCreate` literal.

This branch adds a minimal, initial implementation of instance auto-restarting. There's plenty of follow-up work to do when this lands:

- Recording why an instance was marked `Failed`. In order to actually implement the `SledFailuresOnly` instance auto-restart policy, we need to...actually know when an instance was marked as `Failed` due to a sled-wide failure rather than an instance-specific failure. In some cases, such as when a VMM is marked as `Failed` due to a sled-agent returning a 404 `NO_SUCH_INSTANCE` error indicating that it is no longer aware of the VMM, determining this is fairly complex: was the instance forgotten because the sled-agent (or the entire sled) restarted, or was it forgotten because its individual VMM was restarted? But, in other cases, like when a VMM was marked `Failed` because it was on an `Expunged` sled (see "[nexus] Mark VMMs on `Expunged` sleds as `Failed`" #6519), the failure is clearly sled-wide. We should record why a VMM/instance was marked `Failed`, at least when it's possible to do so, so that the failure reason could be considered by the auto-restart policy.
- UI work: we may want to differentiate between instances that are in the `Starting` state because they were explicitly started by the user, and instances that are automatically restarting due to a failure, and we may also want to show the cooldown time when an instance is in the `Failed` state and is waiting to be restarted again. If @david-crespo or @charliepark want to chat about this, I'm happy to help out!

Fixes #5553
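To summarize the eligibility rules in code form, here's a hedged sketch with simplified types and field names (not the actual implementation):

```rust
use chrono::{DateTime, Duration, Utc};

// Simplified stand-ins for the real types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InstanceState { Vmm, NoVmm, Failed, Destroyed }

#[derive(Debug, Clone, Copy, PartialEq)]
enum InstanceAutoRestart { Never, SledFailuresOnly, AllFailures }

struct Instance {
    state: InstanceState,
    auto_restart_policy: InstanceAutoRestart,
    time_last_auto_restarted: Option<DateTime<Utc>>,
}

/// Sketch of the eligibility check: the instance itself must be Failed (not
/// merely pointing at a Failed VMM), its policy must allow restarts, and the
/// restart cooldown must have elapsed.
fn eligible_for_reincarnation(instance: &Instance, cooldown: Duration, now: DateTime<Utc>) -> bool {
    // The instance record itself must be Failed; if it is still in the Vmm
    // state pointing at a Failed VMM, an update saga hasn't cleaned up yet.
    if instance.state != InstanceState::Failed {
        return false;
    }
    // Only AllFailures is restarted today: SledFailuresOnly would require
    // knowing *why* the instance failed, which isn't recorded yet.
    if instance.auto_restart_policy != InstanceAutoRestart::AllFailures {
        return false;
    }
    // Respect the cooldown since the last automatic restart, if any.
    match instance.time_last_auto_restarted {
        Some(last) => now >= last + cooldown,
        None => true,
    }
}
```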