[sled-agent] Unbreak handling of Propolis error codes #6726
Conversation
Okay, it looks like this does work. If I have a running instance and restart its propolis-server out from under it, the instance gets moved to Failed:

eliza@theseus ~/Code/oxide/oxide.rs $ ssh londonswitch
The illumos Project helios-2.0.22923 September 2024
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (106.0.0)
ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
2433197b-fd8f-4655-9f49-19339b8a327a labrat running 7d64e906-a0ab-4662-9330-6de0eb691831 30a2fe41-4a2c-44cd-a1f4-7b17c5181371 BRM42220036
root@oxz_switch1:~# pilot host login BRM42220036
#####
## ##
## # ## ## ##
## # ## ## ## Oxide Computer Company
## # ## ### Engineering
## ## ## ##
##### ## ## Gimlet
BRM42220036 # /opt/oxide/oxlog/oxlog zones
global
oxz_cockroachdb_7f8b3453-e625-424b-8306-69507a55e1e4
oxz_crucible_07d6e3ee-b489-4698-b7c7-b2dfc456cd84
oxz_crucible_0974390c-b98c-4116-bca0-f716cda87587
oxz_crucible_1d3b3b88-feaa-4cf3-af31-627fcd1dfb06
oxz_crucible_31d9ddca-164d-49a2-80ee-f047cf7874e6
oxz_crucible_7471f78f-7b3d-4aaf-86ff-779c8644b335
oxz_crucible_8fac3aae-227a-4f62-a06d-e777a3fdbe54
oxz_crucible_c1aa0788-68cb-4083-95d6-cca9c15a8e12
oxz_crucible_f7d1b214-55ef-4db6-91ad-cdec7a367169
oxz_crucible_f8c20512-e9da-4329-8569-be0733a70217
oxz_crucible_pantry_99df66d0-bccc-4d44-b793-164036844faa
oxz_external_dns_a73b9c9c-fa12-453c-9546-0ddc96411674
oxz_internal_dns_63cb81fd-47fd-4065-9fc8-e0df2f85f396
oxz_ntp_a2995263-07d9-48d5-945d-ef5dbd58fcda
oxz_oximeter_2ab67cd8-2917-4191-8fcf-3e80de09328b
oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831
oxz_switch
BRM42220036 # zlogin oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831
[Connected to zone 'oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831' pts/5]
The illumos Project helios-2.0.22923 September 2024
root@oxz_propolis:~# svcs
STATE STIME FMRI
legacy_run 19:23:58 lrc:/etc/rc2_d/S20sysetup
legacy_run 19:23:58 lrc:/etc/rc2_d/S89PRESERVE
online 19:23:52 svc:/system/svc/restarter:default
online 19:23:53 svc:/system/early-manifest-import:default
online 19:23:53 svc:/system/filesystem/root:default
online 19:23:53 svc:/network/ip-interface-management:default
online 19:23:53 svc:/network/netcfg:default
online 19:23:53 svc:/network/datalink-management:default
online 19:23:53 svc:/system/boot-archive:default
online 19:23:54 svc:/network/loopback:default
online 19:23:54 svc:/system/filesystem/usr:default
online 19:23:54 svc:/system/device/local:default
online 19:23:54 svc:/network/physical:default
online 19:23:54 svc:/milestone/devices:default
online 19:23:54 svc:/network/ipsec/ipsecalgs:default
online 19:23:54 svc:/system/identity:node
online 19:23:54 svc:/system/device/audio:default
online 19:23:54 svc:/system/filesystem/minimal:default
online 19:23:54 svc:/network/ipsec/policy:default
online 19:23:54 svc:/system/rmtmpfiles:default
online 19:23:54 svc:/milestone/network:default
online 19:23:54 svc:/network/initial:default
online 19:23:54 svc:/network/iptun:default
online 19:23:54 svc:/system/hostid:default
online 19:23:54 svc:/network/netmask:default
online 19:23:54 svc:/system/auditset:default
online 19:23:54 svc:/system/logadm-upgrade:default
online 19:23:54 svc:/system/pfexec:default
online 19:23:54 svc:/system/pkgserv:default
online 19:23:54 svc:/system/fmd:default
online 19:23:56 svc:/system/rbac:default
online 19:23:57 svc:/network/service:default
online 19:23:57 svc:/system/cryptosvc:default
online 19:23:57 svc:/system/identity:domain
online 19:23:57 svc:/system/keymap:default
online 19:23:57 svc:/system/name-service-cache:default
online 19:23:57 svc:/milestone/name-services:default
online 19:23:57 svc:/system/manifest-import:default
online 19:23:57 svc:/milestone/single-user:default
online 19:23:57 svc:/system/coreadm:default
online 19:23:57 svc:/milestone/sysconfig:default
online 19:23:57 svc:/system/filesystem/local:default
online 19:23:57 svc:/system/utmp:default
online 19:23:58 svc:/system/system-log:default
online 19:23:58 svc:/system/cron:default
online 19:23:58 svc:/system/console-login:default
online 19:23:58 svc:/system/boot-archive-update:default
online 19:23:58 svc:/network/shares/group:default
online 19:23:58 svc:/milestone/multi-user:default
online 19:23:58 svc:/system/boot-config:default
online 19:23:58 svc:/network/shares/group:zfs
online 19:23:58 svc:/network/routing-setup:default
online 19:23:58 svc:/system/update-man-index:default
online 19:23:58 svc:/network/ssh:default
online 19:23:58 svc:/milestone/multi-user-server:default
online 19:23:58 svc:/system/illumos/propolis-server:default
online 19:24:08 svc:/network/routing/ndp:default
root@oxz_propolis:~# svcadm restart svc:/system/illumos/propolis-server:default
root@oxz_propolis:~#
[Connection to zone 'oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831' pts/5 closed]
BRM42220036 # exit
logout
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (106.0.0)
ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
2433197b-fd8f-4655-9f49-19339b8a327a labrat failed <no active Propolis> <not on any sled> -
root@oxz_switch1:~#

Meanwhile, in the sled-agent logs, we can see it observe the Propolis go away, and recognize the error code from the restarted Propolis as indicating VM-gone-ness:
And, if we pop back over to BRM42220036, we can see the defunct zone has been removed. So that seems good!

root@oxz_switch1:~# pilot host login BRM42220036
#####
## ##
## # ## ## ##
## # ## ## ## Oxide Computer Company
## # ## ### Engineering
## ## ## ##
##### ## ## Gimlet
BRM42220036 # zoneadm list
global
oxz_switch
oxz_internal_dns_63cb81fd-47fd-4065-9fc8-e0df2f85f396
oxz_ntp_a2995263-07d9-48d5-945d-ef5dbd58fcda
oxz_cockroachdb_7f8b3453-e625-424b-8306-69507a55e1e4
oxz_crucible_f8c20512-e9da-4329-8569-be0733a70217
oxz_crucible_0974390c-b98c-4116-bca0-f716cda87587
oxz_crucible_1d3b3b88-feaa-4cf3-af31-627fcd1dfb06
oxz_crucible_31d9ddca-164d-49a2-80ee-f047cf7874e6
oxz_crucible_7471f78f-7b3d-4aaf-86ff-779c8644b335
oxz_crucible_pantry_99df66d0-bccc-4d44-b793-164036844faa
oxz_crucible_8fac3aae-227a-4f62-a06d-e777a3fdbe54
oxz_crucible_c1aa0788-68cb-4083-95d6-cca9c15a8e12
oxz_crucible_f7d1b214-55ef-4db6-91ad-cdec7a367169
oxz_external_dns_a73b9c9c-fa12-453c-9546-0ddc96411674
oxz_crucible_07d6e3ee-b489-4698-b7c7-b2dfc456cd84
oxz_oximeter_2ab67cd8-2917-4191-8fcf-3e80de09328b
BRM42220036 #

(from the sled-agent logs it does look like it publishes the state generation 5 to nexus twice, which i ought to figure out...)

Will restart that and see if it's still busted.
sled-agent/src/instance.rs
Outdated
// TODO(eliza): this is kinda jank but Propolis
// currently could return the same error when the
// instance has not been ensured and when it's still
// coming up, which is sad. I'm not sure whether
// that happens IRL or not --- investigate this.
let reaction = if has_ever_seen_a_good_state {
This extra junk is because I noticed that Propolis' `instance-state-monitor` endpoint will presently return the `NoInstance` error on both `VmError::NotCreated` and `VmError::WaitingToInitialize`:
This gave me some anxiety about sled-agent potentially deciding that a Propolis has "died" while it's actually still waiting to be born, although I'm not sure whether we would actually see the `WaitingToInitialize` error there in practice or not. I think the proper thing to do would be to change that behavior in Propolis, but I wanted to get something working in sled-agent regardless.
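Below is a minimal sketch of the guard shown in the excerpt above; the `Reaction` enum and the helper function are hypothetical illustrations, not the actual sled-agent types. The idea is that a "VM gone" error code is only treated as fatal once the monitor has seen the VMM in a good state at least once, so a Propolis that is still initializing isn't mistaken for one that crashed.

```rust
/// Hypothetical reaction type, for illustration only.
enum Reaction {
    /// Mark the VMM as Failed and tear down its zone.
    MarkFailed,
    /// Keep polling; the VMM may still be coming up.
    KeepPolling,
}

/// Sketch of the guard: a "no instance"-style error code is only fatal if
/// we previously observed the VMM in a good state.
fn react_to_no_instance_error(has_ever_seen_a_good_state: bool) -> Reaction {
    if has_ever_seen_a_good_state {
        // We saw the VMM alive before, so the propolis-server must have
        // been restarted out from under us.
        Reaction::MarkFailed
    } else {
        // The instance-ensure may still be in flight; don't fail it yet.
        Reaction::KeepPolling
    }
}
```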
after actually looking in Propolis, it looks like the `Vm::state_watcher()` function will never actually return `VmError::WaitingToInitialize`; it will only return `VmError::NotCreated`:

https://github.com/oxidecomputer/propolis/blob/11371b0f3743f8df5b047dc0edc2699f4bdf3927/bin/propolis-server/src/lib/vm/mod.rs#L353-L364

So, hitting the state-monitor endpoint should just return a `Starting` VMM state in that case, which is good. I still don't like that the method can return any of the error types, though, even though it doesn't currently --- I'll go try to clean that up on the Propolis side.
Okay, oxidecomputer/propolis#771 will make the propolis side way less scary. But I think this code can also just be removed as propolis' current behavior is also correct.
sled-agent/src/instance.rs
Outdated
"error" => ?code, | ||
); | ||
self.terminate(true).await; | ||
self.publish_state_to_nexus().await; |
This is probably why we sent the failed state to nexus twice in my manual testing, so I think it can be removed.
Currently, the `Vm::get()` and `Vm::state_watcher()` methods return a `Result<{something}, VmError>`. This means that they can return _any_ `VmError` variant, but the current implementation of these methods will only ever return `VmError::NotCreated`.

This gave me a bit of anxiety when I noticed that we were presently handling the `VmError::WaitingForInit` variant identically to `NotCreated`, which seemed wrong, as it implied that we could potentially return an error code indicating that no instance has ever been ensured when, in fact, one _has_ been ensured but is still being initialized. I freaked out about this a bit here: oxidecomputer/omicron#6726 (comment)

It turns out the current behavior is actually fine, since if the VM is still initializing, we still allow accessing it and just return a `Starting` state, which is correct. But the type signatures of these functions allow them to return any of these errors and force their callers to handle those errors, and that's the part we were doing somewhat incorrectly (although, again, in practice it doesn't matter).

This commit changes those functions to return an `Option` rather than a `Result`. Now, we return `None` if no VM has ever been created, making that the _only_ error case that callers have to handle. There's no longer any risk of the HTTP API handlers accidentally returning a "VM never ensured" error when the VM is ensured-but-initializing. Also, we can remove the `NotCreated` variant, since it's now no longer returned anywhere.
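As a simplified sketch of the signature change described in that commit message (the types below are stand-ins, not the actual propolis-server ones): with an `Option` return, "no VM has ever been ensured" becomes the single case callers must handle, instead of a `Result` whose error type advertises variants that are never actually returned.

```rust
#[derive(Clone, Debug)]
enum VmState {
    Starting,
    Running,
}

struct Vm {
    /// `None` until an instance-ensure request has created a VM.
    state: Option<VmState>,
}

impl Vm {
    // Before (sketch): the signature permits any `VmError` variant, even
    // though only `NotCreated` was ever returned in practice:
    //
    //     fn get(&self) -> Result<VmState, VmError>
    //
    // After (sketch): `None` is the only "error" case callers handle.
    fn get(&self) -> Option<VmState> {
        self.state.clone()
    }
}

fn main() {
    let vm = Vm { state: None };
    match vm.get() {
        Some(state) => println!("VM state: {state:?}"),
        None => println!("no VM has ever been ensured"),
    }
}
```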
I'm going to redo my manual testing from #6726 (comment) with the latest changes from b59cfc1, but I'm pretty sure that it'll still work. Figured it was worth making sure, though.
I'd suggest @pfmooney take a look.
Perhaps something like oxidecomputer/dropshot#39 would better allow propolis to communicate actionable errors.
At present, sled-agent's handling of the error codes used by Propolis to indicate that it has crashed and been restarted is woefully incorrect. In particular, there are two cases where such an error may be encountered by a sled-agent:

1. When attempting to ask the VMM to change state (e.g. reboot or stop the instance)
2. When hitting Propolis' `instance-state-monitor` API endpoint to proactively check the VM's current state

Neither of these is handled correctly today.

In the first case, if a sled-agent operation on behalf of Nexus encounters a client error from Propolis, it will forward that error code to Nexus...but, _Nexus_ only moves instances to `Failed` in the face of sled-agent's `410 NO_SUCH_INSTANCE`, and _not_ [if it sees the `NoInstance` error code from Propolis][1], which means that the ghosts left behind by crashed and restarted Propolii still won't be moved to `Failed`. Agh. Furthermore, in that case, sled-agent itself will not perform the necessary cleanup actions to deal with the now-defunct Propolis zone (e.g. collecting a zone bundle and then destroying the zone).

In the second case, where we hit the instance-state-monitor endpoint and get back a `NoInstance` error, things are equally dire. The `InstanceMonitorRunner` task, which is responsible for polling the state monitor endpoint, will just bail out on receipt of any error from Propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L253-L289

We would _expect_ this to drop the send-side of the channel it uses to communicate with the `InstanceRunner`, closing the channel, and hitting this select, which would correctly mark the VMM as failed and do what we want, despite eating the actual error message from propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L388-L394

HOWEVER, as the comment points out, the `InstanceRunner` is _also_ holding its own clone of the channel sender, keeping us from ever hitting that case:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L308-L309

AFAICT, this means we'll just totally give up on monitoring the instance as soon as we hit any error here, which seems...quite bad. I *think* that this actually means that when a Propolis process crashes unexpectedly, we'll get an error when the TCP connection closes, bail out, and then _never try hitting the instance monitor endpoint ever again_[^1]. So, we won't notice that the Propolis is actually out to lunch until we try to ask it to change state. **This Is Not Great!**

This commit fixes both of these cases by making sled-agent actually handle Propolis' error codes correctly. I've added a dependency on the `propolis_api_types` crate, which is where the error code lives, and some code in sled-agent to attempt to parse these codes when receiving an error response from the Propolis client. Now, when we try to PUT a new state to the instance, we'll look at the error code that comes back, mark it as `Failed` if the error indicates that we should do so, and publish the `Failed` VMM state to Nexus before tearing down the zone. The `InstanceMonitorTask` now behaves similarly, and I've changed it to retry all other errors with a backoff, rather than just bailing out immediately on receipt of the first error.
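As a rough, self-contained sketch of the error-handling shape described above --- the enum, helper names, and the string matching are assumptions for illustration rather than the actual `propolis_api_types`/sled-agent APIs (only the `NoInstance` code comes from the discussion itself):

```rust
/// Stand-in for the error-code enum exported by `propolis_api_types`;
/// `NoInstance` is the code discussed above, `CreateFailed` is included
/// purely for illustration.
#[derive(Debug, PartialEq)]
enum PropolisErrorCode {
    NoInstance,
    CreateFailed,
}

/// Hypothetical helper: try to pull a known error code out of the body of
/// an error response from the Propolis client. The real code parses a
/// structured error; a bare string match keeps the sketch self-contained.
fn parse_error_code(body: &str) -> Option<PropolisErrorCode> {
    match body.trim() {
        "NoInstance" => Some(PropolisErrorCode::NoInstance),
        "CreateFailed" => Some(PropolisErrorCode::CreateFailed),
        _ => None,
    }
}

/// Decide whether a code means the VM is gone (e.g. the propolis-server
/// crashed and was restarted), in which case the VMM should be marked
/// Failed, published to Nexus, and its zone torn down.
fn should_mark_failed(code: &PropolisErrorCode) -> bool {
    matches!(code, PropolisErrorCode::NoInstance)
}

fn main() {
    let body = "NoInstance"; // e.g. from a 4xx after Propolis restarted
    if let Some(code) = parse_error_code(body) {
        if should_mark_failed(&code) {
            println!("marking VMM Failed and tearing down its zone");
        }
    }
}
```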
I've manually tested this on `london` as discussed here: #6726 (comment)

Unfortunately, it's hard to do any kind of automated testing of this with the current test situation, as none of this code is exercised by the simulated sled-agent.

Fixes #3209
Fixes #3238

[^1]: Although I've not actually verified that this is what happens.

[1]: https://github.com/oxidecomputer/propolis/blob/516dabe473cecdc3baea1db98a80441968c144cf/crates/propolis-api-types/src/lib.rs#L434-L441
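And a similarly hedged sketch of the monitor-task change (hypothetical names, synchronous for simplicity; the real `InstanceMonitorRunner` is async): poll the state endpoint in a loop, retry transient errors with a backoff, and only bail out when the error code says the VM is really gone.

```rust
use std::{thread, time::Duration};

/// Hypothetical error type, for illustration only.
#[derive(Debug)]
enum MonitorError {
    /// Propolis reports the VM is gone (e.g. the server was restarted).
    VmGone,
    /// Anything else: connection reset, transient 500, etc.
    Transient(String),
}

/// Sketch of the polling loop: transient errors are retried with an
/// exponential backoff; a fatal error is returned to the caller, which can
/// then mark the VMM as Failed and clean up the zone.
fn monitor_loop(
    mut poll_state: impl FnMut() -> Result<String, MonitorError>,
) -> MonitorError {
    let mut backoff = Duration::from_millis(250);
    let max_backoff = Duration::from_secs(10);
    loop {
        match poll_state() {
            Ok(state) => {
                println!("observed VMM state: {state}");
                backoff = Duration::from_millis(250); // reset on success
            }
            // Fatal: propagate so the runner can mark the VMM Failed.
            Err(MonitorError::VmGone) => return MonitorError::VmGone,
            // Transient: wait and try again rather than giving up.
            Err(MonitorError::Transient(msg)) => {
                eprintln!("transient monitor error, retrying: {msg}");
                thread::sleep(backoff);
                backoff = (backoff * 2).min(max_backoff);
            }
        }
    }
}

fn main() {
    // Simulate a few successful polls followed by a restarted Propolis.
    let mut polls = 0;
    let err = monitor_loop(|| {
        polls += 1;
        if polls < 3 {
            Ok("Running".to_string())
        } else {
            Err(MonitorError::VmGone)
        }
    });
    println!("monitor exited with fatal error: {err:?}");
}
```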
Okay, I've re-tested this after the Dogfood update to c50cf01, and it definitely still works --- if I restart a running instance's propolis-server, the instance comes back successfully.

In the switch zone, inspecting my newly created test instance:

root@oxz_switch0:~# omdb db instance info cb44ae27-befb-424a-b9f1-087ac9d0c898
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (107.0.0)
== INSTANCE ====================================================================
ID: cb44ae27-befb-424a-b9f1-087ac9d0c898
project ID: 50d839b4-2d18-49c0-94cf-7d6883e00178
name: test-instance-please-ignore
description: let's see what happens when propolis crashes
created at: 2024-10-02 21:10:26.781544 UTC
last modified at: 2024-10-02 21:10:26.781544 UTC
== CONFIGURATION ===============================================================
vCPUs: 2
memory: 8 GiB
hostname: test-instance-please-ignore
boot disk: None
auto-restart:
InstanceAutoRestart {
policy: None,
cooldown: None,
}
== RUNTIME STATE ===============================================================
nexus state: Vmm
(i) external API state: Running
last updated at: 2024-10-02T21:10:26.781544Z (generation 3)
needs reincarnation: false
karmic status: bound to saṃsāra
last reincarnated at: None
active VMM ID: Some(89437bf4-da6b-441d-bc9c-c9344b6893e3)
target VMM ID: None
migration ID: None
updater lock: UNLOCKED at generation: 1
active VMM record:
Vmm {
id: 89437bf4-da6b-441d-bc9c-c9344b6893e3,
time_created: 2024-10-02T21:10:30.927858Z,
time_deleted: None,
instance_id: cb44ae27-befb-424a-b9f1-087ac9d0c898,
sled_id: 5f6720b8-8a31-45f8-8c94-8e699218f28b,
propolis_ip: V6(
Ipv6Network {
addr: fd00:1122:3344:109::1:4f6,
prefix: 128,
},
),
propolis_port: SqlU16(
12400,
),
runtime: VmmRuntimeState {
time_state_updated: 2024-10-02T21:10:41.851556Z,
gen: Generation(
Generation(
4,
),
),
state: Running,
},
}
root@oxz_switch0:~#

log into the instance's sled and restart its propolis-server:

root@oxz_switch0:~# pilot host login BRM42220017
#####
## ##
## # ## ## ##
## # ## ## ## Oxide Computer Company
## # ## ### Engineering
## ## ## ##
##### ## ## Gimlet
BRM42220017 # zlogin oxz_propolis-server_89437bf4-da6b-441d-bc9c-c9344b6893e3
[Connected to zone 'oxz_propolis-server_89437bf4-da6b-441d-bc9c-c9344b6893e3' pts/3]
root@oxz_propolis:~# svcadm restart svc:/system/illumos/propolis-server:default
root@oxz_propolis:~#
[Connection to zone 'oxz_propolis-server_89437bf4-da6b-441d-bc9c-c9344b6893e3' pts/3 closed]
BRM42220017 # exit
logout
root@oxz_switch0:~#

instance is restarted successfully!

root@oxz_switch0:~# omdb db instance info cb44ae27-befb-424a-b9f1-087ac9d0c898
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (107.0.0)
== INSTANCE ====================================================================
ID: cb44ae27-befb-424a-b9f1-087ac9d0c898
project ID: 50d839b4-2d18-49c0-94cf-7d6883e00178
name: test-instance-please-ignore
description: let's see what happens when propolis crashes
created at: 2024-10-02 21:10:26.781544 UTC
last modified at: 2024-10-02 21:10:26.781544 UTC
== CONFIGURATION ===============================================================
vCPUs: 2
memory: 8 GiB
hostname: test-instance-please-ignore
boot disk: None
auto-restart:
InstanceAutoRestart {
policy: None,
cooldown: None,
}
== RUNTIME STATE ===============================================================
nexus state: Vmm
(i) external API state: Starting
last updated at: 2024-10-02T21:14:44.002990Z (generation 5)
needs reincarnation: false
/!\ karmic status: cooling down (TimeDelta { secs: 3589, nanos: 929383990 } remaining)
last reincarnated at: Some(2024-10-02T21:14:45.996494Z)
active VMM ID: Some(9111dfe7-0d55-4c6b-a72c-1f35ba9b5fc2)
target VMM ID: None
migration ID: None
updater lock: UNLOCKED at generation: 4
active VMM record:
Vmm {
id: 9111dfe7-0d55-4c6b-a72c-1f35ba9b5fc2,
time_created: 2024-10-02T21:14:45.918847Z,
time_deleted: None,
instance_id: cb44ae27-befb-424a-b9f1-087ac9d0c898,
sled_id: f15774c1-b8e5-434f-a493-ec43f96cba06,
propolis_ip: V6(
Ipv6Network {
addr: fd00:1122:3344:105::1:5d5,
prefix: 128,
},
),
propolis_port: SqlU16(
12400,
),
runtime: VmmRuntimeState {
time_state_updated: 2024-10-02T21:14:46.684417Z,
gen: Generation(
Generation(
2,
),
),
state: Starting,
},
}
root@oxz_switch0:~#