[sled-agent] Unbreak handling of Propolis error codes #6726
Conversation
Okay, it looks like this does work. If I have a running instance and restart its propolis-server out from under it, the instance gets moved to Failed:

eliza@theseus ~/Code/oxide/oxide.rs $ ssh londonswitch
The illumos Project helios-2.0.22923 September 2024
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (106.0.0)
ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
2433197b-fd8f-4655-9f49-19339b8a327a labrat running 7d64e906-a0ab-4662-9330-6de0eb691831 30a2fe41-4a2c-44cd-a1f4-7b17c5181371 BRM42220036
root@oxz_switch1:~# pilot host login BRM42220036
#####
## ##
## # ## ## ##
## # ## ## ## Oxide Computer Company
## # ## ### Engineering
## ## ## ##
##### ## ## Gimlet
BRM42220036 # /opt/oxide/oxlog/oxlog zones
global
oxz_cockroachdb_7f8b3453-e625-424b-8306-69507a55e1e4
oxz_crucible_07d6e3ee-b489-4698-b7c7-b2dfc456cd84
oxz_crucible_0974390c-b98c-4116-bca0-f716cda87587
oxz_crucible_1d3b3b88-feaa-4cf3-af31-627fcd1dfb06
oxz_crucible_31d9ddca-164d-49a2-80ee-f047cf7874e6
oxz_crucible_7471f78f-7b3d-4aaf-86ff-779c8644b335
oxz_crucible_8fac3aae-227a-4f62-a06d-e777a3fdbe54
oxz_crucible_c1aa0788-68cb-4083-95d6-cca9c15a8e12
oxz_crucible_f7d1b214-55ef-4db6-91ad-cdec7a367169
oxz_crucible_f8c20512-e9da-4329-8569-be0733a70217
oxz_crucible_pantry_99df66d0-bccc-4d44-b793-164036844faa
oxz_external_dns_a73b9c9c-fa12-453c-9546-0ddc96411674
oxz_internal_dns_63cb81fd-47fd-4065-9fc8-e0df2f85f396
oxz_ntp_a2995263-07d9-48d5-945d-ef5dbd58fcda
oxz_oximeter_2ab67cd8-2917-4191-8fcf-3e80de09328b
oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831
oxz_switch
BRM42220036 # zlogin oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831
[Connected to zone 'oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831' pts/5]
The illumos Project helios-2.0.22923 September 2024
root@oxz_propolis:~# svcs
STATE STIME FMRI
legacy_run 19:23:58 lrc:/etc/rc2_d/S20sysetup
legacy_run 19:23:58 lrc:/etc/rc2_d/S89PRESERVE
online 19:23:52 svc:/system/svc/restarter:default
online 19:23:53 svc:/system/early-manifest-import:default
online 19:23:53 svc:/system/filesystem/root:default
online 19:23:53 svc:/network/ip-interface-management:default
online 19:23:53 svc:/network/netcfg:default
online 19:23:53 svc:/network/datalink-management:default
online 19:23:53 svc:/system/boot-archive:default
online 19:23:54 svc:/network/loopback:default
online 19:23:54 svc:/system/filesystem/usr:default
online 19:23:54 svc:/system/device/local:default
online 19:23:54 svc:/network/physical:default
online 19:23:54 svc:/milestone/devices:default
online 19:23:54 svc:/network/ipsec/ipsecalgs:default
online 19:23:54 svc:/system/identity:node
online 19:23:54 svc:/system/device/audio:default
online 19:23:54 svc:/system/filesystem/minimal:default
online 19:23:54 svc:/network/ipsec/policy:default
online 19:23:54 svc:/system/rmtmpfiles:default
online 19:23:54 svc:/milestone/network:default
online 19:23:54 svc:/network/initial:default
online 19:23:54 svc:/network/iptun:default
online 19:23:54 svc:/system/hostid:default
online 19:23:54 svc:/network/netmask:default
online 19:23:54 svc:/system/auditset:default
online 19:23:54 svc:/system/logadm-upgrade:default
online 19:23:54 svc:/system/pfexec:default
online 19:23:54 svc:/system/pkgserv:default
online 19:23:54 svc:/system/fmd:default
online 19:23:56 svc:/system/rbac:default
online 19:23:57 svc:/network/service:default
online 19:23:57 svc:/system/cryptosvc:default
online 19:23:57 svc:/system/identity:domain
online 19:23:57 svc:/system/keymap:default
online 19:23:57 svc:/system/name-service-cache:default
online 19:23:57 svc:/milestone/name-services:default
online 19:23:57 svc:/system/manifest-import:default
online 19:23:57 svc:/milestone/single-user:default
online 19:23:57 svc:/system/coreadm:default
online 19:23:57 svc:/milestone/sysconfig:default
online 19:23:57 svc:/system/filesystem/local:default
online 19:23:57 svc:/system/utmp:default
online 19:23:58 svc:/system/system-log:default
online 19:23:58 svc:/system/cron:default
online 19:23:58 svc:/system/console-login:default
online 19:23:58 svc:/system/boot-archive-update:default
online 19:23:58 svc:/network/shares/group:default
online 19:23:58 svc:/milestone/multi-user:default
online 19:23:58 svc:/system/boot-config:default
online 19:23:58 svc:/network/shares/group:zfs
online 19:23:58 svc:/network/routing-setup:default
online 19:23:58 svc:/system/update-man-index:default
online 19:23:58 svc:/network/ssh:default
online 19:23:58 svc:/milestone/multi-user-server:default
online 19:23:58 svc:/system/illumos/propolis-server:default
online 19:24:08 svc:/network/routing/ndp:default
root@oxz_propolis:~# svcadm restart svc:/system/illumos/propolis-server:default
root@oxz_propolis:~#
[Connection to zone 'oxz_propolis-server_7d64e906-a0ab-4662-9330-6de0eb691831' pts/5 closed]
BRM42220036 # exit
logout
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (106.0.0)
ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
2433197b-fd8f-4655-9f49-19339b8a327a labrat failed <no active Propolis> <not on any sled> -
root@oxz_switch1:~#

Meanwhile, in the sled-agent logs, we can see it observe the Propolis go away, and recognize the error code from the restarted Propolis as indicating VM-gone-ness:
And, if we pop back over to BRM42220036, we can see the defunct zone has been removed. So that seems good!

root@oxz_switch1:~# pilot host login BRM42220036
#####
## ##
## # ## ## ##
## # ## ## ## Oxide Computer Company
## # ## ### Engineering
## ## ## ##
##### ## ## Gimlet
BRM42220036 # zoneadm list
global
oxz_switch
oxz_internal_dns_63cb81fd-47fd-4065-9fc8-e0df2f85f396
oxz_ntp_a2995263-07d9-48d5-945d-ef5dbd58fcda
oxz_cockroachdb_7f8b3453-e625-424b-8306-69507a55e1e4
oxz_crucible_f8c20512-e9da-4329-8569-be0733a70217
oxz_crucible_0974390c-b98c-4116-bca0-f716cda87587
oxz_crucible_1d3b3b88-feaa-4cf3-af31-627fcd1dfb06
oxz_crucible_31d9ddca-164d-49a2-80ee-f047cf7874e6
oxz_crucible_7471f78f-7b3d-4aaf-86ff-779c8644b335
oxz_crucible_pantry_99df66d0-bccc-4d44-b793-164036844faa
oxz_crucible_8fac3aae-227a-4f62-a06d-e777a3fdbe54
oxz_crucible_c1aa0788-68cb-4083-95d6-cca9c15a8e12
oxz_crucible_f7d1b214-55ef-4db6-91ad-cdec7a367169
oxz_external_dns_a73b9c9c-fa12-453c-9546-0ddc96411674
oxz_crucible_07d6e3ee-b489-4698-b7c7-b2dfc456cd84
oxz_oximeter_2ab67cd8-2917-4191-8fcf-3e80de09328b
BRM42220036 #

(from the sled-agent logs it does look like it publishes the state generation 5 to nexus twice, which i ought to figure out...)

Will restart that and see if it's still busted.
sled-agent/src/instance.rs
Outdated
// TODO(eliza): this is kinda jank but Propolis
// currently could return the same error when the
// instance has not been ensured and when it's still
// coming up, which is sad. I'm not sure whether
// that happens IRL or not --- investigate this.
let reaction = if has_ever_seen_a_good_state {
This extra junk is because I noticed that Propolis' `instance-state-monitor` endpoint will presently return the `NoInstance` error on both `VmError::NotCreated` and `VmError::WaitingToInitialize`:
This gave me some anxiety about sled-agent potentially deciding that a Propolis has "died" while it's actually still waiting to be born, although I'm not sure whether we would actually see the `WaitingToInitialize` error there in practice or not. I think the proper thing to do would be to change that behavior in Propolis, but I wanted to get something working in sled-agent regardless.
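Below is a minimal sketch of the guard shown in the excerpt above; the `Reaction` enum and the helper function are hypothetical illustrations, not the actual sled-agent types. The idea is that a "VM gone" error code is only treated as fatal once the monitor has seen the VMM in a good state at least once, so a Propolis that is still initializing isn't mistaken for one that crashed.

```rust
/// Hypothetical reaction type, for illustration only.
enum Reaction {
    /// Mark the VMM as Failed and tear down its zone.
    MarkFailed,
    /// Keep polling; the VMM may still be coming up.
    KeepPolling,
}

/// Sketch of the guard: a "no instance"-style error code is only fatal if
/// we previously observed the VMM in a good state.
fn react_to_no_instance_error(has_ever_seen_a_good_state: bool) -> Reaction {
    if has_ever_seen_a_good_state {
        // We saw the VMM alive before, so the propolis-server must have
        // been restarted out from under us.
        Reaction::MarkFailed
    } else {
        // The instance-ensure may still be in flight; don't fail it yet.
        Reaction::KeepPolling
    }
}
```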
after actually looking in Propolis, it looks like the `Vm::state_watcher()` function will never actually return `VmError::WaitingToInitialize`; it will only return `VmError::NotCreated`:

https://github.com/oxidecomputer/propolis/blob/11371b0f3743f8df5b047dc0edc2699f4bdf3927/bin/propolis-server/src/lib/vm/mod.rs#L353-L364

So, hitting the state-monitor endpoint should just return a `Starting` VMM state in that case, which is good. I still don't like that the method can return any of the error types, though, even though it doesn't currently --- I'll go try to clean that up on the Propolis side.
Okay, oxidecomputer/propolis#771 will make the propolis side way less scary. But I think this code can also just be removed as propolis' current behavior is also correct.
sled-agent/src/instance.rs
Outdated
"error" => ?code, | ||
); | ||
self.terminate(true).await; | ||
self.publish_state_to_nexus().await; |
This is probably why we sent the failed state to nexus twice in my manual testing, so I think it can be removed.
Currently, the `Vm::get()` and `Vm::state_watcher()` methods return a `Result<{something}, VmError>`. This means that they can return _any_ `VmError` variant, but the current implementation of these methods will only ever return `VmError::NotCreated`.

This gave me a bit of anxiety when I noticed that we were presently handling the `VmError::WaitingForInit` variant identically to `NotCreated`, which seemed wrong, as it implied that we could potentially return an error code indicating that no instance has ever been ensured when, in fact, one _has_ been ensured but is still being initialized. I freaked out about this a bit here: oxidecomputer/omicron#6726 (comment)

It turns out the current behavior is actually fine, since if the VM is still initializing, we still allow accessing it and just return a `Starting` state, which is correct. But the type signatures of these functions allow them to return any of these errors and force their callers to handle those errors, and that's the part we were doing somewhat incorrectly (although, again, in practice it doesn't matter).

This commit changes those functions to return an `Option` rather than a `Result`. Now, we return `None` if no VM has ever been created, making that the _only_ error case that callers have to handle. There's no longer any risk of the HTTP API handlers accidentally returning a "VM never ensured" error when the VM is ensured-but-initializing. Also, we can remove the `NotCreated` variant, since it's now no longer returned anywhere.
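As a simplified sketch of the signature change described in that commit message (the types below are stand-ins, not the actual propolis-server ones): with an `Option` return, "no VM has ever been ensured" becomes the single case callers must handle, instead of a `Result` whose error type advertises variants that are never actually returned.

```rust
#[derive(Clone, Debug)]
enum VmState {
    Starting,
    Running,
}

struct Vm {
    /// `None` until an instance-ensure request has created a VM.
    state: Option<VmState>,
}

impl Vm {
    // Before (sketch): the signature permits any `VmError` variant, even
    // though only `NotCreated` was ever returned in practice:
    //
    //     fn get(&self) -> Result<VmState, VmError>
    //
    // After (sketch): `None` is the only "error" case callers handle.
    fn get(&self) -> Option<VmState> {
        self.state.clone()
    }
}

fn main() {
    let vm = Vm { state: None };
    match vm.get() {
        Some(state) => println!("VM state: {state:?}"),
        None => println!("no VM has ever been ensured"),
    }
}
```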
I'm going to redo my manual testing from #6726 (comment) with the latest changes from b59cfc1, but I'm pretty sure that it'll still work. Figured it was worth making sure, though.
I'd suggest @pfmooney take a look.
Perhaps something like oxidecomputer/dropshot#39 would better allow propolis to communicate actionable errors.
At present, sled-agent's handling of the error codes used by Propolis to indicate that it has crashed and been restarted is woefully incorrect. In particular, there are two cases where such an error may be encountered by a sled-agent:

1. When attempting to ask the VMM to change state (e.g. reboot or stop the instance)
2. When hitting Propolis' `instance-state-monitor` API endpoint to proactively check the VM's current state

Neither of these is handled correctly today.

In the first case, if a sled-agent operation on behalf of Nexus encounters a client error from Propolis, it will forward that error code to Nexus...but, _Nexus_ only moves instances to `Failed` in the face of sled-agent's `410 NO_SUCH_INSTANCE`, and _not_ [if it sees the `NoInstance` error code from Propolis][1], which means that the ghosts left behind by crashed and restarted Propolii still won't be moved to `Failed`. Agh. Furthermore, in that case, sled-agent itself will not perform the necessary cleanup actions to deal with the now-defunct Propolis zone (e.g. collecting a zone bundle and then destroying the zone).

In the second case, where we hit the instance-state-monitor endpoint and get back a `NoInstance` error, things are equally dire. The `InstanceMonitorRunner` task, which is responsible for polling the state monitor endpoint, will just bail out on receipt of any error from Propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L253-L289

We would _expect_ this to drop the send-side of the channel it uses to communicate with the `InstanceRunner`, closing the channel, and hitting this select, which would correctly mark the VMM as failed and do what we want, despite eating the actual error message from propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L388-L394

HOWEVER, as the comment points out, the `InstanceRunner` is _also_ holding its own clone of the channel sender, keeping us from ever hitting that case:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L308-L309

AFAICT, this means we'll just totally give up on monitoring the instance as soon as we hit any error here, which seems...quite bad. I *think* that this actually means that when a Propolis process crashes unexpectedly, we'll get an error when the TCP connection closes, bail out, and then _never try hitting the instance monitor endpoint ever again_[^1]. So, we won't notice that the Propolis is actually out to lunch until we try to ask it to change state. **This Is Not Great!**

This commit fixes both of these cases by making sled-agent actually handle Propolis' error codes correctly. I've added a dependency on the `propolis_api_types` crate, which is where the error code lives, and some code in sled-agent to attempt to parse these codes when receiving an error response from the Propolis client. Now, when we try to PUT a new state to the instance, we'll look at the error code that comes back, mark it as `Failed` if the error indicates that we should do so, and publish the `Failed` VMM state to Nexus before tearing down the zone. The `InstanceMonitorTask` now behaves similarly, and I've changed it to retry all other errors with a backoff, rather than just bailing out immediately on receipt of the first error.
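As a rough, self-contained sketch of the error-handling shape described above --- the enum, helper names, and the string matching are assumptions for illustration rather than the actual `propolis_api_types`/sled-agent APIs (only the `NoInstance` code comes from the discussion itself):

```rust
/// Stand-in for the error-code enum exported by `propolis_api_types`;
/// `NoInstance` is the code discussed above, `CreateFailed` is included
/// purely for illustration.
#[derive(Debug, PartialEq)]
enum PropolisErrorCode {
    NoInstance,
    CreateFailed,
}

/// Hypothetical helper: try to pull a known error code out of the body of
/// an error response from the Propolis client. The real code parses a
/// structured error; a bare string match keeps the sketch self-contained.
fn parse_error_code(body: &str) -> Option<PropolisErrorCode> {
    match body.trim() {
        "NoInstance" => Some(PropolisErrorCode::NoInstance),
        "CreateFailed" => Some(PropolisErrorCode::CreateFailed),
        _ => None,
    }
}

/// Decide whether a code means the VM is gone (e.g. the propolis-server
/// crashed and was restarted), in which case the VMM should be marked
/// Failed, published to Nexus, and its zone torn down.
fn should_mark_failed(code: &PropolisErrorCode) -> bool {
    matches!(code, PropolisErrorCode::NoInstance)
}

fn main() {
    let body = "NoInstance"; // e.g. from a 4xx after Propolis restarted
    if let Some(code) = parse_error_code(body) {
        if should_mark_failed(&code) {
            println!("marking VMM Failed and tearing down its zone");
        }
    }
}
```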
I've manually tested this on `london` as discussed here: #6726 (comment)

Unfortunately, it's hard to do any kind of automated testing of this with the current test situation, as none of this code is exercised by the simulated sled-agent.

Fixes #3209
Fixes #3238

[^1]: Although I've not actually verified that this is what happens.

[1]: https://github.com/oxidecomputer/propolis/blob/516dabe473cecdc3baea1db98a80441968c144cf/crates/propolis-api-types/src/lib.rs#L434-L441
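And a similarly hedged sketch of the monitor-task change (hypothetical names, synchronous for simplicity; the real `InstanceMonitorRunner` is async): poll the state endpoint in a loop, retry transient errors with a backoff, and only bail out when the error code says the VM is really gone.

```rust
use std::{thread, time::Duration};

/// Hypothetical error type, for illustration only.
#[derive(Debug)]
enum MonitorError {
    /// Propolis reports the VM is gone (e.g. the server was restarted).
    VmGone,
    /// Anything else: connection reset, transient 500, etc.
    Transient(String),
}

/// Sketch of the polling loop: transient errors are retried with an
/// exponential backoff; a fatal error is returned to the caller, which can
/// then mark the VMM as Failed and clean up the zone.
fn monitor_loop(
    mut poll_state: impl FnMut() -> Result<String, MonitorError>,
) -> MonitorError {
    let mut backoff = Duration::from_millis(250);
    let max_backoff = Duration::from_secs(10);
    loop {
        match poll_state() {
            Ok(state) => {
                println!("observed VMM state: {state}");
                backoff = Duration::from_millis(250); // reset on success
            }
            // Fatal: propagate so the runner can mark the VMM Failed.
            Err(MonitorError::VmGone) => return MonitorError::VmGone,
            // Transient: wait and try again rather than giving up.
            Err(MonitorError::Transient(msg)) => {
                eprintln!("transient monitor error, retrying: {msg}");
                thread::sleep(backoff);
                backoff = (backoff * 2).min(max_backoff);
            }
        }
    }
}

fn main() {
    // Simulate a few successful polls followed by a restarted Propolis.
    let mut polls = 0;
    let err = monitor_loop(|| {
        polls += 1;
        if polls < 3 {
            Ok("Running".to_string())
        } else {
            Err(MonitorError::VmGone)
        }
    });
    println!("monitor exited with fatal error: {err:?}");
}
```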
Okay, I've re-tested this after the Dogfood update to c50cf01, and it definitely still works --- if I restart a running instance's propolis-server, the instance comes back successfully.

In the switch zone, inspecting my newly created test instance:

root@oxz_switch0:~# omdb db instance info cb44ae27-befb-424a-b9f1-087ac9d0c898
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (107.0.0)
== INSTANCE ====================================================================
ID: cb44ae27-befb-424a-b9f1-087ac9d0c898
project ID: 50d839b4-2d18-49c0-94cf-7d6883e00178
name: test-instance-please-ignore
description: let's see what happens when propolis crashes
created at: 2024-10-02 21:10:26.781544 UTC
last modified at: 2024-10-02 21:10:26.781544 UTC
== CONFIGURATION ===============================================================
vCPUs: 2
memory: 8 GiB
hostname: test-instance-please-ignore
boot disk: None
auto-restart:
InstanceAutoRestart {
policy: None,
cooldown: None,
}
== RUNTIME STATE ===============================================================
nexus state: Vmm
(i) external API state: Running
last updated at: 2024-10-02T21:10:26.781544Z (generation 3)
needs reincarnation: false
karmic status: bound to saṃsāra
last reincarnated at: None
active VMM ID: Some(89437bf4-da6b-441d-bc9c-c9344b6893e3)
target VMM ID: None
migration ID: None
updater lock: UNLOCKED at generation: 1
active VMM record:
Vmm {
id: 89437bf4-da6b-441d-bc9c-c9344b6893e3,
time_created: 2024-10-02T21:10:30.927858Z,
time_deleted: None,
instance_id: cb44ae27-befb-424a-b9f1-087ac9d0c898,
sled_id: 5f6720b8-8a31-45f8-8c94-8e699218f28b,
propolis_ip: V6(
Ipv6Network {
addr: fd00:1122:3344:109::1:4f6,
prefix: 128,
},
),
propolis_port: SqlU16(
12400,
),
runtime: VmmRuntimeState {
time_state_updated: 2024-10-02T21:10:41.851556Z,
gen: Generation(
Generation(
4,
),
),
state: Running,
},
}
root@oxz_switch0:~#

log into the instance's sled and restart its propolis-server:

root@oxz_switch0:~# pilot host login BRM42220017
#####
## ##
## # ## ## ##
## # ## ## ## Oxide Computer Company
## # ## ### Engineering
## ## ## ##
##### ## ## Gimlet
BRM42220017 # zlogin oxz_propolis-server_89437bf4-da6b-441d-bc9c-c9344b6893e3
[Connected to zone 'oxz_propolis-server_89437bf4-da6b-441d-bc9c-c9344b6893e3' pts/3]
root@oxz_propolis:~# svcadm restart svc:/system/illumos/propolis-server:default
root@oxz_propolis:~#
[Connection to zone 'oxz_propolis-server_89437bf4-da6b-441d-bc9c-c9344b6893e3' pts/3 closed]
BRM42220017 # exit
logout
root@oxz_switch0:~#

instance is restarted successfully!

root@oxz_switch0:~# omdb db instance info cb44ae27-befb-424a-b9f1-087ac9d0c898
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (107.0.0)
== INSTANCE ====================================================================
ID: cb44ae27-befb-424a-b9f1-087ac9d0c898
project ID: 50d839b4-2d18-49c0-94cf-7d6883e00178
name: test-instance-please-ignore
description: let's see what happens when propolis crashes
created at: 2024-10-02 21:10:26.781544 UTC
last modified at: 2024-10-02 21:10:26.781544 UTC
== CONFIGURATION ===============================================================
vCPUs: 2
memory: 8 GiB
hostname: test-instance-please-ignore
boot disk: None
auto-restart:
InstanceAutoRestart {
policy: None,
cooldown: None,
}
== RUNTIME STATE ===============================================================
nexus state: Vmm
(i) external API state: Starting
last updated at: 2024-10-02T21:14:44.002990Z (generation 5)
needs reincarnation: false
/!\ karmic status: cooling down (TimeDelta { secs: 3589, nanos: 929383990 } remaining)
last reincarnated at: Some(2024-10-02T21:14:45.996494Z)
active VMM ID: Some(9111dfe7-0d55-4c6b-a72c-1f35ba9b5fc2)
target VMM ID: None
migration ID: None
updater lock: UNLOCKED at generation: 4
active VMM record:
Vmm {
id: 9111dfe7-0d55-4c6b-a72c-1f35ba9b5fc2,
time_created: 2024-10-02T21:14:45.918847Z,
time_deleted: None,
instance_id: cb44ae27-befb-424a-b9f1-087ac9d0c898,
sled_id: f15774c1-b8e5-434f-a493-ec43f96cba06,
propolis_ip: V6(
Ipv6Network {
addr: fd00:1122:3344:105::1:5d5,
prefix: 128,
},
),
propolis_port: SqlU16(
12400,
),
runtime: VmmRuntimeState {
time_state_updated: 2024-10-02T21:14:46.684417Z,
gen: Generation(
Generation(
2,
),
),
state: Starting,
},
}
root@oxz_switch0:~#