[nexus] handle sled-agent errors as described in RFD 486 #6455
Conversation
Force-pushed 997f385 to d1e7165
Force-pushed 0b61ff4 to b811827
nexus/src/app/instance.rs (outdated review comment)
```rust
match value.0 {
    // Errors communicating with the sled-agent should return 502 Bad
    // Gateway to the caller.
    progenitor_client::Error::CommunicationError(e) => {
        // N.B.: it sure *seems* like we should be using the
        // `HttpError::for_status` constructor here, but that
        // constructor secretly calls the `for_client_error`
        // constructor, which panics if we pass a status code that isn't
        // a 4xx error. So, instead, we construct an internal error and
        // then munge its status code.
        let mut error = HttpError::for_internal_error(e.to_string());
        error.status_code = if e.is_timeout() {
            // This isn't actually in the RFD, but I thought it might be
            // nice to return 504 Gateway Timeout when the communication
            // error is a timeout...
            http::StatusCode::GATEWAY_TIMEOUT
        } else {
            http::StatusCode::BAD_GATEWAY
        };
        error
    }
    // Invalid request errors from the sled-agent should return 500
    // Internal Server Errors.
    progenitor_client::Error::InvalidRequest(s) => {
        HttpError::for_internal_error(s)
    }
    e => HttpError::for_internal_error(e.to_string()),
}
```
@gjcolombo do you think it's worth adding an external message to any of these errors? I was thinking that if `self.instance_unhealthy()` is `true`, it might be nice to return an error with a message like "instance failure detected" (and perhaps that should be a 503 Service Unavailable rather than 500?), and maybe the 502s and 504s should say "error communicating with sled-agent" or something? I dunno...

Noticed that I had forgotten to make instances with [...]. Edit: done in 8c43ba1
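For context, here is a minimal sketch of what the suggestion in the question above might look like, following the same pattern as the snippet earlier in this thread. This is hypothetical, not what the PR does: the "instance failure detected" message and the 503 are just the ideas floated in the comment, and the `external_message` field name is assumed from dropshot's `HttpError`.

```rust
// Hypothetical sketch only: attach a user-facing message and a 503 status,
// reusing the construct-then-munge pattern from the snippet above.
fn instance_unhealthy_error(internal: String) -> dropshot::HttpError {
    let mut error = dropshot::HttpError::for_internal_error(internal);
    // Surface a friendlier message to the API caller...
    error.external_message = "instance failure detected".to_string();
    // ...and report 503 Service Unavailable rather than 500, as suggested.
    error.status_code = http::StatusCode::SERVICE_UNAVAILABLE;
    error
}
```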
Great stuff here--this was much more straightforward than I had anticipated when I wrote the RFD! I have a few nits and small fixes to suggest but nothing blocking.
@gjcolombo okay, I think I've addressed all your suggestions!
New changes LGTM!
Now, it just moves the `vmm` record to `Failed`, and triggers an update saga, which will (eventually) advance the instance record to failed. But I still need to do that part :) Also, some error type munging, which I don't actually need yet.
This commit introduces a new `VmmState` variant, `Creating`. The purpose of this variant is to differentiate VMMs which have been created in the database but do not yet exist on a sled from those which are known to sled-agent and are actually starting/migrating. This is necessary because, presently, the `instance-watcher` background task will attempt to perform health checks for `Starting` and `Migrating` VMMs, and --- if the sled-agent does not actually know about the VMM yet --- will move it to `Failed`. But when a VMM is newly created, the sled-agent won't know about it until it has been registered, so if the `instance-watcher` task activates while an `instance-start` or `instance-migrate` saga is in flight, it may incorrectly move the VMM to `Failed` even though the VMM record has simply been created but not yet registered. Now, after adding the `Creating` variant, we can avoid performing health checks for those VMMs until the sled-agent actually acknowledges a request to register the VMM.
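To illustrate the intent, here is a small self-contained sketch. The enum is simplified (the real `VmmState` has more variants) and `eligible_for_health_check` is a hypothetical helper, not the actual omicron code; it just shows how the `Creating` variant lets the `instance-watcher` task skip VMMs that sled-agent hasn't acknowledged yet.

```rust
/// Simplified sketch of the VMM state machine; the real enum has more variants.
#[derive(Clone, Copy, PartialEq, Eq)]
enum VmmState {
    /// Record exists in CRDB, but sled-agent hasn't acknowledged registration yet.
    Creating,
    Starting,
    Running,
    Migrating,
    Failed,
}

impl VmmState {
    /// Hypothetical helper: only VMMs the sled-agent should already know about
    /// are eligible for `instance-watcher` health checks.
    fn eligible_for_health_check(self) -> bool {
        !matches!(self, VmmState::Creating)
    }
}

fn main() {
    assert!(!VmmState::Creating.eligible_for_health_check());
    assert!(VmmState::Starting.eligible_for_health_check());
}
```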
Force-pushed 01454f5 to 1b3fcad
I've manually verified this change on [...]. On the switch zone, I run `omdb db instances`:

```
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running 7f62223f-8d31-419e-a4e2-1cff197f9879 4d86e152-aa9b-4bb5-a1ea-9b4161cbb81e BRM42220007
```

Then, I power-cycle that sled. Note that I didn't think long enough to realize that the sled I was about to power-cycle was the scrimlet I was currently connected to, so I then had to wait for it to come back... 😅

```
root@oxz_switch1:~# pilot sp cycle BRM42220007
16 BRM42220007 client_loop: send disconnect: Broken pipe

eliza@jeeves ~ $ ssh madridswitch
(root@fe80::aa40:25ff:fe05:602%madrid_sw1tp0) Password:
eliza@jeeves ~ $ /staff/lab/madrid/enable-tp-announce.sh
+ ssh madridgz bash
+ mkdir -p /usr/share/compliance
+ pilot gimlet info -j
+ cat
+ zlogin oxz_switch bash /tmp/enable-tp-announce.sh
+ sed -i 's/^PASSREQ/#&/' /etc/default/login
+ sed -i /PermitRoot/s/no/yes/ /etc/ssh/sshd_config
+ sed -i /PermitEmpty/s/no/yes/ /etc/ssh/sshd_config
+ sed -i /PasswordAuth/s/no/yes/ /etc/ssh/sshd_config
+ sed -i '/AllowUsers/s/$/ root/' /etc/ssh/sshd_config
+ passwd -d root
passwd: password information changed for root
+ svcadm restart ssh
+ nohup pilot techport announce techport0 techport1
eliza@jeeves ~ $ ssh madridswitch
The illumos Project     helios-2.0.22845        August 2024
```

Now that the scrimlet is back up, I run `omdb db instances`:

```
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance failed <no active Propolis> <not on any sled> -
root@oxz_switch1:~#
```

Back on my laptop, I restart the failed instance:

```
eliza@theseus ~/Code/oxide/oxide.rs $ oxide instance start --instance 8cb8e347-99bf-4769-a5db-022a14cb0fed
{
  "description": "",
  "hostname": "my-instance",
  "id": "8cb8e347-99bf-4769-a5db-022a14cb0fed",
  "memory": 8589934592,
  "name": "my-instance",
  "ncpus": 2,
  "project_id": "b432681c-fadc-479c-9b45-40043cceab25",
  "run_state": "starting",
  "time_created": "2024-09-01T21:19:52.057567Z",
  "time_modified": "2024-09-01T21:19:52.057567Z",
  "time_run_state_updated": "2024-09-01T21:33:07.391957Z"
}
```

Finally, back on the scrimlet, we can see the instance has been restarted successfully, and is now on a different sled.

```
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running f10379bd-cb58-4b21-882a-d0916a31a7f8 8e47468a-fee0-4a1d-ae4c-1552a7b873b0 BRM42220004
root@oxz_switch1:~#
```
Power-cycling the sled the instance is now running on:

```
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running f10379bd-cb58-4b21-882a-d0916a31a7f8 8e47468a-fee0-4a1d-ae4c-1552a7b873b0 BRM42220004
root@oxz_switch1:~# pilot sp cycle BRM42220004
17 BRM42220004 ok
root@oxz_switch1:~# pilot sp status BRM42220004
17 BRM42220004 on (A0)
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running f10379bd-cb58-4b21-882a-d0916a31a7f8 8e47468a-fee0-4a1d-ae4c-1552a7b873b0 BRM42220004
root@oxz_switch1:~# omdb db instances
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance failed <no active Propolis> <not on any sled> -
root@oxz_switch1:~#
```
As of #6455, instances in the `Failed` state are allowed to be restarted. As described in [RFD 486 § 4.3], Nexus should automatically restart such instances when the instance record's `auto_restart_policy` field (née `boot_on_fault`, see #6499) indicates that it is permitted to do so. This branch implements this behavior by adding a new `instance_reincarnation` RPW. This RPW consists of a background task which queries CRDB for instances in the `Failed` state which are eligible to be automatically restarted, and starts a new `instance-start` saga for any instances it discovers.

Instances are considered eligible for reincarnation if all of the following conditions are true:

- **The instance is in the `InstanceState::Failed` state.** Meaning, it must have *no active VMM*. If the instance is in the `InstanceState::Vmm` state with an active VMM which is in `VmmState::Failed`, that indicates no instance-update saga has run to clean up the old VMM's resources. In this case, we must wait until the instance record itself has transitioned to `Failed` before attempting to restart it.
- **The instance's `auto_restart_policy` permits it to be restarted.** The `auto_restart_policy` enum is changed from representing the classes of failures on which an instance may be restarted (`never`, `sled_failures_only`, and `all_failures`) to a more general quality of service for automatic restarts. Presently, this can either be `never`, meaning that the instance can never be restarted automatically, or `best_effort`, meaning that the control plane should try to keep the instance running but is permitted to not restart it if necessary to preserve the stability of the whole system. The default policy for instances which don't provide one is `best_effort`.
- **A cooldown period has elapsed since the instance's last automatic restart.** In order to prevent instances which fail frequently from compromising the reliability of the system as a whole, we enforce a cooldown period between automatic restarts. This is tracked by setting a last-auto-restart timestamp in the `instance` record. At present, the cooldown period is always one hour unless the instance record overrides it, which currently can only be done using a test-only Nexus method. In the future, we may consider allowing the cooldown period to be configured by users, but because it presents a denial-of-service risk, setting it should probably be a more privileged operation than creating an instance.

The implementation of the actual reincarnation background task is relatively straightforward: it runs a new `DataStore::find_reincarnatable_instances` query, which returns instances in the `Failed` state, and it spawns `instance-start` sagas for any instances which satisfy the above conditions (a rough sketch of this check follows this description). The restart cooldown is implemented by adding a new `time_last_auto_restarted` field to the `instance` table and to the `InstanceRuntimeState` type. We add a `Reason` enum to the `instance-start` saga's `Params` that indicates why the instance is being started; if it is `Reason::AutoRestart`, the start saga will set the `time_last_auto_restarted` timestamp when moving the instance record to the `Starting` state.

Additionally, I've added the `auto_restart_policy` field to the `params::InstanceCreate` type in the Nexus API so that instances can be created with auto-restart policies. Reviewers should note that many of the files changed in this diff were only modified by adding that field to a `params::InstanceCreate` literal.

This branch adds a minimal, initial implementation of instance auto-restarting.

[RFD 486 § 4.3]: https://rfd.shared.oxide.computer/rfd/486#_instances

Fixes #5553
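As a rough illustration of the three eligibility conditions above, here is a hedged, self-contained sketch. All names here are simplified stand-ins, not the actual datastore query or model types; per the description above, the real task gets candidate `Failed` instances from `DataStore::find_reincarnatable_instances` and then applies checks like these before spawning `instance-start` sagas.

```rust
use chrono::{DateTime, Duration, Utc};

#[derive(PartialEq)]
enum InstanceState { Failed, Vmm /* has an active VMM */ }

#[derive(PartialEq)]
enum AutoRestartPolicy { Never, BestEffort }

/// Hypothetical per-instance eligibility check for the reincarnation task.
fn eligible_for_reincarnation(
    state: &InstanceState,
    policy: &AutoRestartPolicy,
    last_auto_restart: Option<DateTime<Utc>>,
    cooldown: Duration,
    now: DateTime<Utc>,
) -> bool {
    // 1. The instance itself must be `Failed` (no active VMM awaiting cleanup).
    *state == InstanceState::Failed
        // 2. Its auto-restart policy must permit restarting it.
        && *policy == AutoRestartPolicy::BestEffort
        // 3. The cooldown since its last automatic restart must have elapsed
        //    (an instance that has never been auto-restarted is eligible).
        && last_auto_restart.map_or(true, |t| now - t >= cooldown)
}
```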
PR #6503 changed Nexus to attempt to automatically restart instances which are in the `Failed` state. Now that we do this, we should probably change the allowable instance state transitions to permit a user to stop an instance that is `Failed`, as a way to say "stop trying to restart this instance" (as `Stopped` instances are not restarted). This branch changes `Nexus::instance_request_state` and `select_instance_change_action` to permit stopping a `Failed` instance. Fixes #6640. I believe this also fixes #2825, along with #6455 (which allowed restarting `Failed` instances).
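A tiny sketch of the kind of transition check this loosens. This is hypothetical: the real logic lives in `select_instance_change_action`, and only the `Failed` case is what the description above actually adds.

```rust
/// Simplified stand-in for the instance's externally visible state.
enum InstanceState { Running, Failed, Stopped }

/// Hypothetical sketch: whether a user "stop" request is accepted for an
/// instance in the given state. The change above adds the `Failed` arm;
/// other states keep whatever behavior they already had (not modeled here).
fn stop_allowed(state: &InstanceState) -> Option<bool> {
    match state {
        // New: stopping a `Failed` instance opts it out of auto-restart.
        InstanceState::Failed => Some(true),
        InstanceState::Running => Some(true),
        _ => None, // unchanged behavior, deliberately left unspecified
    }
}
```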
Currently, instances whose active VMM is `SagaUnwound` appear externally as `Stopped`. We decided to report them as `Stopped` because start sagas are permitted to run for instances with `SagaUnwound` active VMMs, and --- at the time when the `SagaUnwound` VMM state was introduced --- `Failed` instances could not be started. However, #6455 added the ability to restart `Failed` instances, and #6652 will permit them to be stopped. Therefore, we should recast instances with `SagaUnwound` active VMMs as `Failed`: they weren't asked politely to stop; instead, we attempted to start them and something went wrong...which sounds like `Failed` to me. This becomes more important in light of #6638: if we are going to automatically restart such instances, they should definitely appear to be `Failed`. The distinction between `Failed` and `Stopped` becomes that `Failed` means "this thing isn't running, but it's supposed to be; we may try to fix that for you if permitted to do so", while `Stopped` means "this thing isn't running and that's fine, because you asked for it to no longer be running". Thus, this commit changes `SagaUnwound` VMMs to appear `Failed` externally.
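A minimal sketch of the external-state mapping this change adjusts. The enums are simplified stand-ins, not the real omicron types; the `Stopping`-while-cleanup behavior for `Failed` VMMs comes from the #6455 description below.

```rust
/// Simplified stand-ins for the states discussed in this thread.
enum VmmState { SagaUnwound, Failed /* , ... */ }
enum ExternalState { Failed, Stopping /* , ... */ }

/// Hypothetical sketch: external state reported while this VMM is still the
/// instance's active VMM.
fn external_state_for_active_vmm(vmm: &VmmState) -> ExternalState {
    match vmm {
        // Previously surfaced as `Stopped`; after this change it reads as
        // `Failed`, since the instance was supposed to be running.
        VmmState::SagaUnwound => ExternalState::Failed,
        // Per #6455, a `Failed` active VMM appears as `Stopping` until an
        // update saga cleans it up and the instance itself becomes `Failed`.
        VmmState::Failed => ExternalState::Stopping,
    }
}
```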
RFD 486 describes a principled design for how to use the `Failed` state for instances and VMMs. In particular, it proposes that:

- Instances and VMMs move to the `Failed` state independently.
  - A VMM moves to `Failed` if either: Nexus observes that a VMM it expected to find on a sled is no longer resident there, or, it determines that all VMMs that were resident on that sled must now be gone.
  - An instance moves to `Failed` when an `instance-update` saga observes that the instance's active VMM has been marked `Failed`. An update saga transitioning an instance to `Failed` cleans up the instance's resource allocations, similarly to how update sagas handle transitioning an instance to `Destroyed` when its active VMM has been destroyed.
- An instance in the `Failed` state appears externally to be `Failed`.
  - While an instance has a `Failed` active VMM, the instance appears externally to be `Stopping`, and then transitions to `Failed`.
- Once an instance is `Failed`, it can be started or destroyed.
This branch implements the behavior described above. In particular, I've added code to the `instance-update` saga to handle instances whose active or migration-target VMMs have transitioned to `Failed`, similarly to how it handles the VMMs transitioning to `Destroyed`. I've changed the code that detects whether a sled-agent error indicates that an instance has failed to only do so when the error is a 404 with the specific error code that indicates that the sled-agent has forgotten about a VMM that it previously knew about, and changed the Nexus functions that ask sled-agents to update an instance's state, and the `instance_watcher` background task, to mark instances as `Failed` only when they encounter that sled-agent error. And, I've changed the `mark_instance_failed` function in Nexus to instead mark the VMM record as failed and trigger an instance-update saga (it's now named `mark_vmm_failed`), and made similar changes to the `instance_watcher` background task. Finally, I've also made sure that `Failed` instances can be destroyed and restarted, and that instances with `Failed` active VMMs appear to be `Stopping` until the VMM is cleaned up by an update saga.
In addition to this, it was necessary to (re)introduce a `VmmState::Creating` variant. The purpose of this variant is to differentiate VMMs which have been created in the database but do not yet exist on a sled from those which are known to sled-agent and are actually starting or migrating.

This is necessary because, presently, the `instance_watcher` background task will attempt to perform health checks for `Starting` and `Migrating` VMMs, and --- if the sled-agent does not actually know about the VMM yet --- will move it to `Failed`. But when a VMM is newly created, the sled-agent won't know about it until it has been registered, so if the `instance_watcher` task activates while an `instance-start` or `instance-migrate` saga is in flight, it may incorrectly move the VMM to `Failed` even though the VMM record has simply been created but not yet registered. Now, with the `Creating` variant, we can avoid performing health checks for those VMMs until the sled-agent actually acknowledges a request to register the VMM.
Besides the behavior implemented here, RFD 486 also proposes that the `boot_on_fault` field in the instance record be used by Nexus to determine whether to automatically attempt to restart a `Failed` instance. This will be implemented in a subsequent branch --- see issue #6491 (and also #4872).