
[nexus] handle sled-agent errors as described in RFD 486 #6455

Merged: 25 commits merged into main from eliza/486-sa-errors on Sep 2, 2024

Conversation

@hawkw (Member) commented Aug 27, 2024

RFD 486 describes a principled design for how to use the Failed state
for instances and VMMs. In particular, it proposes that:

  • Instances and VMMs should move to the Failed state independently.
  • A VMM goes to Failed if either:
    • Nexus observes that a VMM it thought was resident on a sled is no
      longer resident there, or,
    • it observes through some other means (sled expungement, sled reboot)
      that all VMMs that were resident on that sled must now be gone.
  • Instances go to Failed when an instance-update saga observes that
    an instance's active VMM has been marked Failed. An update saga
    transitioning an instance to failed cleans up the instance's resource
    allocations, similarly to how update sagas handle transitioning an
    instance to Destroyed when its active VMM has been destroyed.
  • For purposes of the external API:
    • An instance that is itself in the Failed state appears externally
      to be Failed.
    • If an instance is in the "has-VMM" internal state and points to a
      Failed VMM, it appears externally to be Stopping and then
      transitions to Failed (see the sketch after this list).
    • Once the actual instance (not just its active VMM) is Failed, it
      can be started or destroyed.
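
A minimal sketch of that external-state mapping, using simplified
stand-in enums and a hypothetical helper rather than the actual Nexus
types:

// Sketch only: stand-in types; the real mapping lives in Nexus' external
// API code and covers more states than shown here.
enum InstanceState { NoVmm, Vmm, Failed, Destroyed }
enum VmmState { Starting, Running, Stopping, Failed, Destroyed }
enum ExternalState { Starting, Running, Stopping, Stopped, Failed, Destroyed }

fn external_state(instance: InstanceState, active_vmm: Option<VmmState>) -> ExternalState {
    match (instance, active_vmm) {
        // An instance that is itself Failed appears Failed externally.
        (InstanceState::Failed, _) => ExternalState::Failed,
        // An instance whose active VMM is Failed appears to be Stopping
        // until an update saga cleans up the VMM and marks the instance
        // itself as Failed.
        (InstanceState::Vmm, Some(VmmState::Failed)) => ExternalState::Stopping,
        (InstanceState::Vmm, Some(VmmState::Running)) => ExternalState::Running,
        (InstanceState::Vmm, Some(VmmState::Starting)) => ExternalState::Starting,
        // ...remaining combinations elided from this sketch...
        _ => ExternalState::Stopped,
    }
}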

This branch implements the behavior described above.

In particular, I've added code to the instance-update saga to handle
instances whose active or migration target VMMs have transitioned to
Failed, similarly to how it handles the VMMs transitioning to
Destroyed. I've changed the code that decides whether a sled-agent
error indicates instance failure so that it does so only when the error
is a 404 with the specific error code indicating that the sled-agent has
forgotten about a VMM it previously knew about. The Nexus functions that
ask sled-agents to update an instance's state, as well as the
instance_watcher background task, now mark instances as Failed only
when they encounter that specific error. I've also
changed the mark_instance_failed function in Nexus to instead mark the
VMM record as failed and trigger an instance-update saga (and it's now
named mark_vmm_failed), and made similar changes to the
instance_watcher background task. Finally, I've also made sure that
Failed instances can be destroyed and restarted, and that instances
with Failed active VMMs appear to be Stopping until the VMM is
cleaned up by an update saga.
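
Conceptually, the check for "this sled-agent error means the VMM is
gone" is deliberately narrow. A rough sketch, with a hypothetical helper
name and a stand-in error-code string rather than the actual Nexus code:

// Sketch: decide whether a sled-agent error means the VMM is actually gone
// (so its record should be marked Failed) rather than a transient problem.
// "NO_SUCH_INSTANCE" is a stand-in for the specific error code described
// above.
fn error_indicates_vmm_gone(status: http::StatusCode, error_code: Option<&str>) -> bool {
    // Only a 404 carrying the "sled-agent has forgotten this VMM" code
    // counts; timeouts, 5xx responses, and other 404s are treated as
    // transient and do not move the VMM to Failed.
    status == http::StatusCode::NOT_FOUND && error_code == Some("NO_SUCH_INSTANCE")
}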

In addition to this, it was necessary to (re)introduce a
VmmState::Creating variant. The purpose of this variant is to
differentiate VMMs which have been created in the database but do not
yet exist on a sled from those which are known to sled-agent and are
actually starting or migrating.

This is necessary because presently, the instance_watcher background
task will attempt to perform health checks for Starting and
Migrating VMMs, and --- if the sled-agent does not actually know about
the VMM yet --- will move it to Failed. But the sled-agent doesn't know
about a newly created VMM until that VMM has been registered with it, so
if the instance_watcher task activates while an instance-start or
instance-migrate saga is in flight, it may incorrectly move the saga's
VMM to Failed if the VMM record has been created but not yet registered.
Now,
after adding the Creating variant, we can avoid performing health
checks for those VMMs until the sled-agent actually acknowledges a
request to register the VMM.
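
A rough sketch of the idea, using a simplified enum and a hypothetical
helper rather than the actual instance_watcher code:

// Sketch: a simplified VMM state enum including the new Creating variant.
enum VmmState {
    // The VMM record exists in the database, but the VMM has not yet been
    // registered with a sled-agent; health checks are skipped in this state.
    Creating,
    Starting,
    Running,
    Migrating,
    Stopping,
    Stopped,
    Failed,
    Destroyed,
}

// The instance_watcher task should only health-check VMMs that a sled-agent
// is expected to know about; a 404 for a Creating VMM is expected rather
// than a sign of failure.
fn should_health_check(state: &VmmState) -> bool {
    !matches!(state, VmmState::Creating)
}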

Besides the behavior implemented here, RFD 486 also proposes that the
boot_on_fault field in the instance record be used by Nexus to
determine whether to automatically attempt to restart a Failed
instance. This will be implemented in a subsequent branch --- see issue
#6491 (and also #4872).

@hawkw force-pushed the eliza/486-sa-errors branch from 997f385 to d1e7165 on August 29, 2024 18:58
@hawkw changed the title from "[WIP] implement RFD 486" to "[nexus] implement RFD 486" on Aug 29, 2024
@hawkw requested a review from gjcolombo on August 29, 2024 22:39
@hawkw force-pushed the eliza/486-sa-errors branch from 0b61ff4 to b811827 on August 30, 2024 18:14
Comment on lines 102 to 139
match value.0 {
    // Errors communicating with the sled-agent should return 502 Bad
    // Gateway to the caller.
    progenitor_client::Error::CommunicationError(e) => {
        // N.B.: it sure *seems* like we should be using the
        // `HttpError::for_status` constructor here, but that
        // constructor secretly calls the `for_client_error`
        // constructor, which panics if we pass a status code that isn't
        // a 4xx error. So, instead, we construct an internal error and
        // then munge its status code.
        let mut error = HttpError::for_internal_error(e.to_string());
        error.status_code = if e.is_timeout() {
            // This isn't actually in the RFD, but I thought it might be
            // nice to return 504 Gateway Timeout when the communication
            // error is a timeout...
            http::StatusCode::GATEWAY_TIMEOUT
        } else {
            http::StatusCode::BAD_GATEWAY
        };
        error
    }
    // Invalid request errors from the sled-agent should return 500
    // Internal Server Errors.
    progenitor_client::Error::InvalidRequest(s) => {
        HttpError::for_internal_error(s)
    }
    e => HttpError::for_internal_error(e.to_string()),
}
@hawkw (Member Author) commented:
@gjcolombo do you think it's worth adding an external message to any of these errors? I was thinking that if self.instance_unhealthy() is true, it might be nice to return an error with a message like "instance failure detected" (and perhaps that should be a 503 Service Unavailable rather than 500?), and maybe the 502s and 504s should say "error communicating with sled-agent" or something? I dunno...

@hawkw changed the title from "[nexus] implement RFD 486" to "[nexus] handle sled-agent errors as described in RFD 486" on Aug 30, 2024
@hawkw marked this pull request as ready for review on August 30, 2024 18:15
@hawkw (Member Author) commented Aug 30, 2024

Noticed that I had forgotten to make instances with Failed active VMMs candidates for the background instance-updater task, so I'm doing that now as well.

Edit: done in 8c43ba1

@gjcolombo (Contributor) left a comment:

Great stuff here--this was much more straightforward than I had anticipated when I wrote the RFD! I have a few nits and small fixes to suggest but nothing blocking.

Review threads (resolved): nexus/src/app/instance.rs, nexus/src/app/sagas/instance_common.rs, nexus/db-model/src/vmm_state.rs, clients/nexus-client/src/lib.rs, nexus/db-queries/src/db/datastore/vmm.rs
@hawkw requested a review from smklein on August 30, 2024 22:24
@hawkw (Member Author) commented Aug 30, 2024

@gjcolombo okay, I think I've addressed all your suggestions!

@gjcolombo (Contributor) left a comment:

New changes LGTM!

hawkw added 13 commits September 1, 2024 13:17
Now, it just moves the `vmm` record to `Failed`, and triggers an update
saga, which will (eventually) advance the instance record to failed.
But I still need to do that part :)

Also, some error type munging, which I don't actually need yet.
This commit introduces a new `VmmState` variant, `Creating`. The purpose
of this variant is to differentiate between VMMs which have been created
in the database but do not yet exist on a sled, from those which are
known to sled-agent and are actually starting/migrating.

This is necessary because presently, the `instance-watcher` background
task will attempt to perform health checks for `Starting` and
`Migrating` VMMs, and --- if the sled-agent does not actually know about
the VMM yet --- will move it to `Failed`. But, when a VMM is newly
created, the sled-agent won't know about it yet until it has been
registered with the sled-agent, so if the `instance-watcher` task
activates while an `instance-start` or `instance-migrate` saga is in
flight, the `instance-watcher` task may incorrectly move its VMM to
`Failed` if the VMM record has been created but not yet registered. Now,
after adding the `Creating` variant, we can avoid performing health
checks for those VMMs until the sled-agent actually acknowledges a
request to register the VMM.
@hawkw force-pushed the eliza/486-sa-errors branch from 01454f5 to 1b3fcad on September 1, 2024 20:17
@hawkw (Member Author) commented Sep 1, 2024

I've manually verified this change on madrid by creating an instance, power-cycling the sled that it's on, watching the instance go to Failed, and then restarting it. Everything seems to work as expected.

On madrid in the switch zone, I run omdb db instances, and see that my instance is in the Running state, and on sled BRM42220007:

root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running 7f62223f-8d31-419e-a4e2-1cff197f9879 4d86e152-aa9b-4bb5-a1ea-9b4161cbb81e BRM42220007

Then, I power-cycle that sled. Note that I didn't think long enough to realize that the sled I was about to power cycle was the scrimlet I was currently connected to, so I then had to wait for it to come back... 😅

root@oxz_switch1:~# pilot sp cycle BRM42220007
16  BRM42220007        client_loop: send disconnect: Broken pipe
eliza@jeeves ~ $ ssh madridswitch
(root@fe80::aa40:25ff:fe05:602%madrid_sw1tp0) Password:

eliza@jeeves ~ $ /staff/lab/madrid/enable-tp-announce.sh
+ ssh madridgz bash
+ mkdir -p /usr/share/compliance
+ pilot gimlet info -j
+ cat
+ zlogin oxz_switch bash /tmp/enable-tp-announce.sh
+ sed -i 's/^PASSREQ/#&/' /etc/default/login
+ sed -i /PermitRoot/s/no/yes/ /etc/ssh/sshd_config
+ sed -i /PermitEmpty/s/no/yes/ /etc/ssh/sshd_config
+ sed -i /PasswordAuth/s/no/yes/ /etc/ssh/sshd_config
+ sed -i '/AllowUsers/s/$/ root/' /etc/ssh/sshd_config
+ passwd -d root
passwd: password information changed for root
+ svcadm restart ssh
+ nohup pilot techport announce techport0 techport1
eliza@jeeves ~ $ ssh madridswitch
The illumos Project     helios-2.0.22845        August 2024

Now that the scrimlet is back up, I run omdb db instances again, and note that the instance has now transitioned to Failed:

root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE  PROPOLIS_ID          SLED_ID           HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance failed <no active Propolis> <not on any sled> -
root@oxz_switch1:~#

Back on my laptop, I restart the failed instance:

eliza@theseus ~/Code/oxide/oxide.rs $ oxide instance start --instance 8cb8e347-99bf-4769-a5db-022a14cb0fed
{
  "description": "",
  "hostname": "my-instance",
  "id": "8cb8e347-99bf-4769-a5db-022a14cb0fed",
  "memory": 8589934592,
  "name": "my-instance",
  "ncpus": 2,
  "project_id": "b432681c-fadc-479c-9b45-40043cceab25",
  "run_state": "starting",
  "time_created": "2024-09-01T21:19:52.057567Z",
  "time_modified": "2024-09-01T21:19:52.057567Z",
  "time_run_state_updated": "2024-09-01T21:33:07.391957Z"
}

Finally, back on the scrimlet, we can see the instance has been restarted successfully, and is now on a different sled.

root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running f10379bd-cb58-4b21-882a-d0916a31a7f8 8e47468a-fee0-4a1d-ae4c-1552a7b873b0 BRM42220004
root@oxz_switch1:~#

@hawkw (Member Author) commented Sep 1, 2024

Power-cycling BRM42220004 also results in the instance going to Failed, although it takes a little while for the instance-watcher background task to actually run:

root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running f10379bd-cb58-4b21-882a-d0916a31a7f8 8e47468a-fee0-4a1d-ae4c-1552a7b873b0 BRM42220004
root@oxz_switch1:~# pilot sp cycle BRM42220004
17  BRM42220004        ok
root@oxz_switch1:~# pilot sp status BRM42220004
17  BRM42220004        on (A0)
root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance running f10379bd-cb58-4b21-882a-d0916a31a7f8 8e47468a-fee0-4a1d-ae4c-1552a7b873b0 BRM42220004
root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:102::3]:32221,[fd00:1122:3344:103::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (94.0.0)
ID                                   NAME        STATE  PROPOLIS_ID          SLED_ID           HOST_SERIAL
8cb8e347-99bf-4769-a5db-022a14cb0fed my-instance failed <no active Propolis> <not on any sled> -
root@oxz_switch1:~#

@hawkw merged commit 46eb2b4 into main on Sep 2, 2024
22 checks passed
@hawkw deleted the eliza/486-sa-errors branch on September 2, 2024 18:41
hawkw added a commit that referenced this pull request Sep 23, 2024
As of #6455, instances in the `Failed` state are allowed to be
restarted. As described in [RFD 486 § 4.3], Nexus should automatically
restart such instances when the instance record's `auto_restart_policy`
field (née `boot_on_fault`, see #6499) indicates that it is permitted to
do so. This branch implements this behavior, by adding a new
`instance_reincarnation` RPW. This RPW consists of a background task
which queries CRDB for instances in the `Failed` state which are
eligible to be automatically restarted, and starts a new
`instance-start` saga for any instances it discovers.

Instances are considered eligible for reincarnation if all of the
following conditions are true (a rough sketch of this check follows the
list):

- **The instance is in the `InstanceState::Failed` state.** Meaning, it
  must have *no active VMM* --- if the instance is in the
  `InstanceState::Vmm` state with an active VMM which is in
  `VmmState::Failed`, that indicates no instance-update saga has run to
  clean up the old VMM's resources. In this case, we must wait until the
  instance record itself has transitioned to `Failed` before attempting
  to restart it.
- **The instance's `auto_restart_policy` permits it to be restarted.**
  The `auto_restart_policy` enum was changed from representing the
  classes of failures on which an instance may be restarted (`never`,
  `sled_failures_only`, and `all_failures`) to a more general
  quality-of-service setting for automatic restarts. Presently, this can
  either be `never`, meaning that the instance can never be restarted
  automatically; or `best_effort`, meaning that the control plane should
  try to keep the instance running but is permitted to not restart it if
  necessary to preserve the stability of the whole system. The default
  policy for instances which don't provide one is `best_effort`.
- **A cooldown period has elapsed since the instance's last automatic
  restart.** In order to prevent instances which fail frequently from
  compromising the reliability of the system as a whole, we enforce a
  cooldown period between automatic restarts. This is tracked by setting
  a last-auto-restart timestamp in the `instance` record. At present,
  the cooldown period is always one hour unless the instance record
  overrides it, which currently can only be done using a test-only Nexus
  method. In the future, we may consider allowing the cooldown period to
  be configured by users, but because it presents a denial-of-service
  risk, setting it should probably be a more privileged operation than
  creating an instance.
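
A rough sketch of the eligibility predicate described by this list
(simplified stand-in types and hypothetical names; the real check is a
database query plus logic in the background task, not a single in-memory
function):

use chrono::{DateTime, Duration, Utc};

// Sketch only: stand-in types for the fields discussed above.
#[derive(PartialEq)]
enum InstanceState { NoVmm, Vmm, Failed }
#[derive(PartialEq)]
enum AutoRestartPolicy { Never, BestEffort }

fn eligible_for_reincarnation(
    state: InstanceState,
    policy: AutoRestartPolicy,
    time_last_auto_restarted: Option<DateTime<Utc>>,
    cooldown: Duration,
    now: DateTime<Utc>,
) -> bool {
    // The instance must be Failed at the instance level (no active VMM left
    // to clean up), its policy must allow automatic restarts, and the
    // cooldown must have elapsed since the last automatic restart, if any.
    state == InstanceState::Failed
        && policy == AutoRestartPolicy::BestEffort
        && time_last_auto_restarted
            .map(|t| now - t >= cooldown)
            .unwrap_or(true)
}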

The implementation of the actual reincarnation background task is
relatively straightforward: it runs a new
`DataStore::find_reincarnatable_instances` query, which returns
instances in the `Failed` state, and it spawns `instance-start` sagas
for any instances which satisfy the above conditions. The restart
cooldown is implemented by adding a new `time_last_auto_restarted` field
to the `instance` table and to the `InstanceRuntimeState` type. We add a
`Reason` enum to the `instance-start` saga's `Params` that indicates why
the instance is being started, and if it is `Reason::AutoRestart`, then
the start saga will set the `time_last_auto_restarted` timestamp when
moving the instance record to the `Starting` state. Additionally, I've
added the `auto_restart_policy` field to the `params::InstanceCreate`
type in the Nexus API so that instances can be created with auto-restart
policies. Reviewers should note that many of the files changed in this
diff were only modified by adding that field to a
`params::InstanceCreate` literal.
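
For illustration, the shape of that saga-parameter change might look
roughly like the following (hypothetical helper, field, and struct
layout; only the `Reason` enum and the `time_last_auto_restarted` field
are described above):

// Sketch: why an instance-start saga is running, so the saga knows whether
// to record an automatic restart when moving the instance to Starting.
enum Reason {
    User,
    AutoRestart,
}

struct Params {
    // ...existing start-saga parameters elided from this sketch...
    reason: Reason,
}

// Only automatic restarts update the instance's last-auto-restart timestamp.
fn new_time_last_auto_restarted(
    reason: &Reason,
    now: chrono::DateTime<chrono::Utc>,
) -> Option<chrono::DateTime<chrono::Utc>> {
    match reason {
        Reason::AutoRestart => Some(now),
        Reason::User => None,
    }
}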

This branch adds a minimal, initial implementation of instance
auto-restarting.

[RFD 486 § 4.3]: https://rfd.shared.oxide.computer/rfd/486#_instances

Fixes #5553
hawkw added a commit that referenced this pull request Sep 24, 2024
PR #6503 changed Nexus to attempt to automatically restart instances
which are in the `Failed` state. Now that we do this, we should probably
change the allowable instance state transitions to permit a user to stop
an instance that is `Failed`, as a way to say "stop trying to restart
this instance" (as `Stopped` instances are not restarted). This branch
changes `Nexus::instance_request_state` and
`select_instance_change_action` to permit stopping a `Failed` instance.

Fixes #6640

I believe this also fixes #2825, along with #6455 (which allowed
restarting `Failed` instances).
hawkw added a commit that referenced this pull request Sep 24, 2024
Currently, instances whose active VMM is `SagaUnwound` appear externally
as `Stopped`. We decided to report them as `Stopped` because start sagas
are permitted to run for instances with `SagaUnwound` active VMMs, and
--- at the time when the `SagaUnwound` VMM state was introduced ---
`Failed` instances could not be started. However, #6455 added the
ability to restart `Failed` instances, and #6652 will permit them to be
stopped. Therefore, we should recast instances with `SagaUnwound` active
VMMs as `Failed`: they weren't asked politely to stop; instead, we
attempted to start them and something went wrong...which sounds like
`Failed` to me.

This becomes more important in light of #6638: if we are going to
attempt to automatically restart such instances, they should definitely
appear to be `Failed`. The distinction between `Failed` and `Stopped` becomes
that `Failed` means "this thing isn't running, but it's supposed to be;
we may try to fix that for you if permitted to do so", while `Stopped`
means "this thing isn't running and that's fine, because you asked for
it to no longer be running". Thus, this commit changes `SagaUnwound`
VMMs to appear `Failed` externally.
hawkw added a commit that referenced this pull request Sep 24, 2024
PR #6503 changed Nexus to attempt to automatically restart instances
which are in the `Failed` state. Now that we do this, we should probably
change the allowable instance state transitions to permit a user to stop
an instance that is `Failed`, as a way to say "stop trying to restart
this instance" (as `Stopped` instances are not restarted). This branch
changes `Nexus::instance_request_state` and
`select_instance_change_action` to permit stopping a `Failed` instance.

Fixes #6640
Fixes #2825, along with #6455 (which allowed restarting `Failed`
instances).