[nexus] Mark VMMs on Expunged sleds as Failed #6519

Merged: hawkw merged 6 commits into main from eliza/fail-expunged-vmms on Sep 9, 2024

Conversation

@hawkw (Member) commented Sep 4, 2024

When a sled is expunged, any VMMs previously believed to be on that sled
are definitely no longer present. Currently, expunging a sled will not
do anything to VMMs resident on that sled, and the instances will remain
in the Running state...but it won't be possible to stop them, as their
sled-agent will probably no longer be reachable. In conjunction with
#6503, this commit will implement most of the behavior described in
issue #4872; see this comment for details.

This commit updates the instance-watcher background task to mark any
VMMs on Expunged sleds as Failed, so that those VMMs' instances can
transition to Failed and be restarted or deleted normally. It seems
better to do this in the instance-watcher background task, rather than
in the code paths for actually marking a sled as expunged, because
transitioning a VMM to failed triggers an instance-update saga to move
the instance to failed, and doing all of that in the transaction that
expunges the sled seems unfortunate --- expunging the sled would be a
much longer-running operation, and we'd have to be careful to not roll
it back if one update saga fails or whatever. So, instead, I think it's
nicer to just mark the sled as expunged and then let the background task
do the remaining work. And, the instance-watcher background task
already queries for all instances and their associated sleds, so it
makes sense to do the transition there rather than in the
instance-updater background task, since we'd have to add an additional
JOIN to look at sleds, which seems unfortunate. Also, this way,
VMMs that are marked as Failed due to sled expungement can also be
reflected in the metrics, which seems nice.

This was, overall, pretty straightforward: I've just removed the
SledFilter::InService from the instance_and_vmm_list_by_sled_agent
query,1 passed the Sled into the check_instance function, and
added a code path to skip the actual HTTP health check if the sled is
Expunged and just go ahead and mark the VMM as Failed. I think it
still makes sense to spawn a task for these VMMs, even though they don't
actually do HTTP requests to the sled-agent, because they will attempt
to run a whole instance-update saga to transition the instance to
Failed, which can take a bit of time. It seems nice to just keep
starting other health checks while that runs, instead of doing it
sequentially.
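Roughly, the new code path has this shape. This is a simplified sketch for illustration: SledPolicy, CheckOutcome, and this check_instance signature are stand-ins, not the actual Nexus definitions.

```rust
// Simplified stand-ins for the real Nexus model types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SledPolicy {
    InService,
    Expunged,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum VmmState {
    Running,
    Failed,
}

struct Sled {
    policy: SledPolicy,
}

struct Vmm {
    state: VmmState,
}

/// What a single check accomplished.
#[derive(Debug, PartialEq)]
enum CheckOutcome {
    /// The sled is in service, so the normal HTTP health check ran.
    Checked,
    /// The sled was expunged, so the VMM was marked Failed without
    /// making any HTTP request.
    VmmFailed,
}

fn check_instance(sled: &Sled, vmm: &mut Vmm) -> CheckOutcome {
    if sled.policy == SledPolicy::Expunged {
        // The sled-agent is definitely gone: skip the health check and
        // mark the VMM Failed so an instance-update saga can move the
        // instance to Failed.
        vmm.state = VmmState::Failed;
        return CheckOutcome::VmmFailed;
    }
    // Otherwise, perform the usual HTTP health check against the
    // sled-agent (elided in this sketch).
    CheckOutcome::Checked
}

fn main() {
    let expunged = Sled { policy: SledPolicy::Expunged };
    let mut vmm = Vmm { state: VmmState::Running };
    assert_eq!(check_instance(&expunged, &mut vmm), CheckOutcome::VmmFailed);
    assert_eq!(vmm.state, VmmState::Failed);

    let in_service = Sled { policy: SledPolicy::InService };
    let mut vmm = Vmm { state: VmmState::Running };
    assert_eq!(check_instance(&in_service, &mut vmm), CheckOutcome::Checked);
}
```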

In addition, commit e238d56
changes the instance-watcher task to only increment the metrics for
InstanceState::Failed when a VMM has actually transitioned to Failed.
Error responses from sled-agent that don't mark a VMM as failed are now
considered "incomplete" checks. It seemed wrong to me to increment
metrics with instance_state="failed" when the instance's state is not
actually Failed.
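As a rough illustration of that bookkeeping (the CheckResult enum and metric keys below are invented for this sketch, not the real instance-watcher or oximeter types): only a completed check contributes to a per-state counter, while sled-agent errors that leave the VMM untouched are counted separately as incomplete.

```rust
use std::collections::HashMap;

/// Hypothetical result of one health check, standing in for the real
/// instance-watcher bookkeeping.
enum CheckResult {
    /// The check completed and observed this instance state.
    Completed(&'static str),
    /// The check could not determine the instance's state, e.g. the
    /// sled-agent returned an error that does not mark the VMM Failed.
    Incomplete(&'static str),
}

fn record(metrics: &mut HashMap<String, u64>, result: CheckResult) {
    match result {
        // Only completed checks feed the per-state counters, so
        // instance_state="failed" is only incremented when the VMM
        // really transitioned to Failed.
        CheckResult::Completed(state) => {
            *metrics
                .entry(format!("check.instance_state={state}"))
                .or_default() += 1;
        }
        // Errors that don't change the VMM's state are tracked as
        // incomplete checks instead.
        CheckResult::Incomplete(reason) => {
            *metrics
                .entry(format!("check.incomplete.reason={reason}"))
                .or_default() += 1;
        }
    }
}

fn main() {
    let mut metrics = HashMap::new();
    record(&mut metrics, CheckResult::Completed("failed"));
    record(&mut metrics, CheckResult::Incomplete("sled_agent_error"));
    println!("{metrics:#?}");
}
```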

Footnotes

  1. Apparently there were actually two separate
    SledFilter::InService calls, which tripped me up for longer than
    I would like to admit...

@davepacheco (Collaborator) left a comment

Cool! Just a few nits on this change.

This was my first time looking at the watcher, and I noticed a few things that aren't made worse by this change, but I wanted to mention them.

On instance_and_vmm_list_by_sled_agent(): this is paginated by just sled_id, but that doesn't quite work. The pagination marker must be unique for each row. I think you'll want to use paginated_multicolumn with another id here, maybe either the instance id or vmm id? As is, I'd expect this code to break when the list of instances returned for a particular sled spans a pagination boundary. (What I'd expect would happen there is: we process each page fine up through the first page of results containing that sled id in it, then we use that sled id as the marker, and the subsequent query would fetch results with sled ids >= the one given, and so you'd miss all the rows for that sled id that weren't on the first page of results for that sled id.) I was surprised to see that this is not documented on either Paginator::found_batch() or paginated() :(
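To make the failure mode concrete, here's a small self-contained sketch over an in-memory Vec (plain Rust, not the real Paginator or Diesel code) showing how a strictly-greater-than marker on sled_id alone drops rows when one sled's results span a page boundary, and how a compound (sled_id, vmm_id) marker avoids that:

```rust
/// A (sled_id, vmm_id) row, standing in for one result of
/// instance_and_vmm_list_by_sled_agent. Plain u32s instead of UUIDs to
/// keep the sketch self-contained.
type Row = (u32, u32);

/// Fetch the next page of rows with sled_id strictly greater than the
/// marker: the buggy, single-column scheme.
fn page_by_sled_id(rows: &[Row], marker: Option<u32>, batch: usize) -> Vec<Row> {
    rows.iter()
        .filter(|(sled, _)| marker.map_or(true, |m| *sled > m))
        .take(batch)
        .copied()
        .collect()
}

/// Fetch the next page with (sled_id, vmm_id) strictly greater than the
/// compound marker: lexicographic comparison makes the marker unique per
/// row, so nothing is skipped.
fn page_by_sled_and_vmm(rows: &[Row], marker: Option<Row>, batch: usize) -> Vec<Row> {
    rows.iter()
        .filter(|&&row| marker.map_or(true, |m| row > m))
        .take(batch)
        .copied()
        .collect()
}

fn main() {
    // Three VMMs on sled 1, one on sled 2; batch size 2.
    let rows: Vec<Row> = vec![(1, 10), (1, 11), (1, 12), (2, 20)];

    // Single-column marker: the second page starts after sled_id 1, so
    // (1, 12) is silently skipped.
    let first = page_by_sled_id(&rows, None, 2);
    let second = page_by_sled_id(&rows, first.last().map(|(s, _)| *s), 2);
    assert_eq!(second, vec![(2, 20)]); // (1, 12) was lost

    // Compound marker: every row shows up exactly once.
    let first = page_by_sled_and_vmm(&rows, None, 2);
    let second = page_by_sled_and_vmm(&rows, first.last().copied(), 2);
    assert_eq!(second, vec![(1, 12), (2, 20)]);
}
```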

I don't see a concurrency limit on the number of tokio tasks that we spin up to check instances. Is that right? That makes me pretty nervous, especially since each task makes individual requests to sled agent.

I'd strike MAX_SLED_AGENTS and use the nexus_db_queries::db::datastore::SQL_BATCH_SIZE constant instead. (When I saw MAX_SLED_AGENTS I thought it was using the pattern we used to use before we had Paginator, where we'd define a "big enough" limit and only ever look at that many things. It may have been at one time, but I see we're actually using it as the Paginator batch size. So it's not really a max, and it also isn't counting sled agents.)

(Review comment threads on nexus/tests/integration_tests/instances.rs and nexus/db-queries/src/db/datastore/instance.rs, both resolved.)
@davepacheco (Collaborator) commented

One more thing: what do you think about adding a live test for this? I thought that'd be nice but as I think about it, I'm not sure what happens if you try to provision an instance on a4x2.

hawkw added a commit that referenced this pull request Sep 5, 2024
The `instance_watcher` background task follows a pattern where it
queries the database for instances, and then spawns a big pile of Tokio
tasks to concurrently perform health checks for those instances. As
suggested by @davepacheco in [this comment][1], there should probably be
a limit on the number of concurrently running health checks to avoid
clobbering the sled-agents with a giant pile of HTTP requests.

[1]: #6519 (review)
hawkw added a commit that referenced this pull request Sep 5, 2024
The `instance_watcher` background task follows a pattern where it
queries the database for instances, and then spawns a big pile of Tokio
tasks to concurrently perform health checks for those instances. As
suggested by @davepacheco in [this comment][1], there should probably be
a limit on the number of concurrently running health checks to avoid
clobbering the sled-agents with a giant pile of HTTP requests.

This branch sets a global concurrency limit of 16 health checks using a
`tokio::sync::Semaphore`. This is a pretty arbitrary, and fairly low,
limit, but we can bump it up later if it turns out to be a bottleneck.

[1]: #6519 (review)
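As a standalone illustration of that semaphore pattern (the names `check_one` and `MAX_CONCURRENT_CHECKS` are invented here, and this is not the actual `instance_watcher` code; the final version of #6527 ended up limiting concurrency via the query batch size instead, as described further down):

```rust
// Requires the tokio crate (e.g. features = ["full"]).
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task::JoinSet;

/// Arbitrary concurrency limit, mirroring the "16 health checks" limit
/// described in the commit message above.
const MAX_CONCURRENT_CHECKS: usize = 16;

async fn check_one(id: u32) {
    // Stand-in for an HTTP health check against a sled-agent.
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
    println!("checked instance {id}");
}

#[tokio::main]
async fn main() {
    let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_CHECKS));
    let mut tasks = JoinSet::new();

    for id in 0..100u32 {
        let semaphore = Arc::clone(&semaphore);
        tasks.spawn(async move {
            // Each task waits for a permit before doing any work, so at
            // most MAX_CONCURRENT_CHECKS checks run at once. The permit
            // is released when `_permit` drops at the end of the task.
            let _permit = semaphore
                .acquire_owned()
                .await
                .expect("semaphore is never closed");
            check_one(id).await;
        });
    }

    // Wait for every check to finish.
    while tasks.join_next().await.is_some() {}
}
```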
@hawkw (Member, Author) commented Sep 5, 2024

> One more thing: what do you think about adding a live test for this? I thought that'd be nice but as I think about it, I'm not sure what happens if you try to provision an instance on a4x2.

I wasn't previously familiar with the live tests, but I'm happy to look into it!

hawkw added a commit that referenced this pull request Sep 5, 2024
Currently, the `paginated_multicolumn` utility in
`nexus_db_queries::pagination` only works when the select expression to
paginate is a table, and both columns to order by come from that table.
This means that it cannot easily be used to fix the bug in
`instance_and_vmm_list_by_sled_agent` that @davepacheco describes in
[this comment][1], which would require using `paginated_multicolumn` to
paginate on two columns in an inner join expression. 

This commit changes the giant wad of Diesel type ceremony on
`paginated_multicolumn` in order to ~~make it even worse~~ allow
expressions which are not tables to be paginated. I've added a test
demonstrating that this does, in fact, work.

Figuring out how to do this was...certainly an experience which I have
had. I think I need to lie down now.

[1]: #6519 (review)
hawkw added a commit that referenced this pull request Sep 6, 2024
The `instance_and_vmm_list_by_sled_agent` query in `nexus-db-queries`
returns a list of all instances, their projects, their active VMMs, and
the sleds those active VMMs are running on. This is used by the
`instance_watcher` background task to find VMMs to perform health checks
for. The query is paginated by sled IDs to limit the amount of data
fetched in one query, and so that the returned VMMs are grouped by sled,
allowing the `instance_watcher` task to reuse the same sled-agent client
connection pool for all VMMs on a sled.

Unfortunately, the current pagination is a bit wrong, as @davepacheco
kindly pointed out to me in [this comment][1]:

> This is paginated by just `sled_id`, but that doesn't quite work. The
> pagination marker must be unique for each row. I think you'll want to
> use `paginated_multicolumn` with another id here, maybe either the
> instance id or vmm id? As is, I'd expect this code to break when the
> list of instances returned for a particular sled spans a pagination
> boundary. (What I'd expect would happen there is: we process each page
> fine up through the first page of results containing that sled id in
> it, then we use that sled id as the marker, and the subsequent query
> would fetch results with sled ids `>=` the one given, and so you'd
> miss all the rows for that sled id that weren't on the first page of
> results for that sled id.) I was surprised to see that this is not
> documented on either `Paginator::found_batch()` or `paginated()` :(

The only reason we haven't encountered this bug yet is that currently,
the batch size for the query is fairly high. However, PR #6527 reduces
the batch size substantially to serve as a limit on the number of
concurrently in flight health checks, meaning we'll see some VMMs not
being health checked any time a sled has more than 16 VMMs on it.

This commit fixes the query by changing it to use
`paginated_multicolumn` to paginate the results by both the sled's ID
*and* the VMM's ID. This is made possible by #6530, which changed
`paginated_multicolumn` to allow the query fragment that's paginated to
be a `FROM` clause that's a `JOIN`, in addition to a single table ---
because we must join the `sled` and `vmm` tables in order to get both
IDs. This change required restructuring the order in which the query is
constructed, due to Diesel Reasons, so I took the liberty to add some
comments while I was moving stuff around.

[1]: #6519 (review)
@hawkw (Member, Author) commented Sep 6, 2024

Okay, #6534 fixes the pagination issue @davepacheco described, and #6527 rewrites the instance_watcher task to limit concurrent health checks. I'd like to get those merged, and then rebase this branch to pick up those changes, since they touch most of the code this relies on.

Regarding the suggestion of adding this to the live tests, I believe that a4x2 can't create instances, because there's no nested virtualization. It looks like the live tests can be run either against a4x2 or against a test rack like london or madrid, so perhaps we could add a test for instances on expunged sleds that only runs if the live test is running against a racklette, and not in a4x2? I think this would require extending the existing live-test support code with a way for the test to check if it's running against a4x2 or on real hardware, and adding a mechanism for a test to mark itself as skipped because it wasn't running on the right environment. This could just be something like:

if lc.is_a4x2() {
    return Ok(());
}

// ... the rest of the test ...

but I think it would be nicer if the live-test had a way to mark itself as skipped that was more visible to the user --- I think we could maybe do that using a custom test runner for the live tests, instead of just having the #[live_test] macro generate a #[tokio::test] (cc @sunshowers --- a custom test runner can still be used with Nextest, right?).

I'd like to do this eventually, as I think there's actually a lot of instance-lifecycle-related code that probably deserves live tests. But, I think that support for skipping some tests on environments that don't support them seems like enough work that I'd prefer to do it separately from this branch.

@sunshowers (Contributor) commented

> but I think it would be nicer if the live-test had a way to mark itself as skipped that was more visible to the user --- I think we could maybe do that using a custom test runner for the live tests, instead of just having the #[live_test] macro generate a #[tokio::test] (cc @sunshowers --- a custom test runner can still be used with Nextest, right?).

Yeah -- see the docs on custom test harnesses.

@hawkw force-pushed the eliza/fail-expunged-vmms branch from 80c87c5 to 2fae232 on September 7, 2024 17:54
@hawkw (Member, Author) commented Sep 7, 2024

I've rebased this branch to depend on #6527, since that PR rewrites much of the instance_watcher task. I'd like to merge that branch before this one.

hawkw added a commit that referenced this pull request Sep 9, 2024
The `instance_and_vmm_list_by_sled_agent` query in `nexus-db-queries`
returns a list of all instances, their projects, their active VMMs, and
the sleds those active VMMs are running on. This is used by the
`instance_watcher` background task to find VMMs to perform health checks
for. The query is paginated by sled IDs to limit the amount of data
fetched in one query, and so that the returned VMMs are grouped by sled,
allowing the `instance_watcher` task to reuse the same sled-agent client
connection pool for all VMMs on a sled.

Unfortunately, the current pagination is a bit wrong, as @davepacheco
kindly pointed out to me in [this comment][1]:

> This is paginated by just `sled_id`, but that doesn't quite work. The
> pagination marker must be unique for each row. I think you'll want to
> use `paginated_multicolumn` with another id here, maybe either the
> instance id or vmm id? As is, I'd expect this code to break when the
> list of instances returned for a particular sled spans a pagination
> boundary. (What I'd expect would happen there is: we process each page
> fine up through the first page of results containing that sled id in
> it, then we use that sled id as the marker, and the subsequent query
> would fetch results with sled ids `>=` the one given, and so you'd
> miss all the rows for that sled id that weren't on the first page of
> results for that sled id.) I was surprised to see that this is not
> documented on either `Paginator::found_batch()` or `paginated()` :(

The only reason we haven't encountered this bug yet is that currently,
the batch size for the query is fairly high. However, PR #6527 reduces
the batch size substantially to serve as a limit on the number of
concurrently in flight health checks, meaning we'll see some VMMs not
being health checked any time a sled has more than 16 VMMs on it.

This commit fixes the query by changing it to use
`paginated_multicolumn` to paginate the results by both the sled's ID
*and* the VMM's ID. This is made possible by #6530, which changed
`paginated_multicolumn` to allow the query fragment that's paginated to
be a `FROM` clause that's a `JOIN`, in addition to a single table ---
because we must join the `sled` and `vmm` tables in order to get both
IDs. This change required restructuring the order in which the query is
constructed, due to Diesel Reasons, so I took the liberty to add some
comments while I was moving stuff around.

[1]: #6519 (review)
hawkw added a commit that referenced this pull request Sep 9, 2024
The `instance_watcher` background task follows a pattern where it
queries the database for instances, and then spawns a big pile of Tokio
tasks to concurrently perform health checks for those instances. As
suggested by @davepacheco in [this comment][1], there should probably be
a limit on the number of concurrently running health checks to avoid
clobbering the sled-agents with a giant pile of HTTP requests.

This branch sets a global concurrency limit of 16 health checks (which
is fairly conservative, but we can turn it up later if it turns out to
be a bottleneck). The concurrency limit is implemented using the
database query's batch size. Previously, this code was written in a
slightly-weird dual-loop structure, which was intended specifically to
*avoid* the size of the database query batch acting as a concurrency
limit: we would read a page of sleds from CRDB, spawn a bunch of health
check tasks, and then read the next batch, waiting for the tasks to
complete only once all instance records had been read from the database.
Now, we can implement a concurrency limit by just...not doing that. We
now wait to read the next page of query results until we've run health
checks for every instance in the batch, limiting the number of
concurrently in flight checks. 

This has a nice advantage over the naïve approach of using a
`tokio::sync::Semaphore` or similar, which each health check task must
acquire before proceeding, as the concurrency limit: it also bounds the
amount of Nexus' memory used by the instance watcher. If we spawned all
the tasks immediately but made them wait to acquire a semaphore permit,
there would be a bunch of tasks in memory sitting around doing nothing
until the currently in flight tasks completed. With the batch size as
concurrency limit approach, we can instead avoid spawning those tasks at
all (and, avoid reading stuff from CRDB until we actually need it).

[1]: #6519 (review)
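In isolation, the batch-as-concurrency-limit structure described in the commit message above looks roughly like this (`fetch_page`, `check_one`, and `BATCH_SIZE` are invented stand-ins for the real pagination query, health check, and batch size; this is a sketch, not the actual task):

```rust
// Requires the tokio crate (e.g. features = ["full"]).
use tokio::task::JoinSet;

/// Stand-in for the database batch size that doubles as the
/// concurrency limit in this approach.
const BATCH_SIZE: usize = 16;

/// Stand-in for one page of VMM rows from CRDB; a real implementation
/// would take a pagination marker and run a paginated query.
fn fetch_page(offset: usize) -> Vec<u32> {
    let total = 40u32;
    (offset as u32..total).take(BATCH_SIZE).collect()
}

async fn check_one(vmm_id: u32) {
    // Stand-in for a health check (or for marking a VMM on an expunged
    // sled as Failed, which kicks off an instance-update saga).
    tokio::time::sleep(std::time::Duration::from_millis(5)).await;
    println!("checked VMM {vmm_id}");
}

#[tokio::main]
async fn main() {
    let mut offset = 0;
    loop {
        let page = fetch_page(offset);
        if page.is_empty() {
            break;
        }
        offset += page.len();

        // Run checks for this page concurrently...
        let mut tasks = JoinSet::new();
        for vmm_id in page {
            tasks.spawn(check_one(vmm_id));
        }
        // ...but don't read the next page until they have all finished,
        // so at most BATCH_SIZE checks are ever in flight.
        while tasks.join_next().await.is_some() {}
    }
}
```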
Currently, the `instance-watcher` background task will increment the
`InstanceState::Failed` metrics on a number of errors that *don't*
transition the VMM record to `Failed`. This is wrong and we shouldn't do
that.

When a sled is expunged, any VMMs previously believed to be on that sled
are definitely no longer present. Currently, expunging a sled will not
do anything to VMMs resident on that sled, and the instances will remain
in the `Running` state...but it won't be possible to stop them, as their
sled-agent will probably no longer be reachable.

This commit updates the `instance-watcher` background task to mark any
VMMs on `Expunged` sleds as `Failed`, so that those VMMs' instances can
transition to `Failed` and be restarted or deleted normally. It seems
better to do this in the `instance-watcher` background task, rather than
in the code paths for actually marking a sled as expunged, because
transitioning a VMM to failed triggers an instance-update saga to move
the instance to failed, and doing all of that in the transaction that
expunges the sled seems unfortunate --- expunging the sled would be a
much longer-running operation, and we'd have to be careful to not roll
it back if one update saga fails or whatever. So, instead, I think it's
nicer to just mark the sled as expunged and then let the background task
do the remaining work. And, the `instance-watcher` background task
already queries for all instances *and their associated sleds*, so it
makes sense to do the transition there rather than in the
`instance-updater` background task, since we'd have to add an additional
JOIN to look at sleds, which seems unfortunate. Also, this way,
VMMs that are marked as `Failed` due to sled expungement can also be
reflected in the metrics, which seems nice.

This was, overall, pretty straightforward: I've just removed the
`SledFilter::InService` from the `instance_and_vmm_list_by_sled_agent`
query,[^1] passed the `Sled` into the `check_instance` function, and
added a code path to skip the actual HTTP health check if the sled is
`Expunged` and just go ahead and mark the VMM as `Failed`. I think it
still makes sense to spawn a task for these VMMs, even though they don't
actually do HTTP requests to the sled-agent, because they will attempt
to run a whole `instance-update` saga to transition the instance to
`Failed`, which can take a bit of time. It seems nice to just keep
starting other health checks while that runs, instead of doing it
sequentially.

[^1]: Apparently there were actually *two* separate
    `SledFilter::InService` calls, which tripped me up for longer than
    I would like to admit...
@hawkw force-pushed the eliza/fail-expunged-vmms branch from 2fae232 to 1c68349 on September 9, 2024 21:43
@hawkw enabled auto-merge (squash) on September 9, 2024 22:14
@hawkw merged commit ec0d88d into main on Sep 9, 2024
16 checks passed
@hawkw deleted the eliza/fail-expunged-vmms branch on September 9, 2024 23:15
@davepacheco added this to the 11 milestone on Sep 24, 2024