Skip to content

Commit

Permalink
[nexus] Fix pagination in instance-by-sled query
Browse files Browse the repository at this point in the history
The `instance_and_vmm_list_by_sled_agent` query in `nexus-db-queries`
returns a list of all instances, their projects, their active VMMs, and
the sleds those active VMMs are running on. This is used by the
`instance_watcher` background task to find VMMs to perform health checks
for. The query is paginated by sled IDs to limit the amount of data
fetched in one query, and so that the returned VMMs are grouped by sled,
allowing the `instance_watcher` task to reuse the same sled-agent client
connection pool for all VMMs on a sled.

Unfortunately, the current pagination is a bit wrong, as @davepacheco
kindly pointed out to me in [this comment][1]:

> This is paginated by just `sled_id`, but that doesn't quite work. The
> pagination marker must be unique for each row. I think you'll want to
> use `paginated_multicolumn` with another id here, maybe either the
> instance id or vmm id? As is, I'd expect this code to break when the
> list of instances returned for a particular sled spans a pagination
> boundary. (What I'd expect would happen there is: we process each page
> fine up through the first page of results containing that sled id in
> it, then we use that sled id as the marker, and the subsequent query
> would fetch results with sled ids `>=` the one given, and so you'd
> miss all the rows for that sled id that weren't on the first page of
> results for that sled id.) I was surprised to see that this is not
> documented on either `Paginator::found_batch()` nor `paginated()` :(

The only reason we haven't encountered this bug yet is that currently,
the batch size for the query is fairly high. However, PR #6527 reduces
the batch size substantially to serve as a limit on the number of
concurrently in flight health checks, meaning we'll see some VMMs not
being health checked any time a sled has more than 16 VMMs on it.

This commit fixes the query by changing it to use
`paginated_multicolumn` to paginate the results by both the sled's ID
*and* the VMM's ID. This is made possible by #6530, which changed
`paginated_multicolumn` to allow the query fragment that's paginated to
be a `FROM` clause that's a `JOIN`, in addition to a single table ---
because we must join the `sled` and `vmm` tables in order to get both
IDs. This change required restructuring the order in which the query is
constructed, due to Diesel Reasons, so I took the liberty to add some
comments while I was moving stuff around.

[1]: #6519 (review)
  • Loading branch information
hawkw committed Sep 6, 2024
1 parent 9c81109 commit 2b18c6a
Show file tree
Hide file tree
Showing 3 changed files with 276 additions and 47 deletions.
Loading

0 comments on commit 2b18c6a

Please sign in to comment.