[nexus] Add `instance_watcher` concurrency limit (#6527)
The `instance_watcher` background task follows a pattern where it queries the database for instances and then spawns a big pile of Tokio tasks to concurrently perform health checks for those instances. As suggested by @davepacheco in [this comment][1], there should probably be a limit on the number of concurrently running health checks, to avoid clobbering the sled-agents with a giant pile of HTTP requests.

This branch sets a global concurrency limit of 16 health checks (which is fairly conservative, but we can turn it up later if it turns out to be a bottleneck). The concurrency limit is implemented using the database query's batch size. Previously, this code was written in a slightly weird dual-loop structure, which was intended specifically to *avoid* the size of the database query batch acting as a concurrency limit: we would read a page of sleds from CRDB, spawn a bunch of health check tasks, and then read the next batch, waiting for the tasks to complete only once all instance records had been read from the database. Now, we can implement a concurrency limit by just...not doing that. We now wait to read the next page of query results until we've run health checks for every instance in the batch, which limits the number of concurrently in-flight checks to the batch size.

This has a nice advantage over the naïve approach of using a `tokio::sync::Semaphore` (or similar) as the concurrency limit, where each health check task must acquire a permit before proceeding: it also bounds the amount of Nexus' memory used by the instance watcher. If we spawned all the tasks immediately but made them wait to acquire a semaphore permit, there would be a bunch of tasks in memory sitting around doing nothing until the currently in-flight tasks completed. With the batch-size-as-concurrency-limit approach, we can instead avoid spawning those tasks at all (and avoid reading stuff from CRDB until we actually need it).

[1]: #6519 (review)
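To make the batch-as-concurrency-limit idea concrete, here is a minimal Rust sketch. It is not the Nexus code. It uses standard-library scoped threads (`std::thread::scope`) as a stand-in for Tokio tasks so that it runs without external crates. `fetch_page`, `check_instance`, and `Gauge` are hypothetical helpers standing in for the paginated CRDB query, the sled-agent HTTP health check, and a counter that records peak concurrency. Only the limit of 16 comes from the commit.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::Duration;

/// Concurrency limit, expressed as the page size of the (simulated) query.
/// The value 16 is from the commit; everything else here is hypothetical.
const BATCH_SIZE: usize = 16;

/// Hypothetical stand-in for one page of a paginated CRDB query over instance IDs.
fn fetch_page(all: &[u32], offset: usize, limit: usize) -> &[u32] {
    let start = offset.min(all.len());
    let end = (offset + limit).min(all.len());
    &all[start..end]
}

/// Hypothetical helper that records how many checks run at once.
struct Gauge {
    current: AtomicUsize,
    peak: AtomicUsize,
}

impl Gauge {
    fn new() -> Self {
        Gauge { current: AtomicUsize::new(0), peak: AtomicUsize::new(0) }
    }
    fn enter(&self) {
        let now = self.current.fetch_add(1, Ordering::SeqCst) + 1;
        self.peak.fetch_max(now, Ordering::SeqCst);
    }
    fn exit(&self) {
        self.current.fetch_sub(1, Ordering::SeqCst);
    }
    fn peak(&self) -> usize {
        self.peak.load(Ordering::SeqCst)
    }
}

/// Hypothetical health check; a real one would send an HTTP request to a
/// sled-agent. Here every 7th instance is reported as unhealthy.
fn check_instance(id: u32, gauge: &Gauge) -> bool {
    gauge.enter();
    thread::sleep(Duration::from_millis(2));
    gauge.exit();
    id % 7 != 0
}

/// Checks every instance one page at a time. The next page is not fetched
/// until every check in the current page has finished, so at most
/// `batch_size` checks (and worker threads) exist at any moment.
fn watch_all(all: &[u32], batch_size: usize, gauge: &Gauge) -> Vec<(u32, bool)> {
    let mut results = Vec::with_capacity(all.len());
    let mut offset = 0;
    loop {
        let page = fetch_page(all, offset, batch_size);
        if page.is_empty() {
            break;
        }
        // Spawn one worker per instance in this page only. `thread::scope`
        // joins them all before returning, which is what bounds concurrency.
        thread::scope(|s| {
            let handles: Vec<_> = page
                .iter()
                .map(|&id| s.spawn(move || (id, check_instance(id, gauge))))
                .collect();
            for h in handles {
                results.push(h.join().expect("health check panicked"));
            }
        });
        offset += page.len();
    }
    results
}

fn main() {
    let instances: Vec<u32> = (1..=100).collect();
    let gauge = Gauge::new();
    let results = watch_all(&instances, BATCH_SIZE, &gauge);
    let unhealthy = results.iter().filter(|(_, ok)| !*ok).count();
    println!(
        "checked {} instances, {} unhealthy, peak concurrency {}",
        results.len(),
        unhealthy,
        gauge.peak()
    );
}
```

Each page's workers are all joined before the next page is read. As a result, at most `BATCH_SIZE` workers exist at once, and rows beyond the current page are never fetched early. A semaphore alone would not give this memory bound, because every task would already be spawned and waiting. One trade-off of this shape is that a single slow check delays reading the next page.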
1 changed file with 100 additions and 71 deletions.