-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update virtual provisioning counters on instance stop/start #4277
Conversation
…elete Per the discussion in RFD 427, only charge virtual provisioning collections for instances when those instances are running. Charges are taken in the instance start saga and dropped when a sled agent tries to transition an instance to a stopped state. Unlike sled resource charges, provisioning charges are tied to instances, not to VMMs. This ensures that a user is not charged twice for an instance (e.g. for quota purposes) while the instance is migrating. Tests: updated integration tests; manually checked virtual provisioning table values in a dev cluster & checked the values on the utilization graphs.
Huge, yet small! This is awesome. |
// NOTE: I think it's arguably "more correct" to identify that the | ||
// number of CPUs being used by guests at this point is actually "0", | ||
// not "4", because the instance is stopped (same re: RAM usage). | ||
// | ||
// However, for implementation reasons, this is complicated (we have a | ||
// tendency to update the runtime without checking the prior state, which | ||
// makes edge-triggered behavior trickier to notice). | ||
|
||
let virtual_provisioning_collection = datastore | ||
.virtual_provisioning_collection_get(&opctx, project_id) | ||
.await | ||
.unwrap(); | ||
let expected_cpus = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So stoked to see this get cleaned up.
&& new_runtime_state.instance_state.gen | ||
>= db_instance.runtime().gen.0 | ||
{ | ||
self.db_datastore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is made based off of new_runtime_state
(from the sled agent) as well as db_instance.runtime()
(from the DB), but both are cached -- couldn't both be out-of-date?
For example:
- External API: Create instance, start it. Begin the call to "stop", which should send a request to stop down to the sled.
- Sled Agent -> Nexus: Calls this function, indicating the instance should be stopped. Let's suppose sled agent thinks gen == N, the db thinks gen == N. But maybe calling
ensure_updated_instance_network_config
yields for a really long time. - Sled Agent -> Nexus: Calls this function again, from a retry-loop, but this time it succeeds. The instance is now stopped.
- External API: Start the instance again. This re-provisions the resources used by the instance...
- ... and, while the instance is running, the request from (2) comes back. It'll see a state of the world from before (3) and (4), so it'll call the
virtual_provisioning_collection_delete_instance
function. Now the accounting is wrong! This is a running instance, but we would have deleted the resources it's currently using.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is related to what I was touching on here: #4194 (comment)
If we're associating these resources with VMMs (AKA, "what is actually running") rather than Instances (the opaque thing that exists even without a propolis), should we alter the calls to virtual_provisioning_collection_{create, delete}_instance
to act on VMM UUIDs instead, rather than instance UUIDs?
That way, it wouldn't be possible to do a "double delete" -- once a VMM is gone, the UUID for it should never be re-used/re-added, so repeated/delayed requests to delete should have no further effect after the first successful call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Let me think about this one a little more--the race you've identified is definitely a problem. I'll mull it over to see if there's a good way to resolve it without dipping into transaction-land (such that we could say "only proceed with the deletion if the instance generation number is less than the one I think I'm trying to apply" or suchlike).
It's true that VMMs don't share that problem because they can't come back from a terminal state, but I want to be careful about charging counters on a per-VMM basis if those counters are going to be used for quota management purposes (which AIUI is part of the discussion in RFD 427). For example, suppose I have a quota of 32 vCPUs on my project; I start an instance with 16 vCPUs; the rack operator starts an upgrade that migrates my instance; then I try to start a second instance with 16 vCPUs; if we track per-VMM, this will fail because my quota's exhausted--but what do you mean it's exhausted? I only have the one running instance! (cc @askfongjojo to help weigh in on this part; if we're OK with charging on a per-VMM basis even though this can cause transient double-counting, then switching to VMM IDs is a much simpler option here.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 16b7fe6 (or so I think...) by putting a sub-select into the instance provisioning counter deletion query that makes it apply only if the caller has furnished a sufficiently new generation number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Transient double-counting is probably fine if it's indeed transient (i.e. seconds, not minutes). But if we foresee failure modes that can cause in-progress migration to double count for an extended period of time, it'll be better for the system rather than the user to "absorb" the migration-in-transit usage.
builder.append(add_virtual_resources_action()); | ||
builder.append(create_vmm_record_action()); | ||
builder.append(mark_as_starting_action()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ordering is wrong. The saga should only add virtual resources once it's successfully marked the instance as starting, i.e., it should get through the "only one start can proceed at a time" interlock before charging for anything. Will fix in the next push.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 16b7fe6.
Add a clause to the provisioning counter deletion query that checks (atomically via sub-selection) that the instance of interest has not advanced past a particular state generation number. This prevents a TOCTTOU bug that can cause a record to be deleted when an instance is running: - Sled agent tries to send an update stopping an instance; this gets stuck and the attempt times out - Sled agent tries again and the attempt succeeds; this deletes the provisioning counters and allows the instance to start somewhere else - The instance indeed starts somewhere else, taking new charges - The original attempt finally makes progress again and proceeds to delete the newly-allocated charges Also fix an ordering bug in the start saga: provisioning counters should only be charged after the saga has passed through the "only one start attempt at a time" interlock. While the old ordering didn't allow counters to be leaked (parallel start sagas that lose the race to set the instance's Propolis ID will unwind, which would have released the counters), this did allow multiple parallel start sagas to take multiple charges for the same instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update - plumbing a sled-provided generation number seems like a valid way to mitigate the race I identified.
I've merged the latest main and retested on a dev cluster, and things still look OK, so I'm going to go ahead and land this one. |
Only charge virtual provisioning collections for instances when those instances are running. Charges are taken in the instance start saga and dropped when a sled agent tries to transition an instance to a stopped state. Unlike sled resource charges, provisioning charges are tied to instance states, not to VMM lifetimes. This ensures that a user is not charged twice for an instance (e.g. for quota management purposes) while the instance is migrating. See RFD 427 for more discussion.
Also fix a small idempotency issue in the cleanup path for VMM resources.
Tests: updated integration tests; manually checked virtual provisioning table values in a dev cluster & checked the values on the utilization graphs.
Fixes #4257.