
Update virtual provisioning counters on instance stop/start #4277

Merged
gjcolombo merged 3 commits into main from gjcolombo/instance-provisioning on Oct 20, 2023

Conversation

gjcolombo
Contributor

Only charge virtual provisioning collections for instances when those instances are running. Charges are taken in the instance start saga and dropped when a sled agent tries to transition an instance to a stopped state. Unlike sled resource charges, provisioning charges are tied to instance states, not to VMM lifetimes. This ensures that a user is not charged twice for an instance (e.g. for quota management purposes) while the instance is migrating. See RFD 427 for more discussion.
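
A minimal in-memory sketch of the policy described above (the names and types are illustrative, not the actual Nexus datastore API; the real counters live in the database and are updated by the start saga and the stop path): because the charge is keyed by instance ID, a migration that briefly runs two VMMs for the same instance never holds more than one charge.

use std::collections::HashMap;

// Illustrative sketch only -- the real counters live in CockroachDB and are
// manipulated through the Nexus datastore, not an in-memory map.
type InstanceId = u64;

#[derive(Default)]
struct VirtualProvisioningCounters {
    // instance ID -> (vCPUs, RAM in bytes) currently charged
    charges: HashMap<InstanceId, (u32, u64)>,
}

impl VirtualProvisioningCounters {
    // Taken in the instance start saga.
    fn charge(&mut self, instance: InstanceId, cpus: u32, ram: u64) {
        // Keyed by instance, so a second VMM created for a migrating
        // instance does not add a second charge.
        self.charges.entry(instance).or_insert((cpus, ram));
    }

    // Dropped when sled agent reports that the instance has stopped.
    fn release(&mut self, instance: InstanceId) {
        self.charges.remove(&instance);
    }
}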

Also fix a small idempotency issue in the cleanup path for VMM resources.

Tests: updated integration tests; manually checked virtual provisioning table values in a dev cluster & checked the values on the utilization graphs.

Fixes #4257.

…elete

Per the discussion in RFD 427, only charge virtual provisioning collections for
instances when those instances are running. Charges are taken in the instance
start saga and dropped when a sled agent tries to transition an instance to a
stopped state.

Unlike sled resource charges, provisioning charges are tied to instances, not
to VMMs. This ensures that a user is not charged twice for an instance (e.g.
for quota purposes) while the instance is migrating.

Tests: updated integration tests; manually checked virtual provisioning table
values in a dev cluster & checked the values on the utilization graphs.
@gjcolombo gjcolombo requested a review from smklein October 13, 2023 18:27
@david-crespo
Contributor

Huge, yet small! This is awesome.

Comment on lines -965 to -976
// NOTE: I think it's arguably "more correct" to identify that the
// number of CPUs being used by guests at this point is actually "0",
// not "4", because the instance is stopped (same re: RAM usage).
//
// However, for implementation reasons, this is complicated (we have a
// tendency to update the runtime without checking the prior state, which
// makes edge-triggered behavior trickier to notice).

let virtual_provisioning_collection = datastore
.virtual_provisioning_collection_get(&opctx, project_id)
.await
.unwrap();
let expected_cpus = 4;
Collaborator


So stoked to see this get cleaned up.

&& new_runtime_state.instance_state.gen
>= db_instance.runtime().gen.0
{
self.db_datastore
Collaborator


This check is based on new_runtime_state (from the sled agent) as well as db_instance.runtime() (from the DB), but both are cached -- couldn't both be out of date?

For example:

  1. External API: Create instance, start it. Begin the call to "stop", which should send a request to stop down to the sled.
  2. Sled Agent -> Nexus: Calls this function, indicating the instance should be stopped. Let's suppose sled agent thinks gen == N, the db thinks gen == N. But maybe calling ensure_updated_instance_network_config yields for a really long time.
  3. Sled Agent -> Nexus: Calls this function again, from a retry-loop, but this time it succeeds. The instance is now stopped.
  4. External API: Start the instance again. This re-provisions the resources used by the instance...
  5. ... and, while the instance is running, the request from (2) comes back. It'll see a state of the world from before (3) and (4), so it'll call the virtual_provisioning_collection_delete_instance function. Now the accounting is wrong! This is a running instance, but we would have deleted the resources it's currently using.

Collaborator


I think this is related to what I was touching on here: #4194 (comment)

If we're associating these resources with VMMs (AKA, "what is actually running") rather than Instances (the opaque thing that exists even without a propolis), should we alter the calls to virtual_provisioning_collection_{create, delete}_instance to act on VMM UUIDs instead, rather than instance UUIDs?

That way, it wouldn't be possible to do a "double delete" -- once a VMM is gone, the UUID for it should never be re-used/re-added, so repeated/delayed requests to delete should have no further effect after the first successful call.
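
A toy illustration of that idempotency property (not omicron code, just the shape of the argument): when the charge is keyed by a VMM ID that is never reused, a delayed or retried delete simply finds nothing to remove.

use std::collections::HashMap;

type VmmId = u64;

// Returns true only for the first successful removal; stale or retried
// calls are no-ops because a deleted VMM's ID is never reused.
fn release_by_vmm(charges: &mut HashMap<VmmId, (u32, u64)>, vmm: VmmId) -> bool {
    charges.remove(&vmm).is_some()
}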

Contributor Author


Hmm. Let me think about this one a little more--the race you've identified is definitely a problem. I'll mull it over to see if there's a good way to resolve it without dipping into transaction-land (such that we could say "only proceed with the deletion if the instance generation number is less than the one I think I'm trying to apply" or suchlike).

It's true that VMMs don't share that problem because they can't come back from a terminal state, but I want to be careful about charging counters on a per-VMM basis if those counters are going to be used for quota management purposes (which AIUI is part of the discussion in RFD 427). For example, suppose I have a quota of 32 vCPUs on my project; I start an instance with 16 vCPUs; the rack operator starts an upgrade that migrates my instance; then I try to start a second instance with 16 vCPUs; if we track per-VMM, this will fail because my quota's exhausted--but what do you mean it's exhausted? I only have the one running instance! (cc @askfongjojo to help weigh in on this part; if we're OK with charging on a per-VMM basis even though this can cause transient double-counting, then switching to VMM IDs is a much simpler option here.)

Contributor Author

@gjcolombo gjcolombo Oct 13, 2023


Fixed in 16b7fe6 (or so I think...) by putting a sub-select into the instance provisioning counter deletion query that makes it apply only if the caller has furnished a sufficiently new generation number.
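
A rough sketch of the shape of that guard (illustrative SQL with guessed table and column names, not the actual Diesel query from 16b7fe6): the sub-select compares the caller-supplied generation number against the instance's current state generation, so a stale caller's delete matches no rows.

// Hypothetical sketch of the guarded deletion; the real query is built with
// Diesel in the Nexus datastore and its names may differ.
const DELETE_INSTANCE_PROVISIONING_GUARDED: &str = "
    DELETE FROM virtual_provisioning_resource
    WHERE id = $1           -- instance ID
      AND $2 >=             -- caller-supplied generation number
          (SELECT state_generation FROM instance WHERE id = $1)
";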


Transient double-counting is probably fine if it's indeed transient (i.e. seconds, not minutes). But if we foresee failure modes that can cause in-progress migration to double count for an extended period of time, it'll be better for the system rather than the user to "absorb" the migration-in-transit usage.

Comment on lines 104 to 106
builder.append(add_virtual_resources_action());
builder.append(create_vmm_record_action());
builder.append(mark_as_starting_action());
Contributor Author


This ordering is wrong. The saga should only add virtual resources once it's successfully marked the instance as starting, i.e., it should get through the "only one start can proceed at a time" interlock before charging for anything. Will fix in the next push.
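
One ordering consistent with that description, reusing the action names from the snippet above (a sketch only; the actual reordering landed in a later push and may differ in detail):

// Pass the "only one start can proceed at a time" interlock first...
builder.append(mark_as_starting_action());
// ...and only then take the provisioning charge and create the VMM record.
builder.append(add_virtual_resources_action());
builder.append(create_vmm_record_action());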

Contributor Author


Fixed in 16b7fe6.

Add a clause to the provisioning counter deletion query that checks (atomically
via sub-selection) that the instance of interest has not advanced past a
particular state generation number. This prevents a TOCTTOU bug that can cause
a record to be deleted when an instance is running:

- Sled agent tries to send an update stopping an instance; this gets stuck and
  the attempt times out
- Sled agent tries again and the attempt succeeds; this deletes the provisioning
  counters and allows the instance to start somewhere else
- The instance indeed starts somewhere else, taking new charges
- The original attempt finally makes progress again and proceeds to delete the
  newly-allocated charges

Also fix an ordering bug in the start saga: provisioning counters should only
be charged after the saga has passed through the "only one start attempt at a
time" interlock. While the old ordering didn't allow counters to be leaked
(parallel start sagas that lose the race to set the instance's Propolis ID will
unwind, which would have released the counters), this did allow multiple
parallel start sagas to take multiple charges for the same instance.
@gjcolombo gjcolombo requested a review from smklein October 13, 2023 23:52
Collaborator

@smklein smklein left a comment


Thanks for the update - plumbing a sled-provided generation number seems like a valid way to mitigate the race I identified.

@gjcolombo
Contributor Author

I've merged the latest main and retested on a dev cluster, and things still look OK, so I'm going to go ahead and land this one.

@gjcolombo gjcolombo merged commit 9191af6 into main Oct 20, 2023
21 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/instance-provisioning branch October 20, 2023 16:57
Successfully merging this pull request may close these issues.

Move virtual provisioning accounting for instances to instance start saga