
Update virtual provisioning counters on instance stop/start #4277

Merged
gjcolombo merged 3 commits into main from gjcolombo/instance-provisioning on Oct 20, 2023

Conversation

gjcolombo
Contributor

Only charge virtual provisioning collections for instances when those instances are running. Charges are taken in the instance start saga and dropped when a sled agent tries to transition an instance to a stopped state. Unlike sled resource charges, provisioning charges are tied to instance states, not to VMM lifetimes. This ensures that a user is not charged twice for an instance (e.g. for quota management purposes) while the instance is migrating. See RFD 427 for more discussion.
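
A minimal in-memory sketch of the policy described above (the names and types are illustrative, not the actual Nexus datastore API; the real counters live in the database and are updated by the start saga and the stop path): because the charge is keyed by instance ID, a migration that briefly runs two VMMs for the same instance never holds more than one charge.

use std::collections::HashMap;

// Illustrative sketch only -- the real counters live in CockroachDB and are
// manipulated through the Nexus datastore, not an in-memory map.
type InstanceId = u64;

#[derive(Default)]
struct VirtualProvisioningCounters {
    // instance ID -> (vCPUs, RAM in bytes) currently charged
    charges: HashMap<InstanceId, (u32, u64)>,
}

impl VirtualProvisioningCounters {
    // Taken in the instance start saga.
    fn charge(&mut self, instance: InstanceId, cpus: u32, ram: u64) {
        // Keyed by instance, so a second VMM created for a migrating
        // instance does not add a second charge.
        self.charges.entry(instance).or_insert((cpus, ram));
    }

    // Dropped when sled agent reports that the instance has stopped.
    fn release(&mut self, instance: InstanceId) {
        self.charges.remove(&instance);
    }
}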

Also fix a small idempotency issue in the cleanup path for VMM resources.

Tests: updated integration tests; manually checked virtual provisioning table values in a dev cluster & checked the values on the utilization graphs.

Fixes #4257.

…elete

Per the discussion in RFD 427, only charge virtual provisioning collections for
instances when those instances are running. Charges are taken in the instance
start saga and dropped when a sled agent tries to transition an instance to a
stopped state.

Unlike sled resource charges, provisioning charges are tied to instances, not
to VMMs. This ensures that a user is not charged twice for an instance (e.g.
for quota purposes) while the instance is migrating.

Tests: updated integration tests; manually checked virtual provisioning table
values in a dev cluster & checked the values on the utilization graphs.
@gjcolombo gjcolombo requested a review from smklein October 13, 2023 18:27
@david-crespo
Contributor

Huge, yet small! This is awesome.

Comment on lines -965 to -976
// NOTE: I think it's arguably "more correct" to identify that the
// number of CPUs being used by guests at this point is actually "0",
// not "4", because the instance is stopped (same re: RAM usage).
//
// However, for implementation reasons, this is complicated (we have a
// tendency to update the runtime without checking the prior state, which
// makes edge-triggered behavior trickier to notice).

let virtual_provisioning_collection = datastore
.virtual_provisioning_collection_get(&opctx, project_id)
.await
.unwrap();
let expected_cpus = 4;
Collaborator


So stoked to see this get cleaned up.

&& new_runtime_state.instance_state.gen
>= db_instance.runtime().gen.0
{
self.db_datastore
Collaborator


This check is based on new_runtime_state (from the sled agent) as well as db_instance.runtime() (from the DB), but both are cached -- couldn't both be out of date?

For example:

  1. External API: Create instance, start it. Begin the call to "stop", which should send a request to stop down to the sled.
  2. Sled Agent -> Nexus: Calls this function, indicating the instance should be stopped. Let's suppose sled agent thinks gen == N, the db thinks gen == N. But maybe calling ensure_updated_instance_network_config yields for a really long time.
  3. Sled Agent -> Nexus: Calls this function again, from a retry-loop, but this time it succeeds. The instance is now stopped.
  4. External API: Start the instance again. This re-provisions the resources used by the instance...
  5. ... and, while the instance is running, the request from (2) comes back. It'll see a state of the world from before (3) and (4), so it'll call the virtual_provisioning_collection_delete_instance function. Now the accounting is wrong! This is a running instance, but we would have deleted the resources it's currently using.

Collaborator


I think this is related to what I was touching on here: #4194 (comment)

If we're associating these resources with VMMs (AKA, "what is actually running") rather than Instances (the opaque thing that exists even without a propolis), should we alter the calls to virtual_provisioning_collection_{create, delete}_instance to act on VMM UUIDs instead, rather than instance UUIDs?

That way, it wouldn't be possible to do a "double delete" -- once a VMM is gone, the UUID for it should never be re-used/re-added, so repeated/delayed requests to delete should have no further effect after the first successful call.
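
A toy illustration of that idempotency property (not omicron code, just the shape of the argument): when the charge is keyed by a VMM ID that is never reused, a delayed or retried delete simply finds nothing to remove.

use std::collections::HashMap;

type VmmId = u64;

// Returns true only for the first successful removal; stale or retried
// calls are no-ops because a deleted VMM's ID is never reused.
fn release_by_vmm(charges: &mut HashMap<VmmId, (u32, u64)>, vmm: VmmId) -> bool {
    charges.remove(&vmm).is_some()
}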

Contributor Author


Hmm. Let me think about this one a little more--the race you've identified is definitely a problem. I'll mull it over to see if there's a good way to resolve it without dipping into transaction-land (such that we could say "only proceed with the deletion if the instance generation number is less than the one I think I'm trying to apply" or suchlike).

It's true that VMMs don't share that problem because they can't come back from a terminal state, but I want to be careful about charging counters on a per-VMM basis if those counters are going to be used for quota management purposes (which AIUI is part of the discussion in RFD 427). For example, suppose I have a quota of 32 vCPUs on my project; I start an instance with 16 vCPUs; the rack operator starts an upgrade that migrates my instance; then I try to start a second instance with 16 vCPUs; if we track per-VMM, this will fail because my quota's exhausted--but what do you mean it's exhausted? I only have the one running instance! (cc @askfongjojo to help weigh in on this part; if we're OK with charging on a per-VMM basis even though this can cause transient double-counting, then switching to VMM IDs is a much simpler option here.)

Contributor Author

@gjcolombo gjcolombo Oct 13, 2023


Fixed in 16b7fe6 (or so I think...) by putting a sub-select into the instance provisioning counter deletion query that makes it apply only if the caller has furnished a sufficiently new generation number.
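
A rough sketch of the shape of that guard (illustrative SQL with guessed table and column names, not the actual Diesel query from 16b7fe6): the sub-select compares the caller-supplied generation number against the instance's current state generation, so a stale caller's delete matches no rows.

// Hypothetical sketch of the guarded deletion; the real query is built with
// Diesel in the Nexus datastore and its names may differ.
const DELETE_INSTANCE_PROVISIONING_GUARDED: &str = "
    DELETE FROM virtual_provisioning_resource
    WHERE id = $1           -- instance ID
      AND $2 >=             -- caller-supplied generation number
          (SELECT state_generation FROM instance WHERE id = $1)
";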


Transient double-counting is probably fine if it's indeed transient (i.e. seconds, not minutes). But if we foresee failure modes that can cause in-progress migration to double count for an extended period of time, it'll be better for the system rather than the user to "absorb" the migration-in-transit usage.

Comment on lines 104 to 106
builder.append(add_virtual_resources_action());
builder.append(create_vmm_record_action());
builder.append(mark_as_starting_action());
Contributor Author


This ordering is wrong. The saga should only add virtual resources once it's successfully marked the instance as starting, i.e., it should get through the "only one start can proceed at a time" interlock before charging for anything. Will fix in the next push.
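
One ordering consistent with that description, reusing the action names from the snippet above (a sketch only; the actual reordering landed in a later push and may differ in detail):

// Pass the "only one start can proceed at a time" interlock first...
builder.append(mark_as_starting_action());
// ...and only then take the provisioning charge and create the VMM record.
builder.append(add_virtual_resources_action());
builder.append(create_vmm_record_action());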

Contributor Author


Fixed in 16b7fe6.

Add a clause to the provisioning counter deletion query that checks (atomically
via sub-selection) that the instance of interest has not advanced past a
particular state generation number. This prevents a TOCTTOU bug that can cause
a record to be deleted when an instance is running:

- Sled agent tries to send an update stopping an instance; this gets stuck and
  the attempt times out
- Sled agent tries again and the attempt succeeds; this deletes the provisioning
  counters and allows the instance to start somewhere else
- The instance indeed starts somewhere else, taking new charges
- The original attempt finally makes progress again and proceeds to delete the
  newly-allocated charges

Also fix an ordering bug in the start saga: provisioning counters should only
be charged after the saga has passed through the "only one start attempt at a
time" interlock. While the old ordering didn't allow counters to be leaked
(parallel start sagas that lose the race to set the instance's Propolis ID will
unwind, which would have released the counters), this did allow multiple
parallel start sagas to take multiple charges for the same instance.
@gjcolombo gjcolombo requested a review from smklein October 13, 2023 23:52
Collaborator

@smklein smklein left a comment


Thanks for the update - plumbing a sled-provided generation number seems like a valid way to mitigate the race I identified.

@gjcolombo
Contributor Author

I've merged the latest main and retested on a dev cluster, and things still look OK, so I'm going to go ahead and land this one.

@gjcolombo gjcolombo merged commit 9191af6 into main Oct 20, 2023
21 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/instance-provisioning branch October 20, 2023 16:57
Successfully merging this pull request may close these issues.

Move virtual provisioning accounting for instances to instance start saga