Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add external API to resize a stopped instance #6321

Closed
wants to merge 6 commits into from

Conversation

gjcolombo
Copy link
Contributor

Add an API to change the CPU/memory sizes for an existing instance (provided it's halted--no CPU hot-add or hot-remove for now, sorry).

Putting this in draft for now since I still want to manually test it (and I'd also like to refine the integration tests a bit).

Fixes #3769.

let updated = diesel::update(dsl::instance)
.filter(dsl::time_deleted.is_null())
.filter(dsl::id.eq(instance_id))
.filter(dsl::state.eq_any(ALLOWED_RESIZE_STATES))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to ask if we should also check for the propolis_id being null here, but I suppose we can rely on the check constraint that the instance can only be in NoVmm if it's null. Maybe worth commenting here?

external_state
)))
} else {
Err(Error::internal_error("failed to resize instance"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this case only happens if the instance's time_deleted is not null. Maybe this should also be a "not found" error, rather than 500ing? I think we generally treat deleted objects the same as nonexistent objects, right?

.unwrap()
.parsed_body::<Instance>()
.unwrap();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also check that if we get the instance, then it has the expected shape.

)
.await?;

self.db_datastore.instance_fetch_with_vmm(opctx, &authz_instance).await
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refetch is kind of sketchy (the instance's shape can change again between the resize query and the fetch). We may not actually need it: since the query is conditional on the instance being halted (i.e. being in NoVmm, or possibly in Failed once the failed-instance rework lands), there should be no VMM to influence the instance's state, which means we can construct the correct external instance state without having to consult the VMM table to get the active VMM's state. But that's itself kind of sketchy: the rule everywhere else is that you use an InstanceAndActiveVmm to get an external-API Instance to return, and I think this would be the first exception to that rule.

Another, simpler option would be to have this API return 204 No Content on success. (The caller knows what the updated instance's CPU and memory sizes are because it just set them, and nothing else about the instance is different.) OTOH several other instance routines (stop, reboot, etc.) return the posterior state of the instance, so maybe it's better to be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 204 No Content approach makes less sense in view of the discussion below (where we've reached a consensus that this API should like other update APIs, which return the posterior states of the things being updated). I think it's probably reasonable to construct an InstanceAndActiveVmm in the relevant datastore routine and return it here (though this should have some explanatory comments as @hawkw mentioned above).

Copy link
Contributor

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to this! It'll be a nice addition to the tf provider as well :)

@@ -2378,6 +2378,62 @@
}
}
},
"/v1/instances/{instance}/resize": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason the resize action should be separated from other updateable fields (e.g. name) into it's own endpoint?

Other endpoints generally follow a pattern for "update" where all updateable fields of a resource are gathered into a single endpoint (e.g. VpcUpdate and ProjectUpdate) for consistency. This pattern makes it a bit easier to avoid endpoint explosion to keep the number of endpoints as small as possible and easier for our users to grok.

Would it be possible to change this endpoint to follow that pattern?

Copy link
Contributor

@david-crespo david-crespo Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s precedent here if you think of resize as an instance action like start or stop:

instance_start POST /v1/instances/{instance}/start
instance_stop POST /v1/instances/{instance}/stop

I think it is not necessarily surprising. We don’t have an instance update endpoint, so nobody can be confused about what goes in there vs. these dedicated ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could have a single instance-update endpoint, but we'd have to decide what to do about the distinction between fields that you can theoretically update on a running instance (name, description) vs. the ones that require the instance to be halted (CPUs, memory, hostname, public keys, user data, etc.). My initial inclination would be to say that all updates (even the "safe" ones) require the instance to be halted, and if it turns out that that's a huge headache for users, we can pivot to allowing updates on running instances if they only touch safe-when-running fields. (This is more complicated to do on the backend, but I don't think it's intractable.) WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps those two endpoints should be rolled up into a single update endpoint like the others? 🤔

It may be confusing for users to have some resources with endpoints that take an "update object" and others that have several endpoints to update different parts of a resource. It'd be nice to have some sort of consistency when possible. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @gjcolombo! Looks like we were typing at the same time and my answer above looks a bit weird now 😄

I think we could have a single instance-update endpoint, but we'd have to decide what to do about the distinction between fields that you can theoretically update on a running instance (name, description) vs. the ones that require the instance to be halted (CPUs, memory, hostname, public keys, user data, etc.). My initial inclination would be to say that all updates (even the "safe" ones) require the instance to be halted, and if it turns out that that's a huge headache for users, we can pivot to allowing updates on running instances if they only touch safe-when-running fields. (This is more complicated to do on the backend, but I don't think it's intractable.) WDYT?

Yeah! I think this is a good compromise. I don't see users changing the name or description of an instance often anyway

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see that working for the fields Greg lists, but not for start and stop and reboot. The problem with the latter is that they take time to complete asynchronously, with intermediate states (starting, stopping) before the final state, so the PUT can’t return you the resource in a state matching what you sent it (running, for example), which is how PUTs are supposed to work.

Regarding requiring the instance to be stopped for all updates, whether the field really needs it or not, I agree that’s fine and easier to understand as a user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see that working for the fields Greg lists, but not for start and stop and reboot. The problem with the latter is that they take time to complete asynchronously, with intermediate states (starting, stopping) before the final state, so the PUT can’t return you the resource in a state matching what you sent it (running, for example), which is how PUTs are supposed to work.

That makes sense, I'm OK with keeping start/stop separate then

iximeow added a commit that referenced this pull request Oct 1, 2024
)

the Omicron side of adding explicit boot order selection to instances
(counterpart to
[propolis#756](oxidecomputer/propolis#756)).

first, this extends `params::InstanceCreate` to take a new `boot_disk:
Option<params::InstanceDiskAttachment>`.

additionally, this adds a `PUT /v1/instances/{instance}` to update
instances. the only property that can be updated at the moment is
`boot_disk`, pick a new boot disk or unset it entirely. this also
partially subsumes #6321.

finally, this updates Omicron to reference a recent enough Propolis that
#756 is included.

a surprising discovery along the way: you can specify a disk to be
attached multiple times in `disks` today, when creating an instance, and
we're fine with it! this carries through with the new `boot_disk` field:
if you specify the disk as `boot_disk` and in `disks`, it is technically
listing the disk for attachment twice but this will succeed.
@iximeow
Copy link
Member

iximeow commented Oct 4, 2024

gonna close this in favor of the definitely-a-derived-work #6774. a lot of the conversation here helped influence how the instance update endpoint behaves, but otherwise i think i've carried over or resolved open items here in the process.

@iximeow iximeow closed this Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Want ability to resize the vcpu and memory of stopped instances
5 participants