Split instance state into Instance and VMM tables #4194

Merged: 59 commits into main from gjcolombo/your-vmm-table-is-ready, Oct 12, 2023

Conversation

gjcolombo
Contributor

Reviewer note: Regrettably but unavoidably, this PR is huge because so many things depend on the definition of an Instance in CRDB. I recommend reading individual commits and their descriptions to make things more manageable. The commits are arranged in more-or-less narrative order: database schema first, then supporting types, then sled agent, then Nexus DB queries, then Nexus instance functions, then sagas, then assorted bug fixes and bolt-tightening.


Refactor the definition of an Instance throughout the control plane so that an Instance is separate from the Vmms that incarnate it. This confers several advantages:

  • VMMs have their own state that sled agent can update without necessarily changing their instance's state. It's also possible to change an instance's active Propolis ID without separately updating a Propolis IP or current sled ID, since those values now live on the VMM record and follow the active Propolis ID automatically. This removes a great deal of complexity in sled agent, especially when live migrating an instance, and also simplifies the live migration saga considerably.
  • Resource reservations for instances have much clearer lifetimes: a reservation can be released when its VMM has moved to a terminal state. Nexus no longer has to reason about VMM lifetimes from changes to an instance's Propolis ID columns.
  • It's now possible for an Instance not to have an active Propolis at all! This allows an instance not to reserve sled resources when it's not running. It also allows an instance to stop and restart on a different sled.
  • It's also possible to get a history of an instance's VMMs for, e.g., zone bundle examination purposes ("my VMM had a problem two days ago but it went away when I stopped and restarted it; can you investigate?").

Rework callers throughout Nexus that depend on knowing an instance's current state and/or its current sled ID. In many cases (e.g. disk and NIC attach and detach), the relevant detail is whether the instance has an active Propolis; for simplicity, augment these checks with a "has an active Propolis ID" condition instead of trying to grab both instance and VMM states.
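
For illustration, the simplified check boils down to something like the following (a hypothetical helper sketched for this description, not the actual Nexus code):

```rust
use uuid::Uuid;

/// Hypothetical helper: virtual hardware such as disks and NICs can be
/// attached or detached only when the instance is in a stopped-like state
/// *and* has no active Propolis.
pub fn can_modify_hardware(instance_is_stopped: bool, active_propolis_id: Option<Uuid>) -> bool {
    instance_is_stopped && active_propolis_id.is_none()
}
```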

Known issues/remaining work

  • The virtual provisioning table is still updated only at instance creation/deletion time. Usage metrics that depend on this table might report strange and wonderful values if a user creates many more instances than can be started at one time.
  • Instances still use the generic "resource attachment" CTE to manage attaching and detaching disks. Previously these queries looked at instance states; now they look at an instance's state and whether it has an active Propolis, but not at the active Propolis's state. This will need to be revisited in the future to support disk hotplug.
  • handle_instance_put_result is still very aggressive about setting instances to the Failed state if sled agent returns errors other than invalid-request-flavored errors. I think we should reconsider this behavior, but this change is big enough as it is. I will file a TODO for this and update the new comments accordingly before this merges.
  • The new live migration logic is not tested yet and differs from the "two-table" TLA+ model in RFD 361. More work will be needed here before we can declare live migration fully ready for self-hosting.
  • It would be nice to have an omdb vmm command; for now I've just updated existing omdb commands to deal with the optionality of Propolises and sleds.

Tests

  • Unit/integration tests
  • On a single-machine dev cluster, created two instances and verified that:
    • The instances only have resource reservations while they're running (and they reserve reservoir space now)
    • The instances can reach each other over their internal and external IPs when they're both running (and can still reach each other even if you try to delete one while it's active)
    • scadm shows the appropriate IP mappings being added/deleted as the instances start/stop
    • The instances' serial consoles work as expected
    • Attaching a new disk to an instance is only possible if the instance is stopped
    • Disk snapshot succeeds when invoked on a running instance's attached disk
    • Deleting an instance detaches its disks
  • omicron-stress on a single-machine dev cluster ran for about an hour and created ~800 instances without any instances going to the Failed state (previously this would happen in the first 5-10 minutes)

Related issues & PRs

Fixes #2315. Fixes #3139. Fixes #3883. Fixes #3228. Fixes #3877 (all the logic of interest from that issue is in the instance start saga now, and that saga has more verbose logging statements). Fixes #2842 (networking state is now torn down on instance stop). Mitigates #3325 (by making sled agent more careful to remove instances from its instance manager before transitioning them to stopped states). Supersedes PR #3141 (though see caveat above about how well live migration has--or hasn't--been tested here).

See also #3742.

Create schema version 6.0.0, which defines the vmm table, updates the instance
table to refer to it, and updates the sled_instance view accordingly; update
dbinit.sql to match.

Slightly reorder the tests in the schema upgrade tests so that table constraints
are checked before CHECK constraints. I found the table constraints to be a
little more actionable than the CHECK constraints (they do a better job of
describing exactly which tables have moved around in the case where tests are
failing because the order of table definitions changed).

Add tips for adding tables, modifying views, and renaming columns to the schema
README.

Tested by running the schema integration tests. Basically nothing else will work
at this point because the rest of Nexus hasn't yet learned about the new schema.
Update the DB model crate to reflect changes in the previous commit:

- Rearrange the Instance type:
  - Move more-or-less static properties of an instance (shape, hostname, boot-
    on-fault discipline) into `struct Instance`.
  - Update `struct InstanceRuntimeState`: make Propolis IDs optional; remove
    sled IDs, Propolis IPs, and the Propolis generation number; remove the
    fields that moved into the main `Instance` body.
- Define the `Vmm` and `VmmRuntimeState` types (sketched after this list).
- Remove all the `From` impls to convert instance DB types into various
  Progenitorized types, external API types, and the like. These will be
  reintroduced in future commits as the new Nexus internal API types and sled
  agent API types come into being.
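
For orientation, the reshaped model types look roughly like the following. This is an illustrative sketch only; the field names and types approximate the shape described above rather than the exact `nexus-db-model` definitions.

```rust
use chrono::{DateTime, Utc};
use uuid::Uuid;

/// Static (or rarely changing) instance properties plus the runtime state.
pub struct Instance {
    pub id: Uuid,
    pub project_id: Uuid,
    // Shape, hostname, and boot-on-fault, promoted out of the runtime state.
    pub ncpus: i64,
    pub memory_bytes: i64,
    pub hostname: String,
    pub boot_on_fault: bool,
    pub runtime: InstanceRuntimeState,
}

/// Instance-table runtime columns: Propolis IDs are optional, and the sled ID,
/// Propolis IP, and Propolis generation number are gone (they live on the VMM).
pub struct InstanceRuntimeState {
    pub state: String, // the instance-table state column (renamed later in this PR)
    pub propolis_id: Option<Uuid>,     // active VMM, if any
    pub dst_propolis_id: Option<Uuid>, // migration-target VMM, if any
    pub migration_id: Option<Uuid>,
    pub gen: i64,
    pub time_updated: DateTime<Utc>,
}

/// One row per Propolis that incarnates (or incarnated) an instance.
pub struct Vmm {
    pub id: Uuid,
    pub instance_id: Uuid,
    pub sled_id: Uuid,
    pub propolis_ip: std::net::IpAddr,
    pub runtime: VmmRuntimeState,
}

/// VMM-table runtime columns, updated by sled agent.
pub struct VmmRuntimeState {
    pub state: String,
    pub gen: i64,
    pub time_updated: DateTime<Utc>,
}
```
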
Take `api::internal::nexus::InstanceRuntimeState` down to the studs and rebuild
it with only those runtime properties that are in the Instance table. Add an
accompanying `VmmRuntimeState` type and a `SledInstanceState` type that sled
agent will use to send runtime state updates back up to Nexus. Continue
temporarily removing From impls that are obsoleted by these changes.
Tweak `InstanceHardware` and `InstanceEnsureBody` to allow Nexus to supply
information that was moved out of `InstanceRuntimeState`. Return a
`SledInstanceState` instead of an `InstanceRuntimeState` from APIs that can
update an instance's state inline.
N.B. This is one of the larger functional changes on the sled agent side of this
change sequence and could use some extra attention.

Update the `InstanceStates` structure in `common/instance.rs` to track instance
and VMM runtime states separately, since these are now independent data types.
This structure also needs to track the ID of the VMM it's tracking so that it
can produce a `SledInstanceState` on demand to send back to Nexus. (Note that
this ID can no longer be inferred from the instance runtime state--see below.)

Significantly rework the algorithm for completing live migration. Previously,
migration sources couldn't transfer control to migration targets on observing
a successful migration, because instance records only contained the target
Propolis's ID and not its sled ID or server IP. Now, writing the new Propolis ID
updates the instance's "VMM pointer," which implicitly changes the sled ID and
Propolis IP as well.

Add logic to "retire" an active Propolis if sled agent observes a "destroyed" or
"failed" Propolis state and the Propolis in question is still the instance's
active VMM.

Update the unit tests to check instance and VMM states separately. Add checks
that verify that an instance's generation number changes if and only if one of
the fields protected by that generation number has changed.

This commit does not update the many callers of `InstanceStates` that now need to
adapt to its new API surface.
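
Roughly, the reworked structure pairs the two runtime states with the ID of the VMM being tracked so that it can report them together. The sketch below is illustrative, not the literal `common/instance.rs` code; the runtime-state types are stubbed out.

```rust
use uuid::Uuid;

// Stand-ins for the internal-API runtime-state types; fields elided.
#[derive(Clone)]
pub struct InstanceRuntimeState { /* instance-table runtime fields */ }
#[derive(Clone)]
pub struct VmmRuntimeState { /* VMM-table runtime fields */ }

/// One state report to Nexus: both runtime states plus the ID of the VMM the
/// report describes (this can no longer be inferred from the instance state).
pub struct SledInstanceState {
    pub instance_state: InstanceRuntimeState,
    pub vmm_state: VmmRuntimeState,
    pub propolis_id: Uuid,
}

/// Sled agent's in-memory tracking for one instance/VMM pair.
pub struct InstanceStates {
    instance: InstanceRuntimeState,
    vmm: VmmRuntimeState,
    propolis_id: Uuid,
}

impl InstanceStates {
    /// Snapshot the current state to publish back up to Nexus.
    pub fn sled_instance_state(&self) -> SledInstanceState {
        SledInstanceState {
            instance_state: self.instance.clone(),
            vmm_state: self.vmm.clone(),
            propolis_id: self.propolis_id,
        }
    }
}
```
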
This fixes most of the build issues in sled agent's Instance/InstanceInner
structs that were introduced by the previous commit. Like its predecessor it
contains some load-bearing logical changes:

- When a Propolis exits and sled agent decides to terminate its zone, the agent
  tears down the zone and removes the instance ticket *before* publishing any
  observations to Nexus, not after. This helps to ensure that Nexus won't be
  told that a VMM has vacated a sled until its Propolis zone is actually gone.
  This is needed to prevent issues where a sled stops an instance, tells Nexus
  that it's gone (when it isn't yet), and then is asked to restart the instance
  (by a new start request that landed on the same sled) when there's still a
  registered VMM.
- InstanceStates now provides a `terminate_rudely` operation that callers can
  invoke after rudely terminating a Propolis zone. This creates a fake Propolis
  observation that reflects the state that the VMM "would have had" if it had
  stopped cleanly.

  This operation is tricky when used on a migrating instance. It is possible for
  a rude termination to arrive when an instance has migrated successfully but
  neither sled agent has found out about the successful migration yet. Rather
  than try to prevent this, this change puts the onus on sled agent's callers to
  ensure that they terminate VMMs in "reverse order"--i.e., targets before
  sources--or to ensure that they only rudely terminate VMMs that are not part
  of an active live migration. Currently, rude termination is only used in
  migration and start sagas that follow this rule.
- Starting a new Propolis zone no longer publishes the "Starting" state to Nexus
  before beginning to set up the zone. This will no longer be needed in the
  Brave New World, because when starting a new VMM, Nexus will create that VMM
  record in the Starting state and point the instance to it--there is no need to
  transition from Creating to Starting.

`sled-agent/src/instance.rs` still needs more work to compile, but this is
mostly just reorganizing parameters to account for the fact that
`instance_ensure`'s arguments have changed.

The simulated instance module is totally on the floor at this point and will be
reworked in the next commit.
Fix up instance simulation to match the new behavior of `Instance` and
`InstanceInner` described in the previous commit. This takes some advantage of
sled agent's newfound ability to assume that Nexus will create VMMs in the
correct initial states when registering them with a sled.
These need some minor restructuring to handle the new shape of the
InstanceStates structure.

Do away with the `is_stopped` function for instance states; this isn't used
anywhere outside of tests, and the tests can just set their own expectations for
whether a VMM should pass through the Stopping state or not.
Now that all the actual sled agent internals are updated, fix the parameter
plumbing from the entry points to those internals.
Fix a couple of small test bugs. With this sled agent tests pass except for the
OpenAPI spec test.
An instance's logical state is now a function of the Instance record and its
associated active Vmm record, if one exists. Start adapting nexus-db-queries
code to this model. Specifically:

- Define an `InstanceAndActiveVmm` type and a conversion to an external instance
  description. Add and update instance fetch and list routines to yield this
  type.
- Update disk and network attachment queries that checked for stopped instances
  to also check that an instance has no active VMM.
- Update `vpc_resolve_to_sleds` to pick up sled IDs from the Vmm table instead
  of the Instance table.

Also tweak a little bit of Nexus instance management logic to use
`InstanceAndActiveVmm` as an intermediate type.
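
A sketch of the intermediate type and how an instance's logical state falls out of it (the names and the state enum are illustrative, not the exact nexus-db-queries definitions):

```rust
use uuid::Uuid;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum State { Creating, Starting, Running, Stopping, Stopped }

// Minimal stand-ins for the model types; only the fields used below.
pub struct Instance { pub state: State }
pub struct Vmm { pub sled_id: Uuid, pub state: State }

/// An instance paired with its active VMM, if it has one; the instance's
/// logical state is a function of both records.
pub struct InstanceAndActiveVmm {
    pub instance: Instance,
    pub vmm: Option<Vmm>,
}

impl InstanceAndActiveVmm {
    /// The externally visible state: the active VMM's state when one exists,
    /// otherwise whatever state Nexus last recorded for the instance.
    pub fn effective_state(&self) -> State {
        match &self.vmm {
            Some(vmm) => vmm.state,
            None => self.instance.state,
        }
    }
}
```
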
Sled agents now (or will soon) send Nexus updates in the form of a
`SledInstanceState` struct that contains updated instance and VMM state in a
single package. Write a CTE that updates an instance and VMM in a single
atomic statement to allow sled agent to assume that instance and VMM updates
will be applied atomically. See the comments in the new query module for more
details.
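
Conceptually, the query applies both generation-guarded writes in a single statement and reports which of them took effect. The shapes below are illustrative, not the actual query module's types:

```rust
/// Illustrative result for the combined write: each half records whether its
/// row was found and whether its update was applied, so callers can treat the
/// instance and VMM updates as one atomic outcome.
pub struct InstanceAndVmmUpdateResult {
    pub instance_updated: Option<bool>, // None => no such instance row
    pub vmm_updated: Option<bool>,      // None => no such VMM row
}

/// The guard each half of the statement enforces: a runtime-state row is only
/// replaced when the incoming state carries a newer generation number.
pub fn should_apply(current_gen: u64, incoming_gen: u64) -> bool {
    incoming_gen > current_gen
}
```
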
The network interface query module has a handcrafted subquery that tries to
reason about an instance's state using only the instance record. Fix this up to
also account for whether the instance has an active VMM. Also update the tests
in this module to paste a fake VMM ID into an instance in lieu of setting its
state to Running.

Fix other portions of the network interface test build that were broken by
changes to the instance record's format.

Add a `From` impl to convert api::internal::nexus::InstanceRuntimeState to
nexus_db_model::InstanceRuntimeState. This isn't technically needed by this
change (I misread a build error) but is going to be needed soon enough.
Instances don't always have sleds now. Adjust the signatures of test helpers
that return an instance's sled ID/sled agent client to return Option. Adjust
their callers to deal with it; for the most part, integration tests only care
about an instance's location when it's already started, so this mostly amounts
to adding an additional layer of expect() calls where warranted.
Apply some straightforward fixes to functions in nexus/src/app/instance.rs:

- In functions that need to return an instance's new state, return an
  `InstanceAndActiveVmm` instead of a `nexus_db_model::Instance`.
- When starting or migrating an instance, reason about the instance's prior
  state using its current instance and VMM states. (These two operations are
  special because they kick off sagas; this commit doesn't adjust operations
  that try to interact directly with a running instance on its current sled.)
- Change serial console functions to get an instance's Propolis IP from its
  active VMM record.

Also add another missing DB model type conversion impl.

The next few commits will deal with more complicated operations in this module.
Make `cpapi_instances_put` take a `SledInstanceState` and rework its callees
accordingly:

- Ensure that an instance's networking state is torn down when the instance
  stops (in addition to ensuring it is correct after migrating).
- Release VMMs' resource reservations when they are destroyed.
- Add some datastore primitives for working with VMMs. Not all of these are
  used just yet, but it was easiest to add them all in one shot.

Also fix a couple of serial console API build errors missed in the last commit.
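
In outline, the updated handling looks something like the hedged sketch below; the helper functions and the `instance_has_active_vmm` field are placeholders for the real datastore, resource-accounting, and networking plumbing:

```rust
use uuid::Uuid;

#[derive(Clone, Copy, PartialEq)]
pub enum VmmState { Starting, Running, Stopping, Stopped, Destroyed }

/// Stand-in for the internal-API state report described earlier.
pub struct SledInstanceState {
    pub propolis_id: Uuid,
    pub vmm_state: VmmState,
    pub instance_has_active_vmm: bool, // stand-in for "updated instance still points at a VMM"
}

// Placeholder hooks standing in for the real datastore and networking helpers.
async fn write_instance_and_vmm_state(_report: &SledInstanceState) { /* the CTE above */ }
async fn release_vmm_reservation(_propolis_id: Uuid) { /* sled resource accounting */ }
async fn delete_instance_network_config(_instance_id: Uuid) { /* OPTE/NAT cleanup */ }

/// Sketch of the state-report handling behind `cpapi_instances_put`.
pub async fn handle_state_report(instance_id: Uuid, report: SledInstanceState) {
    // Persist the instance and VMM runtime states atomically.
    write_instance_and_vmm_state(&report).await;

    // A VMM that has reached a terminal state no longer needs its sled
    // resources; an instance left with no active VMM has stopped, so its
    // networking state can be torn down too.
    if report.vmm_state == VmmState::Destroyed {
        release_vmm_reservation(report.propolis_id).await;
        if !report.instance_has_active_vmm {
            delete_instance_network_config(instance_id).await;
        }
    }
}
```
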
- Sled agent now returns instance/VMM runtime state pairs from its PUT calls,
  so allow both of these to be updated when handling a PUT result.
- Add dire warnings about the perils of marking an instance as Failed if sled
  agent returns an error from one of these calls. (This needs to have an issue
  filed.)
- Improve tracing in this path to include instance IDs.
The outcome of requesting a state change now depends in part on whether an
instance has an active VMM and, if so, what that VMM's state is.

Requesting state changes also requires an instance's current VMM to determine
what sled to send state change requests to.

This commit fixes up callers of instance_request_state in instance.rs itself
but not in the sagas that call it (sagas will be updated later in their own
individual commits).
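
The decision logic is roughly as follows; the enums, error strings, and specific mappings are illustrative assumptions, not the literal `instance_request_state` implementation:

```rust
use uuid::Uuid;

pub enum StateChange { Start, Stop, Reboot }

pub enum Action {
    /// Nothing to ask a sled agent to do; just record the result in CRDB.
    UpdateDatabaseOnly,
    /// Forward the request to the sled agent hosting the active VMM.
    SendToSled { sled_id: Uuid },
}

/// Illustrative dispatch only: the outcome depends on whether an active VMM
/// exists and, if so, which sled it lives on.
pub fn select_action(
    requested: StateChange,
    active_vmm_sled: Option<Uuid>,
) -> Result<Action, String> {
    match (requested, active_vmm_sled) {
        // Stopping an instance that has no active VMM is already done.
        (StateChange::Stop, None) => Ok(Action::UpdateDatabaseOnly),
        // Rebooting requires a running VMM.
        (StateChange::Reboot, None) => Err("instance is not running".to_string()),
        // Starting an instance with no VMM goes through the start saga instead.
        (StateChange::Start, None) => Err("use the instance start saga".to_string()),
        // Otherwise, ask the sled that hosts the active VMM.
        (_, Some(sled_id)) => Ok(Action::SendToSled { sled_id }),
    }
}
```
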
It's no longer possible to go from an instance record to a sled ID. Remove the
function that did this and make its callers take sled IDs explicitly.

Also fix a tiny build error in cidata.rs.
Merely creating an instance no longer assigns a sled ID or a Propolis ID/IP,
so remove all those nodes from the create saga.

To handle create-and-start requests, make the create saga move newly-created
instances to Stopped, then have the create saga's invoker run a start saga once
the request to create the instance finishes. This means the create saga no
longer has to know how to set up instance networking state, so remove all those
nodes too. (Note that this is a minor change in behavior: previously, if a
client successfully generated a create saga and then its Nexus server crashed,
the client call would fail but the instance would still start, because that
was part of the saga; now, the instance will be created but not started. If
desired this can be addressed in the future by making the start saga a subsaga
of the create saga, but that work is not relevant to the instance/VMM split and
so is left for another day.)
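
A minimal sketch of the new ordering, with hypothetical `run_create_saga`/`run_start_saga` helpers standing in for the real saga machinery:

```rust
use uuid::Uuid;

// Hypothetical stand-ins for executing the two sagas (uuid's `v4` feature).
async fn run_create_saga() -> Result<Uuid, String> { Ok(Uuid::new_v4()) }
async fn run_start_saga(_instance_id: Uuid) -> Result<(), String> { Ok(()) }

/// The create saga now leaves the instance Stopped with no sled, Propolis ID,
/// or Propolis IP; if the caller asked for a running instance, the *invoker*
/// runs a separate start saga afterwards. A Nexus crash in between leaves a
/// created-but-stopped instance rather than silently starting it later.
pub async fn create_instance(start_requested: bool) -> Result<Uuid, String> {
    let instance_id = run_create_saga().await?;
    if start_requested {
        run_start_saga(instance_id).await?;
    }
    Ok(instance_id)
}
```
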
This saga now has to allocate a sled for the instance to run on and give its VMM
an IP on that sled, so start doing that.

Create a new module with some common instance operations that this saga shares
with the migration saga (which will be cleaned up next).
Change the migration saga to create VMM records and to use common code to
reserve space on the destination sled.

Splitting VMM information off from the instance table allows the source and
target sled agents to use the same "pre-migration" instance record--it's no
longer necessary to synthesize an instance record for the target with an
as-yet-unused Propolis ID, and accordingly it's no longer necessary to do extra
work to avoid writing that record back to CRDB. Moreover, sled agent now
understands whether a VMM is the "active" VMM for an instance, so it can
guarantee that unregistering a prospective migration target (as in saga unwind)
won't produce a new instance record. This considerably simplifies saga unwind.
Rewrite the saga comments accordingly.

Also take this opportunity to drop the `WriteBack` discipline enum from instance
unregistration, since it's no longer needed.

Add a sled agent unit test that explicitly verifies that rudely terminating a
migration target VMM leaves its instance record unchanged.
This is mostly just code removal, because VM networking state and sled
resource reservations now get cleaned up when an instance stops. (The saga
implicitly checks that an instance has no active VMM before allowing that
instance to be deleted; this was handled a few commits ago in changes to
`project_delete_instance`).

Because this saga no longer needs an instance sled ID, it doesn't need to fetch
soft-deleted instances from CRDB anymore. Remove the associated routine from
the datastore code.
When creating a snapshot saga for an attached disk, get its instance's current
state and sled ID (if it has one) and use these to construct an optional
instance-and-sled tuple to pass to the saga instead of setting a "use the
pantry" flag to false and then looking this information up again mid-saga.

The old and new code are morally equivalent from a synchronization standpoint:
it was and still is possible for an instance to stop or move between the time
the saga decides what sled agent to talk to and the time the snapshot request
is issued. (The window is a little wider with this change in place, though.)
Also add a helper function to get the sled ID from an InstanceAndActiveVmm,
since the `.vmm().as_ref().expect().sled_id` pattern cropped up in a few places.
Also fix some dorky indentation that seeped into the start saga definition.
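
The helper is tiny; roughly (types illustrative):

```rust
use uuid::Uuid;

// Illustrative stand-ins; only the fields needed for the helper.
pub struct Vmm { pub sled_id: Uuid }
pub struct InstanceAndActiveVmm { pub vmm: Option<Vmm> }

impl InstanceAndActiveVmm {
    /// Replaces the repeated `.vmm().as_ref().expect(...).sled_id` pattern:
    /// an instance has a sled ID only while it has an active VMM.
    pub fn sled_id(&self) -> Option<Uuid> {
        self.vmm.as_ref().map(|vmm| vmm.sled_id)
    }
}
```
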
The relevant caller (`reserve_vmm_resources`) already does the right thing; I
just forgot to apply this part of the patch. Oops.
Diesel's struct derives assume that the order of fields in a struct matches
the order of columns in the corresponding `table!` directive. Make sure this
is the case for the instance table (the static machine properties moved when
they were promoted from the runtime state to the regular instance record).
Use 'propolis_id' to refer to the ID of a VMM object, except in parts of the
datastore where it is very clear that what's being passed is the primary key
into the Vmm table.
Contributor Author

@gjcolombo gjcolombo left a comment


I've fixed a bunch of the easier stuff but haven't gotten around to the fallback_state thing yet.

Sled agent has one path that updates the `state` column in an instance record
(the `fallback_state` field in an `InstanceRuntimeState`): if an active Propolis
is retired (i.e. the instance's Propolis ID goes from `Some` to `None`), the
instance goes to the Stopped state.

This rule is so simple that it can be implemented entirely in Nexus, giving
Nexus sole ownership of this table column:

- Remove the `fallback_state` field from `api::internal::InstanceRuntimeState`.
  This completely hides the field from the sled agent.
- When Nexus processes a new `SledInstanceState`, check the
  `active_propolis_id`: if it's `Some`, the state is Running; if it's `None`,
  the state is Stopped.

The only other states that an instance can now have are Creating and Destroyed,
both of which are set only by Nexus and only when it is known that no sled agent
refers to the instance (either by definition when an instance is being created,
or because the instance has no active VMM when it's being destroyed).

Rename the `fallback_state` field to `nexus_state` to reflect its new ownership.
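
A sketch of the Nexus-side derivation described above (names illustrative):

```rust
use uuid::Uuid;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum InstanceState { Creating, Running, Stopped, Destroyed }

/// Nexus derives the instance-table state column from a sled agent report:
/// an instance with an active Propolis is Running; one whose active Propolis
/// was just retired is Stopped. Creating and Destroyed are set only by Nexus
/// itself, never from a sled agent report.
pub fn nexus_state_from_report(active_propolis_id: Option<Uuid>) -> InstanceState {
    if active_propolis_id.is_some() {
        InstanceState::Running
    } else {
        InstanceState::Stopped
    }
}
```
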
@gjcolombo
Contributor Author

@smklein This is ready for another lap at your convenience. I've resolved all the comment threads that needed no action or where I made a code change; there are a few things where there's still some discussion to be had and I've not made any changes yet, and I've left those open.

Thank you again for wrestling with this giant change!

Contributor

@jordanhendricks jordanhendricks left a comment


I gave this a look through, mostly for my own understanding; I don't have a lot of useful feedback. Thanks for taking this on -- this is a huge change and I know it was a ton of work! Glad to see it land.

Collaborator

@smklein smklein left a comment


Thanks for the follow-up on the fallback_state and SQL schema change stuff.

Let's get this in!

@gjcolombo gjcolombo enabled auto-merge (squash) October 12, 2023 04:47
@gjcolombo gjcolombo merged commit 876e8ca into main Oct 12, 2023
21 of 22 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/your-vmm-table-is-ready branch October 12, 2023 05:25
gjcolombo added a commit that referenced this pull request Oct 13, 2023
Finding boundary switches in the instance start saga requires fleet
query access. Use the Nexus instance allocation context to get it
instead of the saga initiator's operation context. (This was introduced
in #3873; it wasn't previously a problem because the instance create
saga set up instance NAT state and received its boundary switch list as a
parameter that was generated using the instance allocation context. #4194
made this worse by having instance create start instances via the start
saga instead of via its own saga nodes.)

Update the instance-in-silo integration test to make sure that instances
created by a silo collaborator actually start. This is unfortunately not
very elegant. The `instance_simulate` function family normally uses
`OpContext::for_tests` to get an operation context, but that context is
associated with a user that isn't in the test silo. To get around this,
add some simulation interfaces that take an explicit `OpContext` and
then generate one corresponding to the silo user for the test case in
question. It seems like it'd be nicer to give helper routines like
`instance_simulate` access to a context that is omnipotent across all
silos, but I wasn't sure how best to do this. I'm definitely open to
suggestions here.

Tested via cargo tests.

Fixes #4272.