Perform instance state transitions in instance-update saga (#5749)
A number of bugs relating to guest instance lifecycle management have
been observed. These include:

- Instances getting "stuck" in a transient state, such as `Starting` or
`Stopping`, with no way to forcibly terminate them (#4004)
- Race conditions between instances starting and receiving state
updates, which cause provisioning counters to underflow (#5042)
- Instances entering and exiting the `Failed` state when nothing is
actually wrong with them, potentially leaking virtual resources (#4226)

These typically require support intervention to resolve.

Broadly, these issues exist because the control plane's current
mechanisms for understanding and managing an instance's lifecycle state
machine are "kind of a mess". In particular:

- **(Conceptual) ownership of the CRDB `instance` record is currently
split between Nexus and sled-agent(s).** Although Nexus is the only
entity that actually reads or writes to the database, the instance's
runtime state is also modified by the sled-agents that manage its active
Propolis (and, if it's migrating, its target Propolis), and written to
CRDB on their behalf by Nexus. This means that there are multiple copies
of the instance's state in different places at the same time, which can
potentially get out of sync. When an instance is migrating, its state is
updated by two different sled-agents, and they may potentially generate
state updates that conflict with each other. And, splitting the
responsibility between Nexus and sled-agent makes the code more complex
and harder to understand: there is no one place where all instance state
machine transitions are performed.
- **Nexus doesn't ensure that instance state updates are processed
reliably.** Instance state transitions triggered by user actions, such
as `instance-start` and `instance-delete`, are performed by distributed
sagas, ensuring that they run to completion even if the Nexus instance
executing them comes to an untimely end. This is *not* the case for
operations that result from instance state transitions reported by
sled-agents, which just happen in the HTTP APIs for reporting instance
states. If the Nexus processing such a transition crashes, gets network
partitioned, or encounters a transient error, the instance is left in
an incomplete state and the remainder of the operation will not be
performed.

This branch rewrites much of the control plane's instance state
management subsystem to resolve these issues. At a high level, it makes
the following changes:

- **Nexus is now the sole owner of the `instance` record.** Sled-agents
no longer have their own copies of an instance's `InstanceRuntimeState`,
and do not generate changes to that state when reporting instance
observations to Nexus. Instead, the sled-agent only publishes updates to
the `vmm` and `migration` records (which are never modified by Nexus
directly) and Nexus is the only entity responsible for determining how
an instance's state should change in response to a VMM or migration
state update.
- **When an instance has an active VMM, its effective external state is
determined primarily by the active `vmm` record**, so that fewer state
transitions *require* changes to the `instance` record. PR #5854 laid
the groundwork for this change, but it's relevant here as well.
- **All updates to an `instance` record (and resources conceptually
owned by that instance) are performed by a distributed saga.** I've
introduced a new `instance-update` saga, which is responsible for
performing all changes to the `instance` record, virtual provisioning
resources, and instance network config that are performed as part of a
state transition. Moving this to a saga helps us to ensure that these
operations are always run to completion, even in the event of a sudden
Nexus death.
- **Consistency of instance state changes is ensured by distributed
locking.** State changes may be published by multiple sled-agents to
different Nexus replicas. If one Nexus replica is processing a state
change received from a sled-agent, and then the instance's state changes
again, and the sled-agent publishes that state change to a *different*
Nexus...lots of bad things can happen, since the second state change may
be applied on top of a stale view of the instance's state, when it *should* have a
"happens-after" relationship with the other state transition. And, some
operations may contradict each other when performed concurrently.

To prevent these race conditions, this PR has the dubious honor of using
the first _distributed lock_ in the Oxide control plane, the "instance
updater lock". I introduced the locking primitives in PR #5831 --- see
that branch for more discussion of locking.
- **Background tasks are added to prevent missed updates**. To ensure we
cannot accidentally miss an instance update even if a Nexus dies, hits a
network partition, or just accidentally eats the state update,
we add a new `instance-updater` background task, which queries the
database for instances that are in states that require an update saga
without such a saga running, and starts the requisite sagas.
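
As promised above, here is a toy, self-contained model of the updater
lock's *semantics* only --- the names and types are invented for this
sketch, and the real primitives (introduced in PR #5831) are CRDB queries
against the `instance` row, not in-memory operations:

```rust
/// Toy model of the "instance updater lock" (illustrative only; not the
/// API from PR #5831). The lock is a nullable holder ID plus a generation
/// number on the instance record; acquiring it is a conditional write
/// that succeeds only when the lock is currently unheld.
#[derive(Debug, Default)]
struct InstanceUpdaterLock {
    updater_id: Option<u64>, // which update saga (if any) holds the lock
    updater_gen: u64,        // bumped on every successful acquire/release
}

impl InstanceUpdaterLock {
    /// Try to acquire the lock for `saga_id`, returning the generation at
    /// which it was taken, or `None` if another saga currently holds it.
    fn try_lock(&mut self, saga_id: u64) -> Option<u64> {
        match self.updater_id {
            // We already hold it: idempotent re-acquire, so that replayed
            // saga nodes don't deadlock against themselves.
            Some(holder) if holder == saga_id => Some(self.updater_gen),
            // Someone else holds it: fail immediately rather than block.
            Some(_) => None,
            // Unheld: claim it and bump the generation.
            None => {
                self.updater_id = Some(saga_id);
                self.updater_gen += 1;
                Some(self.updater_gen)
            }
        }
    }

    /// Release the lock, but only if `saga_id` still holds it at `gen`;
    /// returns whether the unlock actually happened.
    fn unlock(&mut self, saga_id: u64, gen: u64) -> bool {
        if self.updater_id == Some(saga_id) && self.updater_gen == gen {
            self.updater_id = None;
            self.updater_gen += 1;
            true
        } else {
            false
        }
    }
}
```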

Currently, the instance update saga runs in the following cases (a simplified sketch of this check follows the list):

- An instance's active VMM transitions to `Destroyed`, in which case the
instance's virtual resources are cleaned up and the active VMM is
unlinked.
- Either side of an instance's live migration reports that the migration
has completed successfully.
- Either side of an instance's live migration reports that the migration
has failed.
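
As a simplified sketch of that check (illustrative only --- the enums and
types below are trimmed-down stand-ins, and the real logic lives in the
`instance-updater` background task and the saga itself):

```rust
/// Trimmed-down stand-ins for the real VMM and migration states, used
/// only in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmmState {
    Starting,
    Running,
    Stopping,
    Destroyed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MigrationState {
    Pending,
    InProgress,
    Completed,
    Failed,
}

/// A snapshot of the pieces of instance state the check looks at.
struct InstanceSnapshot {
    active_vmm_state: Option<VmmState>,
    migration_state: Option<MigrationState>,
}

/// Returns `true` if the instance is in a state that requires an
/// `instance-update` saga, per the three cases listed above.
fn needs_update_saga(snapshot: &InstanceSnapshot) -> bool {
    // Case 1: the active VMM was destroyed, so its virtual resources must
    // be cleaned up and the VMM unlinked from the instance.
    let vmm_destroyed =
        matches!(snapshot.active_vmm_state, Some(VmmState::Destroyed));
    // Cases 2 and 3: either side of a live migration reported that the
    // migration terminated, successfully or otherwise.
    let migration_terminated = matches!(
        snapshot.migration_state,
        Some(MigrationState::Completed | MigrationState::Failed)
    );
    vmm_destroyed || migration_terminated
}
```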

The inner workings of the instance-update saga itself are fairly complex,
and it has some interesting idiosyncrasies relative to the existing
sagas. I've written up a [lengthy comment] that provides an overview of
the theory behind the design of the saga and its principles of
operation, so I won't reproduce that in this commit message.

[lengthy comment]:
https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254
hawkw authored Aug 9, 2024
1 parent d391e5c commit 9b595e9
Showing 65 changed files with 7,305 additions and 3,018 deletions.
33 changes: 2 additions & 31 deletions clients/nexus-client/src/lib.rs
@@ -122,22 +122,6 @@ impl From<types::VmmState> for omicron_common::api::internal::nexus::VmmState {
}
}

impl From<omicron_common::api::internal::nexus::InstanceRuntimeState>
for types::InstanceRuntimeState
{
fn from(
s: omicron_common::api::internal::nexus::InstanceRuntimeState,
) -> Self {
Self {
dst_propolis_id: s.dst_propolis_id,
gen: s.gen,
migration_id: s.migration_id,
propolis_id: s.propolis_id,
time_updated: s.time_updated,
}
}
}

impl From<omicron_common::api::internal::nexus::VmmRuntimeState>
for types::VmmRuntimeState
{
@@ -153,10 +137,10 @@ impl From<omicron_common::api::internal::nexus::SledInstanceState>
s: omicron_common::api::internal::nexus::SledInstanceState,
) -> Self {
Self {
instance_state: s.instance_state.into(),
propolis_id: s.propolis_id,
vmm_state: s.vmm_state.into(),
migration_state: s.migration_state.map(Into::into),
migration_in: s.migration_in.map(Into::into),
migration_out: s.migration_out.map(Into::into),
}
}
}
@@ -169,26 +153,13 @@ impl From<omicron_common::api::internal::nexus::MigrationRuntimeState>
) -> Self {
Self {
migration_id: s.migration_id,
role: s.role.into(),
state: s.state.into(),
gen: s.gen,
time_updated: s.time_updated,
}
}
}

impl From<omicron_common::api::internal::nexus::MigrationRole>
for types::MigrationRole
{
fn from(s: omicron_common::api::internal::nexus::MigrationRole) -> Self {
use omicron_common::api::internal::nexus::MigrationRole as Input;
match s {
Input::Source => Self::Source,
Input::Target => Self::Target,
}
}
}

impl From<omicron_common::api::internal::nexus::MigrationState>
for types::MigrationState
{
79 changes: 64 additions & 15 deletions clients/sled-agent-client/src/lib.rs
@@ -5,6 +5,9 @@
//! Interface for making API requests to a Sled Agent
use async_trait::async_trait;
use schemars::JsonSchema;
use serde::Deserialize;
use serde::Serialize;
use std::convert::TryFrom;
use uuid::Uuid;

@@ -162,10 +165,10 @@ impl From<types::SledInstanceState>
{
fn from(s: types::SledInstanceState) -> Self {
Self {
instance_state: s.instance_state.into(),
propolis_id: s.propolis_id,
vmm_state: s.vmm_state.into(),
migration_state: s.migration_state.map(Into::into),
migration_in: s.migration_in.map(Into::into),
migration_out: s.migration_out.map(Into::into),
}
}
}
@@ -177,25 +180,12 @@ impl From<types::MigrationRuntimeState>
Self {
migration_id: s.migration_id,
state: s.state.into(),
role: s.role.into(),
gen: s.gen,
time_updated: s.time_updated,
}
}
}

impl From<types::MigrationRole>
for omicron_common::api::internal::nexus::MigrationRole
{
fn from(r: types::MigrationRole) -> Self {
use omicron_common::api::internal::nexus::MigrationRole as Output;
match r {
types::MigrationRole::Source => Output::Source,
types::MigrationRole::Target => Output::Target,
}
}
}

impl From<types::MigrationState>
for omicron_common::api::internal::nexus::MigrationState
{
@@ -457,12 +447,29 @@ impl From<types::SledIdentifiers>
/// are bonus endpoints, not generated in the real client.
#[async_trait]
pub trait TestInterfaces {
async fn instance_single_step(&self, id: Uuid);
async fn instance_finish_transition(&self, id: Uuid);
async fn instance_simulate_migration_source(
&self,
id: Uuid,
params: SimulateMigrationSource,
);
async fn disk_finish_transition(&self, id: Uuid);
}

#[async_trait]
impl TestInterfaces for Client {
async fn instance_single_step(&self, id: Uuid) {
let baseurl = self.baseurl();
let client = self.client();
let url = format!("{}/instances/{}/poke-single-step", baseurl, id);
client
.post(url)
.send()
.await
.expect("instance_single_step() failed unexpectedly");
}

async fn instance_finish_transition(&self, id: Uuid) {
let baseurl = self.baseurl();
let client = self.client();
@@ -484,4 +491,46 @@ impl TestInterfaces for Client {
.await
.expect("disk_finish_transition() failed unexpectedly");
}

async fn instance_simulate_migration_source(
&self,
id: Uuid,
params: SimulateMigrationSource,
) {
let baseurl = self.baseurl();
let client = self.client();
let url = format!("{baseurl}/instances/{id}/sim-migration-source");
client
.post(url)
.json(&params)
.send()
.await
.expect("instance_simulate_migration_source() failed unexpectedly");
}
}

/// Parameters to the `/instances/{id}/sim-migration-source` test API.
///
/// This message type is not included in the OpenAPI spec, because this API
/// exists only in test builds.
#[derive(Serialize, Deserialize, JsonSchema)]
pub struct SimulateMigrationSource {
/// The ID of the migration out of the instance's current active VMM.
pub migration_id: Uuid,
/// What migration result (success or failure) to simulate.
pub result: SimulatedMigrationResult,
}

/// The result of a simulated migration out from an instance's current active
/// VMM.
#[derive(Serialize, Deserialize, JsonSchema)]
pub enum SimulatedMigrationResult {
/// Simulate a successful migration out.
Success,
/// Simulate a failed migration out.
///
/// # Note
///
/// This is not currently implemented by the simulated sled-agent.
Failure,
}
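
(For orientation only: a hypothetical caller of the new test helper,
assuming the crate is consumed as `sled_agent_client` and that the IDs
come from an instance the test has already set up, might look like this.)

```rust
use sled_agent_client::{
    Client, SimulateMigrationSource, SimulatedMigrationResult, TestInterfaces,
};
use uuid::Uuid;

/// Sketch: drive the simulated sled-agent to report a successful
/// migration out of the given instance's active VMM.
async fn simulate_migration_out_success(
    client: &Client,
    instance_id: Uuid,
    migration_id: Uuid,
) {
    client
        .instance_simulate_migration_source(
            instance_id,
            SimulateMigrationSource {
                migration_id,
                result: SimulatedMigrationResult::Success,
            },
        )
        .await;
}
```
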
59 changes: 26 additions & 33 deletions common/src/api/internal/nexus.rs
@@ -117,18 +117,38 @@ pub struct VmmRuntimeState {
/// specific VMM and the instance it incarnates.
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct SledInstanceState {
/// The sled's conception of the state of the instance.
pub instance_state: InstanceRuntimeState,

/// The ID of the VMM whose state is being reported.
pub propolis_id: PropolisUuid,

/// The most recent state of the sled's VMM process.
pub vmm_state: VmmRuntimeState,

/// The current state of any in-progress migration for this instance, as
/// understood by this sled.
pub migration_state: Option<MigrationRuntimeState>,
/// The current state of any inbound migration to this VMM.
pub migration_in: Option<MigrationRuntimeState>,

/// The state of any outbound migration from this VMM.
pub migration_out: Option<MigrationRuntimeState>,
}

#[derive(Copy, Clone, Debug, Default)]
pub struct Migrations<'state> {
pub migration_in: Option<&'state MigrationRuntimeState>,
pub migration_out: Option<&'state MigrationRuntimeState>,
}

impl Migrations<'_> {
pub fn empty() -> Self {
Self { migration_in: None, migration_out: None }
}
}

impl SledInstanceState {
pub fn migrations(&self) -> Migrations<'_> {
Migrations {
migration_in: self.migration_in.as_ref(),
migration_out: self.migration_out.as_ref(),
}
}
}

/// An update from a sled regarding the state of a migration, indicating the
@@ -137,7 +157,6 @@ pub struct SledInstanceState {
pub struct MigrationRuntimeState {
pub migration_id: Uuid,
pub state: MigrationState,
pub role: MigrationRole,
pub gen: Generation,

/// Timestamp for the migration state update.
Expand Down Expand Up @@ -192,32 +211,6 @@ impl fmt::Display for MigrationState {
}
}

#[derive(
Clone, Copy, Debug, PartialEq, Eq, Deserialize, Serialize, JsonSchema,
)]
#[serde(rename_all = "snake_case")]
pub enum MigrationRole {
/// This update concerns the source VMM of a migration.
Source,
/// This update concerns the target VMM of a migration.
Target,
}

impl MigrationRole {
pub fn label(&self) -> &'static str {
match self {
Self::Source => "source",
Self::Target => "target",
}
}
}

impl fmt::Display for MigrationRole {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.write_str(self.label())
}
}

// Oximeter producer/collector objects.

/// The kind of metric producer this is.
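
(Illustrative only: a hypothetical helper --- not part of this change ---
showing how a consumer might read both migration directions out of the new
report shape via the `migrations()` accessor added above.)

```rust
use omicron_common::api::internal::nexus::{
    MigrationRuntimeState, Migrations, SledInstanceState,
};

/// Hypothetical helper (not part of this change): summarize both
/// directions of migration state carried by a sled-agent report.
fn migration_summary(state: &SledInstanceState) -> String {
    let Migrations { migration_in, migration_out } = state.migrations();
    // `MigrationState` implements `Display`, so render each side as text.
    let render = |m: Option<&MigrationRuntimeState>| {
        m.map(|m| m.state.to_string()).unwrap_or_else(|| "none".to_string())
    };
    format!(
        "migration in: {}, migration out: {}",
        render(migration_in),
        render(migration_out)
    )
}
```
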
88 changes: 82 additions & 6 deletions dev-tools/omdb/src/bin/omdb/nexus.rs
@@ -929,6 +929,9 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
/// number of stale instance metrics that were deleted
pruned_instances: usize,

/// update sagas queued due to instance updates.
update_sagas_queued: usize,

/// instance states from completed checks.
///
/// this is a mapping of stringified instance states to the number
@@ -970,6 +973,7 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
),
Ok(TaskSuccess {
total_instances,
update_sagas_queued,
pruned_instances,
instance_states,
failed_checks,
@@ -987,7 +991,7 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
for (state, count) in &instance_states {
println!(" -> {count} instances {state}")
}

println!(" update sagas queued: {update_sagas_queued}");
println!(" failed checks: {total_failures}");
for (failure, count) in &failed_checks {
println!(" -> {count} {failure}")
@@ -1239,11 +1243,6 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
} else if name == "lookup_region_port" {
match serde_json::from_value::<LookupRegionPortStatus>(details.clone())
{
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
),

Ok(LookupRegionPortStatus { found_port_ok, errors }) => {
println!(" total filled in ports: {}", found_port_ok.len());
for line in &found_port_ok {
@@ -1255,6 +1254,83 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
println!(" > {line}");
}
}

Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details,
),
}
} else if name == "instance_updater" {
#[derive(Deserialize)]
struct UpdaterStatus {
/// number of instances found with destroyed active VMMs
destroyed_active_vmms: usize,

/// number of instances found with terminated active migrations
terminated_active_migrations: usize,

/// number of update sagas started.
sagas_started: usize,

/// number of sagas completed successfully
sagas_completed: usize,

/// number of sagas which failed
sagas_failed: usize,

/// number of sagas which could not be started
saga_start_failures: usize,

/// the last error that occurred during execution.
error: Option<String>,
}
match serde_json::from_value::<UpdaterStatus>(details.clone()) {
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
),
Ok(UpdaterStatus {
destroyed_active_vmms,
terminated_active_migrations,
sagas_started,
sagas_completed,
sagas_failed,
saga_start_failures,
error,
}) => {
if let Some(error) = error {
println!(" task did not complete successfully!");
println!(" most recent error: {error}");
}

println!(
" total instances in need of updates: {}",
destroyed_active_vmms + terminated_active_migrations
);
println!(
" instances with destroyed active VMMs: {}",
destroyed_active_vmms,
);
println!(
" instances with terminated active migrations: {}",
terminated_active_migrations,
);
println!(" update sagas started: {sagas_started}");
println!(
" update sagas completed successfully: {}",
sagas_completed,
);

let total_failed = sagas_failed + saga_start_failures;
if total_failed > 0 {
println!(" unsuccessful update sagas: {total_failed}");
println!(
" sagas which could not be started: {}",
saga_start_failures
);
println!(" sagas failed: {sagas_failed}");
}
}
};
} else {
println!(