Skip to content

Commit

Permalink
[nexus] Reincarnate Failed instances (#6503)
Browse files Browse the repository at this point in the history
As of #6455, instances in the `Failed` state are allowed to be
restarted. As described in [RFD 486 § 4.3], Nexus should automatically
restart such instances when the instance record's `auto_restart_policy`
field (née `boot_on_fault`, see #6499) indicates that it is permitted to
do so. This branch implements this behavior, by adding a new
`instance_reincarnation` RPW. This RPW consists of a background task
which queries CRDB for instances in the `Failed` state which are
eligible to be automatically restarted, and starts a new
`instance-start` saga for any instances it discovers.

Instances are considered eligible for reincarnation if all of the
following conditions are true:

- **The instance is in the `InstanceState::Failed` state.** Meaning, it
  must have *no active VMM* --- if the instance is in the
  `InstanceState::Vmm` state with an active VMM which is in
  `VmmState::Failed`, that indicates no instance-update saga has run to
  clean up the old VMM's resources. In this case, we must wait until the
  instance record itself has transitioned to `Failed` before attempting
  to restart it.
- **The instance's `auto_restart_policy` permits it to be restarted.**
  The `auto_restart_policy` enum is changed from representing the
  classess of failures on which an instance may be restarted (`never`,
  `sled_failures_only`, and `all_failures`) was changed to a more
  general quality-of-service for automatic restarts. Presently, this can
  either be `never`, meaning that the instance can never be restarted
  automatically; or `best_effort`, meaning that the control plane should
  try to keep the instance running but is permitted to not restart it if
  necessary to preserve the stability of the whole system. The default
  policy for instances which don't provide one is `best_effort`.
- **A cooldown period has elapsed since the instance's last automatic
  restart.**. In order to prevent instances which fail frequently from
  compromising the reliability of the system as a whole, we enforce a
  cooldown period between automatic restarts. This is tracked by setting
  a last-auto-restart timestamp in the `instance` record. At present,
  the cooldown period is always one hour unless the instance record
  overrides it, which currently can only be done using a test-only Nexus
  method. In the future, we may consider allowing the cooldown period to
  be configured by users, but because it presents a denial-of-service
  risk, setting it should probably be a more priveliged operation than
  creating an instance.

The implementation of the actual reincarnation background task is
relatively straightforward: it runs a new
`DataStore::find_reincarnatable_instances` query, which returns
instances in the `Failed` state, and it spawns `instance-start` sagas
for any instances which satisfy the above conditions. The restart
cooldown is implemented by adding a new `time_last_auto_restarted` field
to the `instance` table and to the `InstanceRuntimeState` type. We add a
`Reason` enum to the `instance-start` saga's `Params` that indicates why
the instance is being started, and if it is `Reason::AutoRestart`, then
the start saga will set the `time_last_auto_restarted` timestamp when
moving the instance record to the `Starting` state. Additionally, I've
added the `auto_restart_policy` field to the `params::InstanceCreate`
type in the Nexus API so that instances can be created with auto-restart
policies. Reviewers should note that many of the files changed in this
diff were only modified by adding that field to a
`params::InstanceCreate` literal.

This branch adds a minimal, initial implementation of instance
auto-restarting.

[RFD 486 § 4.3]: https://rfd.shared.oxide.computer/rfd/486#_instances

Fixes #5553
  • Loading branch information
hawkw authored Sep 23, 2024
1 parent 5e13d20 commit c2e718f
Show file tree
Hide file tree
Showing 68 changed files with 2,253 additions and 138 deletions.
52 changes: 52 additions & 0 deletions common/src/api/external/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1158,6 +1158,12 @@ impl From<&InstanceCpuCount> for i64 {
pub struct InstanceRuntimeState {
pub run_state: InstanceState,
pub time_run_state_updated: DateTime<Utc>,
/// The timestamp of the most recent time this instance was automatically
/// restarted by the control plane.
///
/// If this is not present, then this instance has not been automatically
/// restarted.
pub time_last_auto_restarted: Option<DateTime<Utc>>,
}

/// View of an Instance
Expand All @@ -1179,6 +1185,52 @@ pub struct Instance {

#[serde(flatten)]
pub runtime: InstanceRuntimeState,

#[serde(flatten)]
pub auto_restart_status: InstanceAutoRestartStatus,
}

/// Status of control-plane driven automatic failure recovery for this instance.
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct InstanceAutoRestartStatus {
/// `true` if this instance's auto-restart policy will permit the control
/// plane to automatically restart it if it enters the `Failed` state.
//
// Rename this field, as the struct is `#[serde(flatten)]`ed into the
// `Instance` type, and we would like the field to be prefixed with
// `auto_restart`.
#[serde(rename = "auto_restart_enabled")]
pub enabled: bool,

/// The time at which the auto-restart cooldown period for this instance
/// completes, permitting it to be automatically restarted again. If the
/// instance enters the `Failed` state, it will not be restarted until after
/// this time.
///
/// If this is not present, then either the instance has never been
/// automatically restarted, or the cooldown period has already expired,
/// allowing the instance to be restarted immediately if it fails.
//
// Rename this field, as the struct is `#[serde(flatten)]`ed into the
// `Instance` type, and we would like the field to be prefixed with
// `auto_restart`.
#[serde(rename = "auto_restart_cooldown_expiration")]
pub cooldown_expiration: Option<DateTime<Utc>>,
}

/// A policy determining when an instance should be automatically restarted by
/// the control plane.
#[derive(Copy, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[serde(rename_all = "snake_case")]
pub enum InstanceAutoRestartPolicy {
/// The instance should not be automatically restarted by the control plane
/// if it fails.
Never,
/// If this instance is running and unexpectedly fails (e.g. due to a host
/// software crash or unexpected host reboot), the control plane will make a
/// best-effort attempt to restart it. The control plane may choose not to
/// restart the instance to preserve the overall availability of the system.
BestEffort,
}

// DISKS
Expand Down
42 changes: 38 additions & 4 deletions dev-tools/omdb/src/bin/omdb/db.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2881,7 +2881,9 @@ async fn cmd_db_instance_info(
instance::dsl as instance_dsl, migration::dsl as migration_dsl,
vmm::dsl as vmm_dsl,
};
use nexus_db_model::{Instance, InstanceRuntimeState, Migration, Vmm};
use nexus_db_model::{
Instance, InstanceKarmicStatus, InstanceRuntimeState, Migration, Vmm,
};
let InstanceInfoArgs { id } = args;

let instance = instance_dsl::instance
Expand Down Expand Up @@ -2926,7 +2928,7 @@ async fn cmd_db_instance_info(
// `nexus_db_model::Instance` type will want to make sure to update this
// code as well. Unfortunately, we can't just destructure the struct here to
// make sure this code breaks, since the `identity` field isn't public.
// So...just don't forget to do that, I guess.
// So...just don't forget to do that, I guess.
const ID: &'static str = "ID";
const PROJECT_ID: &'static str = "project ID";
const NAME: &'static str = "name";
Expand All @@ -2937,10 +2939,12 @@ async fn cmd_db_instance_info(
const VCPUS: &'static str = "vCPUs";
const MEMORY: &'static str = "memory";
const HOSTNAME: &'static str = "hostname";
const AUTO_RESTART: &'static str = "auto-restart policy";
const AUTO_RESTART: &'static str = "auto-restart";
const STATE: &'static str = "nexus state";
const LAST_MODIFIED: &'static str = "last modified at";
const LAST_UPDATED: &'static str = "last updated at";
const LAST_AUTO_RESTART: &'static str = "last auto-restarted at";
const KARMIC_STATUS: &'static str = "karmic status";
const ACTIVE_VMM: &'static str = "active VMM ID";
const TARGET_VMM: &'static str = "target VMM ID";
const MIGRATION_ID: &'static str = "migration ID";
Expand All @@ -2962,6 +2966,8 @@ async fn cmd_db_instance_info(
API_STATE,
LAST_UPDATED,
LAST_MODIFIED,
LAST_AUTO_RESTART,
KARMIC_STATUS,
ACTIVE_VMM,
TARGET_VMM,
MIGRATION_ID,
Expand All @@ -2970,6 +2976,17 @@ async fn cmd_db_instance_info(
MIGRATION_RECORD,
TARGET_VMM_RECORD,
]);

fn print_multiline_debug(slug: &str, thing: &impl core::fmt::Debug) {
println!(
" {slug:>WIDTH$}:\n{}",
textwrap::indent(
&format!("{thing:#?}"),
&" ".repeat(WIDTH - slug.len() + 8)
)
);
}

println!("\n{:=<80}", "== INSTANCE ");
println!(" {ID:>WIDTH$}: {}", instance.id());
println!(" {PROJECT_ID:>WIDTH$}: {}", instance.project_id);
Expand All @@ -2985,7 +3002,7 @@ async fn cmd_db_instance_info(
println!(" {VCPUS:>WIDTH$}: {}", instance.ncpus.0 .0);
println!(" {MEMORY:>WIDTH$}: {}", instance.memory.0);
println!(" {HOSTNAME:>WIDTH$}: {}", instance.hostname);
println!(" {AUTO_RESTART:>WIDTH$}: {:?}", instance.auto_restart_policy);
print_multiline_debug(AUTO_RESTART, &instance.auto_restart);
println!("\n{:=<80}", "== RUNTIME STATE ");
let InstanceRuntimeState {
time_updated,
Expand All @@ -2994,6 +3011,7 @@ async fn cmd_db_instance_info(
migration_id,
nexus_state,
r#gen,
time_last_auto_restarted,
} = instance.runtime_state;
println!(" {STATE:>WIDTH$}: {nexus_state:?}");
let effective_state = InstanceAndActiveVmm::determine_effective_state(
Expand All @@ -3008,6 +3026,22 @@ async fn cmd_db_instance_info(
" {LAST_UPDATED:>WIDTH$}: {time_updated:?} (generation {})",
r#gen.0
);
println!(" {LAST_AUTO_RESTART:>WIDTH$}: {time_last_auto_restarted:?}");
match instance.auto_restart.status(&instance.runtime_state) {
InstanceKarmicStatus::NotFailed => {}
InstanceKarmicStatus::Ready => {
println!("(i) {KARMIC_STATUS:>WIDTH$}: ready to reincarnate!");
}
InstanceKarmicStatus::Forbidden => {
println!("(i) {KARMIC_STATUS:>WIDTH$}: reincarnation forbidden");
}
InstanceKarmicStatus::CoolingDown(remaining) => {
println!(
"/!\\ {KARMIC_STATUS:>WIDTH$}: cooling down \
({remaining:?} remaining)"
);
}
}
println!(" {ACTIVE_VMM:>WIDTH$}: {propolis_id:?}");
println!(" {TARGET_VMM:>WIDTH$}: {dst_propolis_id:?}");
println!(
Expand Down
75 changes: 75 additions & 0 deletions dev-tools/omdb/src/bin/omdb/nexus.rs
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ use nexus_db_queries::db::lookup::LookupPath;
use nexus_saga_recovery::LastPass;
use nexus_types::deployment::Blueprint;
use nexus_types::internal_api::background::AbandonedVmmReaperStatus;
use nexus_types::internal_api::background::InstanceReincarnationStatus;
use nexus_types::internal_api::background::InstanceUpdaterStatus;
use nexus_types::internal_api::background::LookupRegionPortStatus;
use nexus_types::internal_api::background::RegionReplacementDriverStatus;
Expand Down Expand Up @@ -1781,6 +1782,80 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
}
}
}
} else if name == "instance_reincarnation" {
match serde_json::from_value::<InstanceReincarnationStatus>(
details.clone(),
) {
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
),
Ok(InstanceReincarnationStatus {
instances_found,
instances_reincarnated,
changed_state,
query_error,
restart_errors,
}) => {
const FOUND: &'static str =
"instances eligible for reincarnation:";
const REINCARNATED: &'static str = " instances reincarnated:";
const CHANGED_STATE: &'static str =
" instances which changed state before they could be reincarnated:";
const ERRORS: &'static str =
" instances which failed to be reincarnated:";
const COOLDOWN_PERIOD: &'static str =
"default cooldown period:";
const WIDTH: usize = const_max_len(&[
FOUND,
REINCARNATED,
CHANGED_STATE,
ERRORS,
COOLDOWN_PERIOD,
]);
let n_restart_errors = restart_errors.len();
let n_restarted = instances_reincarnated.len();
let n_changed_state = changed_state.len();
println!(" {FOUND:<WIDTH$} {instances_found:>3}");
println!(" {REINCARNATED:<WIDTH$} {n_restarted:>3}");
println!(" {CHANGED_STATE:<WIDTH$} {n_changed_state:>3}",);
println!(" {ERRORS:<WIDTH$} {n_restart_errors:>3}");

if let Some(e) = query_error {
println!(
" an error occurred while searching for instances \
to reincarnate:\n {e}",
);
}

if n_restart_errors > 0 {
println!(
" errors occurred while restarting the following \
instances:"
);
for (id, error) in restart_errors {
println!(" > {id}: {error}");
}
}

if n_restarted > 0 {
println!(" the following instances have reincarnated:");
for id in instances_reincarnated {
println!(" > {id}")
}
}

if n_changed_state > 0 {
println!(
" the following instances states changed before \
they could be reincarnated:"
);
for id in changed_state {
println!(" > {id}")
}
}
}
};
} else {
println!(
"warning: unknown background task: {:?} \
Expand Down
15 changes: 15 additions & 0 deletions dev-tools/omdb/tests/env.out
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,11 @@ task: "external_endpoints"
on each one


task: "instance_reincarnation"
schedules start sagas for failed instances that can be automatically
restarted


task: "instance_updater"
detects if instances require update sagas and schedules them

Expand Down Expand Up @@ -252,6 +257,11 @@ task: "external_endpoints"
on each one


task: "instance_reincarnation"
schedules start sagas for failed instances that can be automatically
restarted


task: "instance_updater"
detects if instances require update sagas and schedules them

Expand Down Expand Up @@ -405,6 +415,11 @@ task: "external_endpoints"
on each one


task: "instance_reincarnation"
schedules start sagas for failed instances that can be automatically
restarted


task: "instance_updater"
detects if instances require update sagas and schedules them

Expand Down
25 changes: 25 additions & 0 deletions dev-tools/omdb/tests/successes.out
Original file line number Diff line number Diff line change
Expand Up @@ -302,6 +302,11 @@ task: "external_endpoints"
on each one


task: "instance_reincarnation"
schedules start sagas for failed instances that can be automatically
restarted


task: "instance_updater"
detects if instances require update sagas and schedules them

Expand Down Expand Up @@ -518,6 +523,16 @@ task: "external_endpoints"

TLS certificates: 0

task: "instance_reincarnation"
configured period: every 1m
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
instances eligible for reincarnation: 0
instances reincarnated: 0
instances which changed state before they could be reincarnated: 0
instances which failed to be reincarnated: 0

task: "instance_updater"
configured period: every <REDACTED_DURATION>s
currently executing: no
Expand Down Expand Up @@ -948,6 +963,16 @@ task: "external_endpoints"

TLS certificates: 0

task: "instance_reincarnation"
configured period: every 1m
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
instances eligible for reincarnation: 0
instances reincarnated: 0
instances which changed state before they could be reincarnated: 0
instances which failed to be reincarnated: 0

task: "instance_updater"
configured period: every <REDACTED_DURATION>s
currently executing: no
Expand Down
1 change: 1 addition & 0 deletions end-to-end-tests/src/instance_launch.rs
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ async fn instance_launch() -> Result<()> {
ssh_key_name.clone(),
)]),
start: true,
auto_restart_policy: Default::default(),
})
.send()
.await?;
Expand Down
Loading

0 comments on commit c2e718f

Please sign in to comment.