[nexus] Reincarnate Failed instances (#6503)

As of #6455, instances in the `Failed` state are allowed to be restarted. As described in [RFD 486 § 4.3], Nexus should automatically restart such instances when the instance record's `auto_restart_policy` field (née `boot_on_fault`, see #6499) indicates that it is permitted to do so. This branch implements this behavior, by adding a new `instance_reincarnation` RPW. This RPW consists of a background task which queries CRDB for instances in the `Failed` state which are eligible to be automatically restarted, and starts a new `instance-start` saga for any instances it discovers. Instances are considered eligible for reincarnation if all of the following conditions are true: - **The instance is in the `InstanceState::Failed` state.** Meaning, it must have *no active VMM* --- if the instance is in the `InstanceState::Vmm` state with an active VMM which is in `VmmState::Failed`, that indicates no instance-update saga has run to clean up the old VMM's resources. In this case, we must wait until the instance record itself has transitioned to `Failed` before attempting to restart it. - **The instance's `auto_restart_policy` permits it to be restarted.** The `auto_restart_policy` enum is changed from representing the classess of failures on which an instance may be restarted (`never`, `sled_failures_only`, and `all_failures`) was changed to a more general quality-of-service for automatic restarts. Presently, this can either be `never`, meaning that the instance can never be restarted automatically; or `best_effort`, meaning that the control plane should try to keep the instance running but is permitted to not restart it if necessary to preserve the stability of the whole system. The default policy for instances which don't provide one is `best_effort`. - **A cooldown period has elapsed since the instance's last automatic restart.**. In order to prevent instances which fail frequently from compromising the reliability of the system as a whole, we enforce a cooldown period between automatic restarts. This is tracked by setting a last-auto-restart timestamp in the `instance` record. At present, the cooldown period is always one hour unless the instance record overrides it, which currently can only be done using a test-only Nexus method. In the future, we may consider allowing the cooldown period to be configured by users, but because it presents a denial-of-service risk, setting it should probably be a more priveliged operation than creating an instance. The implementation of the actual reincarnation background task is relatively straightforward: it runs a new `DataStore::find_reincarnatable_instances` query, which returns instances in the `Failed` state, and it spawns `instance-start` sagas for any instances which satisfy the above conditions. The restart cooldown is implemented by adding a new `time_last_auto_restarted` field to the `instance` table and to the `InstanceRuntimeState` type. We add a `Reason` enum to the `instance-start` saga's `Params` that indicates why the instance is being started, and if it is `Reason::AutoRestart`, then the start saga will set the `time_last_auto_restarted` timestamp when moving the instance record to the `Starting` state. Additionally, I've added the `auto_restart_policy` field to the `params::InstanceCreate` type in the Nexus API so that instances can be created with auto-restart policies. Reviewers should note that many of the files changed in this diff were only modified by adding that field to a `params::InstanceCreate` literal. This branch adds a minimal, initial implementation of instance auto-restarting. [RFD 486 § 4.3]: https://rfd.shared.oxide.computer/rfd/486#_instances Fixes #5553
oxidecomputer · Sep 23, 2024 · c2e718f · c2e718f
1 parent 5e13d20
commit c2e718f
Show file tree

Hide file tree

Showing 68 changed files with 2,253 additions and 138 deletions.
diff --git a/common/src/api/external/mod.rs b/common/src/api/external/mod.rs
@@ -1158,6 +1158,12 @@ impl From<&InstanceCpuCount> for i64 {
 pub struct InstanceRuntimeState {
     pub run_state: InstanceState,
     pub time_run_state_updated: DateTime<Utc>,
+    /// The timestamp of the most recent time this instance was automatically
+    /// restarted by the control plane.
+    ///
+    /// If this is not present, then this instance has not been automatically
+    /// restarted.
+    pub time_last_auto_restarted: Option<DateTime<Utc>>,
 }
 
 /// View of an Instance
@@ -1179,6 +1185,52 @@ pub struct Instance {
 
     #[serde(flatten)]
     pub runtime: InstanceRuntimeState,
+
+    #[serde(flatten)]
+    pub auto_restart_status: InstanceAutoRestartStatus,
+}
+
+/// Status of control-plane driven automatic failure recovery for this instance.
+#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
+pub struct InstanceAutoRestartStatus {
+    /// `true` if this instance's auto-restart policy will permit the control
+    /// plane to automatically restart it if it enters the `Failed` state.
+    //
+    // Rename this field, as the struct is `#[serde(flatten)]`ed into the
+    // `Instance` type, and we would like the field to be prefixed with
+    // `auto_restart`.
+    #[serde(rename = "auto_restart_enabled")]
+    pub enabled: bool,
+
+    /// The time at which the auto-restart cooldown period for this instance
+    /// completes, permitting it to be automatically restarted again. If the
+    /// instance enters the `Failed` state, it will not be restarted until after
+    /// this time.
+    ///
+    /// If this is not present, then either the instance has never been
+    /// automatically restarted, or the cooldown period has already expired,
+    /// allowing the instance to be restarted immediately if it fails.
+    //
+    // Rename this field, as the struct is `#[serde(flatten)]`ed into the
+    // `Instance` type, and we would like the field to be prefixed with
+    // `auto_restart`.
+    #[serde(rename = "auto_restart_cooldown_expiration")]
+    pub cooldown_expiration: Option<DateTime<Utc>>,
+}
+
+/// A policy determining when an instance should be automatically restarted by
+/// the control plane.
+#[derive(Copy, Clone, Debug, Deserialize, Serialize, JsonSchema)]
+#[serde(rename_all = "snake_case")]
+pub enum InstanceAutoRestartPolicy {
+    /// The instance should not be automatically restarted by the control plane
+    /// if it fails.
+    Never,
+    /// If this instance is running and unexpectedly fails (e.g. due to a host
+    /// software crash or unexpected host reboot), the control plane will make a
+    /// best-effort attempt to restart it. The control plane may choose not to
+    /// restart the instance to preserve the overall availability of the system.
+    BestEffort,
 }
 
 // DISKS

diff --git a/dev-tools/omdb/src/bin/omdb/db.rs b/dev-tools/omdb/src/bin/omdb/db.rs
@@ -2881,7 +2881,9 @@ async fn cmd_db_instance_info(
         instance::dsl as instance_dsl, migration::dsl as migration_dsl,
         vmm::dsl as vmm_dsl,
     };
-    use nexus_db_model::{Instance, InstanceRuntimeState, Migration, Vmm};
+    use nexus_db_model::{
+        Instance, InstanceKarmicStatus, InstanceRuntimeState, Migration, Vmm,
+    };
     let InstanceInfoArgs { id } = args;
 
     let instance = instance_dsl::instance
@@ -2926,7 +2928,7 @@ async fn cmd_db_instance_info(
     // `nexus_db_model::Instance` type will want to make sure to update this
     // code as well. Unfortunately, we can't just destructure the struct here to
     // make sure this code breaks, since the `identity` field isn't public.
-    // So...just don't forget to do that, I  guess.
+    // So...just don't forget to do that, I guess.
     const ID: &'static str = "ID";
     const PROJECT_ID: &'static str = "project ID";
     const NAME: &'static str = "name";
@@ -2937,10 +2939,12 @@ async fn cmd_db_instance_info(
     const VCPUS: &'static str = "vCPUs";
     const MEMORY: &'static str = "memory";
     const HOSTNAME: &'static str = "hostname";
-    const AUTO_RESTART: &'static str = "auto-restart policy";
+    const AUTO_RESTART: &'static str = "auto-restart";
     const STATE: &'static str = "nexus state";
     const LAST_MODIFIED: &'static str = "last modified at";
     const LAST_UPDATED: &'static str = "last updated at";
+    const LAST_AUTO_RESTART: &'static str = "last auto-restarted at";
+    const KARMIC_STATUS: &'static str = "karmic status";
     const ACTIVE_VMM: &'static str = "active VMM ID";
     const TARGET_VMM: &'static str = "target VMM ID";
     const MIGRATION_ID: &'static str = "migration ID";
@@ -2962,6 +2966,8 @@ async fn cmd_db_instance_info(
         API_STATE,
         LAST_UPDATED,
         LAST_MODIFIED,
+        LAST_AUTO_RESTART,
+        KARMIC_STATUS,
         ACTIVE_VMM,
         TARGET_VMM,
         MIGRATION_ID,
@@ -2970,6 +2976,17 @@ async fn cmd_db_instance_info(
         MIGRATION_RECORD,
         TARGET_VMM_RECORD,
     ]);
+
+    fn print_multiline_debug(slug: &str, thing: &impl core::fmt::Debug) {
+        println!(
+            "    {slug:>WIDTH$}:\n{}",
+            textwrap::indent(
+                &format!("{thing:#?}"),
+                &" ".repeat(WIDTH - slug.len() + 8)
+            )
+        );
+    }
+
     println!("\n{:=<80}", "== INSTANCE ");
     println!("    {ID:>WIDTH$}: {}", instance.id());
     println!("    {PROJECT_ID:>WIDTH$}: {}", instance.project_id);
@@ -2985,7 +3002,7 @@ async fn cmd_db_instance_info(
     println!("    {VCPUS:>WIDTH$}: {}", instance.ncpus.0 .0);
     println!("    {MEMORY:>WIDTH$}: {}", instance.memory.0);
     println!("    {HOSTNAME:>WIDTH$}: {}", instance.hostname);
-    println!("    {AUTO_RESTART:>WIDTH$}: {:?}", instance.auto_restart_policy);
+    print_multiline_debug(AUTO_RESTART, &instance.auto_restart);
     println!("\n{:=<80}", "== RUNTIME STATE ");
     let InstanceRuntimeState {
         time_updated,
@@ -2994,6 +3011,7 @@ async fn cmd_db_instance_info(
         migration_id,
         nexus_state,
         r#gen,
+        time_last_auto_restarted,
     } = instance.runtime_state;
     println!("    {STATE:>WIDTH$}: {nexus_state:?}");
     let effective_state = InstanceAndActiveVmm::determine_effective_state(
@@ -3008,6 +3026,22 @@ async fn cmd_db_instance_info(
         "    {LAST_UPDATED:>WIDTH$}: {time_updated:?} (generation {})",
         r#gen.0
     );
+    println!("    {LAST_AUTO_RESTART:>WIDTH$}: {time_last_auto_restarted:?}");
+    match instance.auto_restart.status(&instance.runtime_state) {
+        InstanceKarmicStatus::NotFailed => {}
+        InstanceKarmicStatus::Ready => {
+            println!("(i) {KARMIC_STATUS:>WIDTH$}: ready to reincarnate!");
+        }
+        InstanceKarmicStatus::Forbidden => {
+            println!("(i) {KARMIC_STATUS:>WIDTH$}: reincarnation forbidden");
+        }
+        InstanceKarmicStatus::CoolingDown(remaining) => {
+            println!(
+                "/!\\ {KARMIC_STATUS:>WIDTH$}: cooling down \
+                 ({remaining:?} remaining)"
+            );
+        }
+    }
     println!("    {ACTIVE_VMM:>WIDTH$}: {propolis_id:?}");
     println!("    {TARGET_VMM:>WIDTH$}: {dst_propolis_id:?}");
     println!(

diff --git a/dev-tools/omdb/src/bin/omdb/nexus.rs b/dev-tools/omdb/src/bin/omdb/nexus.rs
@@ -36,6 +36,7 @@ use nexus_db_queries::db::lookup::LookupPath;
 use nexus_saga_recovery::LastPass;
 use nexus_types::deployment::Blueprint;
 use nexus_types::internal_api::background::AbandonedVmmReaperStatus;
+use nexus_types::internal_api::background::InstanceReincarnationStatus;
 use nexus_types::internal_api::background::InstanceUpdaterStatus;
 use nexus_types::internal_api::background::LookupRegionPortStatus;
 use nexus_types::internal_api::background::RegionReplacementDriverStatus;
@@ -1781,6 +1782,80 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
                 }
             }
         }
+    } else if name == "instance_reincarnation" {
+        match serde_json::from_value::<InstanceReincarnationStatus>(
+            details.clone(),
+        ) {
+            Err(error) => eprintln!(
+                "warning: failed to interpret task details: {:?}: {:?}",
+                error, details
+            ),
+            Ok(InstanceReincarnationStatus {
+                instances_found,
+                instances_reincarnated,
+                changed_state,
+                query_error,
+                restart_errors,
+            }) => {
+                const FOUND: &'static str =
+                    "instances eligible for reincarnation:";
+                const REINCARNATED: &'static str = "  instances reincarnated:";
+                const CHANGED_STATE: &'static str =
+                    "  instances which changed state before they could be reincarnated:";
+                const ERRORS: &'static str =
+                    "  instances which failed to be reincarnated:";
+                const COOLDOWN_PERIOD: &'static str =
+                    "default cooldown period:";
+                const WIDTH: usize = const_max_len(&[
+                    FOUND,
+                    REINCARNATED,
+                    CHANGED_STATE,
+                    ERRORS,
+                    COOLDOWN_PERIOD,
+                ]);
+                let n_restart_errors = restart_errors.len();
+                let n_restarted = instances_reincarnated.len();
+                let n_changed_state = changed_state.len();
+                println!("    {FOUND:<WIDTH$} {instances_found:>3}");
+                println!("    {REINCARNATED:<WIDTH$} {n_restarted:>3}");
+                println!("    {CHANGED_STATE:<WIDTH$} {n_changed_state:>3}",);
+                println!("    {ERRORS:<WIDTH$} {n_restart_errors:>3}");
+
+                if let Some(e) = query_error {
+                    println!(
+                        "    an error occurred while searching for instances \
+                         to reincarnate:\n      {e}",
+                    );
+                }
+
+                if n_restart_errors > 0 {
+                    println!(
+                        "    errors occurred while restarting the following \
+                         instances:"
+                    );
+                    for (id, error) in restart_errors {
+                        println!("    > {id}: {error}");
+                    }
+                }
+
+                if n_restarted > 0 {
+                    println!("    the following instances have reincarnated:");
+                    for id in instances_reincarnated {
+                        println!("    > {id}")
+                    }
+                }
+
+                if n_changed_state > 0 {
+                    println!(
+                        "    the following instances states changed before \
+                         they could be reincarnated:"
+                    );
+                    for id in changed_state {
+                        println!("    > {id}")
+                    }
+                }
+            }
+        };
     } else {
         println!(
             "warning: unknown background task: {:?} \

diff --git a/dev-tools/omdb/tests/env.out b/dev-tools/omdb/tests/env.out
@@ -86,6 +86,11 @@ task: "external_endpoints"
     on each one
 
 
+task: "instance_reincarnation"
+    schedules start sagas for failed instances that can be automatically
+    restarted
+
+
 task: "instance_updater"
     detects if instances require update sagas and schedules them
 
@@ -252,6 +257,11 @@ task: "external_endpoints"
     on each one
 
 
+task: "instance_reincarnation"
+    schedules start sagas for failed instances that can be automatically
+    restarted
+
+
 task: "instance_updater"
     detects if instances require update sagas and schedules them
 
@@ -405,6 +415,11 @@ task: "external_endpoints"
     on each one
 
 
+task: "instance_reincarnation"
+    schedules start sagas for failed instances that can be automatically
+    restarted
+
+
 task: "instance_updater"
     detects if instances require update sagas and schedules them
 

diff --git a/dev-tools/omdb/tests/successes.out b/dev-tools/omdb/tests/successes.out
@@ -302,6 +302,11 @@ task: "external_endpoints"
     on each one
 
 
+task: "instance_reincarnation"
+    schedules start sagas for failed instances that can be automatically
+    restarted
+
+
 task: "instance_updater"
     detects if instances require update sagas and schedules them
 
@@ -518,6 +523,16 @@ task: "external_endpoints"
 
     TLS certificates: 0
 
+task: "instance_reincarnation"
+  configured period: every 1m
+  currently executing: no
+  last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
+    started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
+    instances eligible for reincarnation:                                0
+      instances reincarnated:                                            0
+      instances which changed state before they could be reincarnated:   0
+      instances which failed to be reincarnated:                         0
+
 task: "instance_updater"
   configured period: every <REDACTED_DURATION>s
   currently executing: no
@@ -948,6 +963,16 @@ task: "external_endpoints"
 
     TLS certificates: 0
 
+task: "instance_reincarnation"
+  configured period: every 1m
+  currently executing: no
+  last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
+    started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
+    instances eligible for reincarnation:                                0
+      instances reincarnated:                                            0
+      instances which changed state before they could be reincarnated:   0
+      instances which failed to be reincarnated:                         0
+
 task: "instance_updater"
   configured period: every <REDACTED_DURATION>s
   currently executing: no

diff --git a/end-to-end-tests/src/instance_launch.rs b/end-to-end-tests/src/instance_launch.rs
@@ -76,6 +76,7 @@ async fn instance_launch() -> Result<()> {
                 ssh_key_name.clone(),
             )]),
             start: true,
+            auto_restart_policy: Default::default(),
         })
         .send()
         .await?;