[nexus] Reincarnate Failed instances #6503

Merged 111 commits on Sep 23, 2024
Changes from 102 commits
Commits
abfde82
[nexus] add instance_reincarnation RPW
hawkw Sep 1, 2024
29980dd
add super fancy OMDB output
hawkw Sep 1, 2024
a3574e6
update sagas trigger reincarnation for Failed instances
hawkw Sep 2, 2024
9652402
add a rudimentary background task unit test
hawkw Sep 2, 2024
9731a80
unit test ineligible instances are not reincarnated
hawkw Sep 2, 2024
f344612
nicer output for tests
hawkw Sep 2, 2024
ba72fc4
update to track `SledFailuresOnly` policy
hawkw Sep 2, 2024
fa61cfa
start adding auto-restart to instance-create
hawkw Sep 3, 2024
078b90b
post-rebase update for nullable restart policies
hawkw Sep 4, 2024
7a23242
need an auto-restart policy there
hawkw Sep 4, 2024
24ffa89
handle `Option` in update saga too (oops)
hawkw Sep 4, 2024
f38b42e
blergh
hawkw Sep 4, 2024
3f08a03
add `time_last_auto_restarted` to instance records
hawkw Sep 5, 2024
c10a63d
even more missing auto-restart policies
hawkw Sep 5, 2024
356e01d
set auto-restart time in start sagas if they are auto-restarts
hawkw Sep 5, 2024
dbab298
also include start reason in logs
hawkw Sep 5, 2024
79a746e
a bajillion more missing auto-restart policies
hawkw Sep 6, 2024
98d5fcc
remove unneeded mut
hawkw Sep 6, 2024
fb7f7ef
AGH I FORGOT THE ORDER OF THE `table!` MACRO IS LOAD BEARING
hawkw Sep 6, 2024
6374931
the migration needs to not be empty for it to work
hawkw Sep 6, 2024
531cc87
actually add chill-out time between reincarnations
hawkw Sep 6, 2024
ddebd49
add chill-out time in OMDB
hawkw Sep 6, 2024
e41b287
add an integration test
hawkw Sep 6, 2024
98b430d
just use SQL_BATCH_SIZE
hawkw Sep 9, 2024
41338eb
make cooldown period configurable
hawkw Sep 9, 2024
2e2ef38
wip test for cooldowns
hawkw Sep 9, 2024
84f1aca
post rebase fixy
hawkw Sep 9, 2024
4926d93
finish test
hawkw Sep 9, 2024
124cdd8
bblghghghghghh
hawkw Sep 9, 2024
be27ffa
blurgh
hawkw Sep 10, 2024
0dfecac
fix check constraint violation in test
hawkw Sep 10, 2024
df82948
ahh whoops
hawkw Sep 10, 2024
51fe472
you have to actually override the default if you want it to work
hawkw Sep 10, 2024
5f47cbe
oops, test should pretend instance 1 is started
hawkw Sep 10, 2024
5b99576
derp
hawkw Sep 10, 2024
703b17e
tests run real sagas to avoid check constraints
hawkw Sep 10, 2024
33a8e81
ensmallerate test instances
hawkw Sep 10, 2024
e6cf49f
whoops forgot to update omdb again
hawkw Sep 10, 2024
2bb380e
print each action in failing test to stderr
hawkw Sep 11, 2024
6bf7185
fix test racing with simulated sled-agent
hawkw Sep 11, 2024
1e95141
fix ugly schema for `InstanceAutoRestart` policies
hawkw Sep 12, 2024
0c758af
remove unused `Default` impl
hawkw Sep 12, 2024
244f4d8
review feedback: fix comments, treat all Conflicts the same
hawkw Sep 12, 2024
bd1150b
review feedback: fix typos
hawkw Sep 12, 2024
0e23172
review feedback: fix typo that's on two lines
hawkw Sep 12, 2024
f59c5ad
attempt to factor out reincarnation filtering
hawkw Sep 12, 2024
7239bb9
redo reincarnation query to be random
hawkw Sep 12, 2024
a7be567
cooling down is filtered out at the db level
hawkw Sep 12, 2024
1438054
fix tests
hawkw Sep 12, 2024
70ca1a5
remove instances_cooling_down
hawkw Sep 13, 2024
6fb14ac
actually wait for the sagas to finish before doing the next batch
hawkw Sep 13, 2024
62727f4
Revert randomized query thing
hawkw Sep 13, 2024
8e32938
I guess we can at least keep the cooldown query...
hawkw Sep 13, 2024
62ec4bd
ugh whoops
hawkw Sep 13, 2024
b60c3b2
my bad lol
hawkw Sep 13, 2024
5d759aa
update omdb
hawkw Sep 13, 2024
bb07983
add auto-restart timestamps to external api
hawkw Sep 13, 2024
e194428
placate clippy
hawkw Sep 13, 2024
bad5710
add concurrency limit, query order randomization
hawkw Sep 13, 2024
37789e5
review feedback: comment nits
hawkw Sep 13, 2024
a3ccabb
review feedback: logging tweaks
hawkw Sep 13, 2024
17efc98
QoS-oriented restart policies, rm SledFailuresOnly
hawkw Sep 14, 2024
ae1fdc6
&!@&$*&#($% TRAILING COMMA AGHHH
hawkw Sep 14, 2024
c296ef8
rework auto-restart cooldown
hawkw Sep 15, 2024
6d73fd7
redo cooldown even more
hawkw Sep 15, 2024
8bace8e
big ol' API rework
hawkw Sep 16, 2024
5677980
fixup tests
hawkw Sep 16, 2024
b83a0bd
make external API cooldowns num secs, add min
hawkw Sep 16, 2024
1affc77
lol some of the tests used `Never` for everything
hawkw Sep 16, 2024
f341d24
whoops 😅
hawkw Sep 16, 2024
58b6541
@gjcolombo's nicer comments for `BestEffort`
hawkw Sep 16, 2024
8ac1dec
integration test deflakement
hawkw Sep 16, 2024
7231647
diesel trait docs
hawkw Sep 17, 2024
de17f14
fix typo in migration (oops)
hawkw Sep 17, 2024
d3aa320
set restart timestamp in first start saga action
hawkw Sep 17, 2024
37fa327
represent cooldown views as expiration time
hawkw Sep 17, 2024
e4a1734
dbinit and migrations need columns in the same order
hawkw Sep 17, 2024
a4afff9
HAHAHA I just totally forgot to add the migration
hawkw Sep 17, 2024
cf777f8
auto_restart_v2 migration needs to drop old enum
hawkw Sep 17, 2024
4a2b0a5
policy THEN cooldown
hawkw Sep 17, 2024
d1a9329
remove cooldown from the public API for now
hawkw Sep 17, 2024
dc2e64c
update omdb success cases (oops)
hawkw Sep 17, 2024
c6fab96
this no longer exists
hawkw Sep 17, 2024
4fb9a96
rm accidentally committed file
hawkw Sep 18, 2024
a0256e0
explain query
hawkw Sep 18, 2024
38b003c
go back to paginating
hawkw Sep 18, 2024
930f376
oh right that was why we went back to paginating
hawkw Sep 18, 2024
1c0513b
add a nice index of instances by state
hawkw Sep 18, 2024
21ea63e
mechanically-separated migration product
hawkw Sep 18, 2024
2501ad3
remove EXPLAIN test
hawkw Sep 18, 2024
ec0fac8
remove unused import
hawkw Sep 19, 2024
b18be10
more descriptive comment of timedelta serialization
hawkw Sep 19, 2024
2ffed77
redo migration to get rid of ugly `_v2` suffix
hawkw Sep 19, 2024
0f61e48
remove more unused imports (oops)
hawkw Sep 19, 2024
becccf4
whoops i put the migration in the wrong place
hawkw Sep 19, 2024
dc11190
add data migration tests for auto-restart policy
hawkw Sep 20, 2024
467039b
change default policy to best-effort
hawkw Sep 20, 2024
3428500
fix comment
hawkw Sep 20, 2024
1485f72
add explicit disable flag in config
hawkw Sep 20, 2024
e0e8929
Merge remote-tracking branch 'origin' into eliza/instance-resurrection
hawkw Sep 20, 2024
3d5ccc2
fix typo in comment
hawkw Sep 20, 2024
a24eca2
typo fix changed openapi spec again
hawkw Sep 20, 2024
aea76aa
Use sprockets on the bootstrap network (#6485)
labbott Sep 20, 2024
435d2cd
Soft-delete volumes in a transaction, not CTE (#6623)
jmpesp Sep 20, 2024
307bfda
Return richer enum types from datastore functions (#6604)
jmpesp Sep 20, 2024
5a2e626
[omdb] Add `omdb instance info` command (#6610)
hawkw Sep 20, 2024
d066a4a
Talk to ClickHouse over the native protocol (#6584)
bnaecker Sep 20, 2024
cd3236d
Add inventory `Collection` to `BlueprintBuilder` (#6624)
andrewjstone Sep 20, 2024
efd0ffd
include reincarnation details in OMDB and logs
hawkw Sep 21, 2024
7f459b8
Merge branch 'main' into eliza/instance-resurrection
hawkw Sep 23, 2024
64033be
cargo fmt
hawkw Sep 23, 2024
52 changes: 52 additions & 0 deletions common/src/api/external/mod.rs
@@ -1158,6 +1158,12 @@ impl From<&InstanceCpuCount> for i64 {
pub struct InstanceRuntimeState {
pub run_state: InstanceState,
pub time_run_state_updated: DateTime<Utc>,
/// The timestamp of the most recent time this instance was automatically
/// restarted by the control plane.
///
/// If this is not present, then this instance has not been automatically
/// restarted.
pub time_last_auto_restarted: Option<DateTime<Utc>>,
}

/// View of an Instance
@@ -1179,6 +1185,52 @@ pub struct Instance {

#[serde(flatten)]
pub runtime: InstanceRuntimeState,

#[serde(flatten)]
pub auto_restart_status: InstanceAutoRestartStatus,
}

/// Status of control-plane driven automatic failure recovery for this instance.
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct InstanceAutoRestartStatus {
/// `true` if this instance's auto-restart policy will permit the control
/// plane to automatically restart it if it enters the `Failed` state.
//
// Rename this field, as the struct is `#[serde(flatten)]`ed into the
// `Instance` type, and we would like the field to be prefixed with
// `auto_restart`.
#[serde(rename = "auto_restart_enabled")]
pub enabled: bool,

/// The time at which the auto-restart cooldown period for this instance
/// completes, permitting it to be automatically restarted again. If the
/// instance enters the `Failed` state, it will not be restarted until after
/// this time.
///
/// If this is not present, then either the instance has never been
/// automatically restarted, or the cooldown period has already expired,
/// allowing the instance to be restarted immediately if it fails.
//
// Rename this field, as the struct is `#[serde(flatten)]`ed into the
// `Instance` type, and we would like the field to be prefixed with
// `auto_restart`.
#[serde(rename = "auto_restart_cooldown_expiration")]
pub cooldown_expiration: Option<DateTime<Utc>>,
}

/// A policy determining when an instance should be automatically restarted by
/// the control plane.
#[derive(Copy, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[serde(rename_all = "snake_case")]
pub enum InstanceAutoRestartPolicy {
/// The instance should not be automatically restarted by the control plane
/// if it fails.
Never,
/// If this instance is running and unexpectedly fails (e.g. due to a host
/// software crash or unexpected host reboot), the control plane will make a
/// best-effort attempt to restart it. The control plane may choose not to
/// restart the instance to preserve the overall availability of the system.
BestEffort,
}

// DISKS
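
The doc comments on `InstanceAutoRestartStatus` describe when `cooldown_expiration` is present and when it is omitted. A minimal sketch of that rule follows, under stated assumptions: the type and function names (`AutoRestartPolicy`, `AutoRestartStatus`, `auto_restart_status`) are hypothetical stand-ins rather than the code in this PR, `SystemTime` replaces `DateTime<Utc>` so the sketch needs only the standard library, and the expiration is assumed to be the last auto-restart time plus a cooldown duration.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for `InstanceAutoRestartPolicy`.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum AutoRestartPolicy {
    Never,
    BestEffort,
}

/// Hypothetical mirror of `InstanceAutoRestartStatus`.
#[derive(Debug, PartialEq, Eq)]
struct AutoRestartStatus {
    enabled: bool,
    cooldown_expiration: Option<SystemTime>,
}

/// Assumption: the cooldown ends at `time_last_auto_restarted + cooldown`.
/// The expiration is reported only while it is still in the future, so an
/// instance that was never auto-restarted, or whose cooldown has already
/// elapsed, gets `None`.
fn auto_restart_status(
    policy: AutoRestartPolicy,
    time_last_auto_restarted: Option<SystemTime>,
    cooldown: Duration,
    now: SystemTime,
) -> AutoRestartStatus {
    let enabled = policy == AutoRestartPolicy::BestEffort;
    let cooldown_expiration = time_last_auto_restarted
        .map(|last| last + cooldown)
        .filter(|expires| *expires > now);
    AutoRestartStatus { enabled, cooldown_expiration }
}
```

With a 60-second cooldown, an instance auto-restarted 30 seconds ago would report an expiration 30 seconds in the future, while one restarted 100 seconds ago would report `None`.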
75 changes: 75 additions & 0 deletions dev-tools/omdb/src/bin/omdb/nexus.rs
@@ -35,6 +35,7 @@ use nexus_db_queries::db::lookup::LookupPath;
use nexus_saga_recovery::LastPass;
use nexus_types::deployment::Blueprint;
use nexus_types::internal_api::background::AbandonedVmmReaperStatus;
use nexus_types::internal_api::background::InstanceReincarnationStatus;
use nexus_types::internal_api::background::InstanceUpdaterStatus;
use nexus_types::internal_api::background::LookupRegionPortStatus;
use nexus_types::internal_api::background::RegionReplacementDriverStatus;
@@ -1780,6 +1781,80 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
}
}
}
} else if name == "instance_reincarnation" {
match serde_json::from_value::<InstanceReincarnationStatus>(
details.clone(),
) {
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
),
Ok(InstanceReincarnationStatus {
instances_found,
instances_reincarnated,
changed_state,
query_error,
restart_errors,
}) => {
const FOUND: &str =
"instances eligible for reincarnation:";
const REINCARNATED: &str = " instances reincarnated:";
const CHANGED_STATE: &str =
" instances which changed state before they could be reincarnated:";
const ERRORS: &str =
" instances which failed to be reincarnated:";
const COOLDOWN_PERIOD: &str =
"default cooldown period:";
const WIDTH: usize = const_max_len(&[
FOUND,
REINCARNATED,
CHANGED_STATE,
ERRORS,
COOLDOWN_PERIOD,
]);
let n_restart_errors = restart_errors.len();
let n_restarted = instances_reincarnated.len();
let n_changed_state = changed_state.len();
println!(" {FOUND:<WIDTH$} {instances_found:>3}");
println!(" {REINCARNATED:<WIDTH$} {n_restarted:>3}");
println!(" {CHANGED_STATE:<WIDTH$} {n_changed_state:>3}");
println!(" {ERRORS:<WIDTH$} {n_restart_errors:>3}");

if let Some(e) = query_error {
println!(
" an error occurred while searching for instances \
to reincarnate:\n {e}",
);
}

if n_restart_errors > 0 {
println!(
" errors occurred while restarting the following \
instances:"
);
for (id, error) in restart_errors {
println!(" > {id}: {error}");
}
}

if n_restarted > 0 {
println!(" the following instances have reincarnated:");
for id in instances_reincarnated {
println!(" > {id}")
}
}

if n_changed_state > 0 {
println!(
" the following instances' states changed before \
they could be reincarnated:"
);
for id in changed_state {
println!(" > {id}")
}
}
}
};
} else {
println!(
"warning: unknown background task: {:?} \
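
The column alignment in the `instance_reincarnation` branch depends on `const_max_len`, which is not shown in this diff. A minimal sketch of how such a helper could work follows; the implementation here is an assumption, as is the `format_row` wrapper, and only two of the labels are reproduced.

```rust
/// Hypothetical implementation: the real `const_max_len` is not part of
/// this diff. Returns the length in bytes of the longest string.
const fn const_max_len(strs: &[&str]) -> usize {
    let mut max = 0;
    let mut i = 0;
    // `for` loops are not allowed in `const fn`, so iterate with `while`.
    while i < strs.len() {
        let len = strs[i].len();
        if len > max {
            max = len;
        }
        i += 1;
    }
    max
}

const FOUND: &str = "instances eligible for reincarnation:";
const REINCARNATED: &str = "  instances reincarnated:";
/// Computed at compile time, so every row pads to the same width.
const WIDTH: usize = const_max_len(&[FOUND, REINCARNATED]);

/// Hypothetical helper: left-align the label to `WIDTH` columns, then
/// right-align the count in a three-character field.
fn format_row(label: &str, count: usize) -> String {
    format!("{label:<WIDTH$} {count:>3}")
}
```

Because `WIDTH` is a constant in scope, the format string can use it directly as a width argument (`WIDTH$`), which is the same pattern the `println!` calls in the diff rely on.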
15 changes: 15 additions & 0 deletions dev-tools/omdb/tests/env.out
@@ -86,6 +86,11 @@ task: "external_endpoints"
on each one


task: "instance_reincarnation"
schedules start sagas for failed instances that can be automatically
restarted


task: "instance_updater"
detects if instances require update sagas and schedules them

@@ -252,6 +257,11 @@ task: "external_endpoints"
on each one


task: "instance_reincarnation"
schedules start sagas for failed instances that can be automatically
restarted


task: "instance_updater"
detects if instances require update sagas and schedules them

@@ -405,6 +415,11 @@ task: "external_endpoints"
on each one


task: "instance_reincarnation"
schedules start sagas for failed instances that can be automatically
restarted


task: "instance_updater"
detects if instances require update sagas and schedules them

25 changes: 25 additions & 0 deletions dev-tools/omdb/tests/successes.out
@@ -302,6 +302,11 @@ task: "external_endpoints"
on each one


task: "instance_reincarnation"
schedules start sagas for failed instances that can be automatically
restarted


task: "instance_updater"
detects if instances require update sagas and schedules them

@@ -518,6 +523,16 @@ task: "external_endpoints"

TLS certificates: 0

task: "instance_reincarnation"
configured period: every 1m
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
instances eligible for reincarnation: 0
instances reincarnated: 0
instances which changed state before they could be reincarnated: 0
instances which failed to be reincarnated: 0

task: "instance_updater"
configured period: every <REDACTED_DURATION>s
currently executing: no
@@ -948,6 +963,16 @@ task: "external_endpoints"

TLS certificates: 0

task: "instance_reincarnation"
configured period: every 1m
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
instances eligible for reincarnation: 0
instances reincarnated: 0
instances which changed state before they could be reincarnated: 0
instances which failed to be reincarnated: 0

task: "instance_updater"
configured period: every <REDACTED_DURATION>s
currently executing: no
1 change: 1 addition & 0 deletions end-to-end-tests/src/instance_launch.rs
@@ -76,6 +76,7 @@ async fn instance_launch() -> Result<()> {
ssh_key_name.clone(),
)]),
start: true,
auto_restart_policy: Default::default(),
})
.send()
.await?;
25 changes: 25 additions & 0 deletions nexus-config/src/nexus_config.rs
@@ -382,6 +382,8 @@ pub struct BackgroundTaskConfig {
pub instance_watcher: InstanceWatcherConfig,
/// configuration for instance updater task
pub instance_updater: InstanceUpdaterConfig,
/// configuration for instance reincarnation task
pub instance_reincarnation: InstanceReincarnationConfig,
/// configuration for service VPC firewall propagation task
pub service_firewall_propagation: ServiceFirewallPropagationConfig,
/// configuration for v2p mapping propagation task
@@ -590,6 +592,23 @@ pub struct InstanceUpdaterConfig {
pub disable: bool,
}

#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct InstanceReincarnationConfig {
/// period (in seconds) for periodic activations of this background task
#[serde_as(as = "DurationSeconds<u64>")]
pub period_secs: Duration,

/// disable background checks for instances in need of reincarnation.
///
/// This is an emergency lever for support / operations. It should only be
/// necessary if something has gone extremely wrong.
///
/// Default: Off
#[serde(default)]
pub disable: bool,
}

#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct ServiceFirewallPropagationConfig {
@@ -912,6 +931,7 @@ mod test {
instance_watcher.period_secs = 30
instance_updater.period_secs = 30
instance_updater.disable = false
instance_reincarnation.period_secs = 67
service_firewall_propagation.period_secs = 300
v2p_mapping_propagation.period_secs = 30
abandoned_vmm_reaper.period_secs = 60
@@ -1067,6 +1087,10 @@ mod test {
period_secs: Duration::from_secs(30),
disable: false,
},
instance_reincarnation: InstanceReincarnationConfig {
period_secs: Duration::from_secs(67),
disable: false,
},
service_firewall_propagation:
ServiceFirewallPropagationConfig {
period_secs: Duration::from_secs(300),
@@ -1170,6 +1194,7 @@ mod test {
region_replacement_driver.period_secs = 30
instance_watcher.period_secs = 30
instance_updater.period_secs = 30
instance_reincarnation.period_secs = 67
service_firewall_propagation.period_secs = 300
v2p_mapping_propagation.period_secs = 30
abandoned_vmm_reaper.period_secs = 60
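
For reference, a configuration fragment for the new task might look like the following. Both keys come from `InstanceReincarnationConfig` above; the values are illustrative assumptions, and the enclosing table is left out because it is not shown in this diff.

```toml
# Hypothetical fragment: place it in the same table as the other
# background-task settings (for example, next to `instance_updater.*`).
instance_reincarnation.period_secs = 60
# Optional: `#[serde(default)]` makes this `false` when omitted.
# Setting it to `true` is the emergency lever described in the doc comment.
instance_reincarnation.disable = false
```

Since `disable` has a serde default, existing configuration files that set only `period_secs` (as the second test fixture above does) should continue to parse.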