Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus] handle sled-agent errors as described in RFD 486 #6455

Merged
merged 25 commits into from
Sep 2, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
8e36602
Change `mark_instance_failed` to `mark_vmm_failed`
hawkw Aug 26, 2024
5a8e698
Handle `Failed` active VMMs in instance-update saga
hawkw Aug 26, 2024
5247f66
janky first pass on hacking up instance_watcher
hawkw Aug 26, 2024
fd6a2c2
make sim sled agent return realistic error codes
hawkw Aug 27, 2024
01e06ea
add tests for (some) new failed instance behavior
hawkw Aug 27, 2024
36cb7ae
allow failed instances to be restarted, don't go to failed until rest…
hawkw Aug 27, 2024
e1c968a
run update saga synchronously in `mark_vmm_failed`
hawkw Aug 28, 2024
c8f96e3
update saga tests should expect failed vmms to be torn down
hawkw Aug 28, 2024
c6d4fe0
add `VmmState::Creating`
hawkw Aug 28, 2024
12a467f
rustfmt...
hawkw Aug 28, 2024
84d5235
okay FINE (i really hope this is safe)
hawkw Aug 28, 2024
816448c
back out some unnecessary changes
hawkw Aug 29, 2024
68da5b3
remove "put" from the thing we use for health checks
hawkw Aug 29, 2024
51f2c78
return the error codes described in RFD 486
hawkw Aug 29, 2024
0d7635c
remove some defunct comments
hawkw Aug 30, 2024
e5db540
whoops, the error type change broke this
hawkw Aug 30, 2024
4ece4de
more tests
hawkw Aug 30, 2024
ace4faa
add remaining tests
hawkw Aug 30, 2024
090b8d2
also return gateway timeout
hawkw Aug 30, 2024
62caa90
ah, i had gotten the error conversion rules wrong
hawkw Aug 30, 2024
288d20b
instances with failed vmms should be candidates for background update
hawkw Aug 30, 2024
e6e8cb6
address @gjcolombo's review feedback
hawkw Aug 30, 2024
01bd26c
return 503 when instance is wayyyyy gone
hawkw Aug 30, 2024
f171767
purge VmmState::Creating from APIs
hawkw Aug 30, 2024
1b3fcad
blehhh rustfmt
hawkw Aug 30, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions common/src/api/internal/nexus.rs
Original file line number Diff line number Diff line change
Expand Up @@ -102,6 +102,17 @@ pub enum VmmState {
Destroyed,
}

impl VmmState {
/// States in which the VMM no longer exists and must be cleaned up.
pub const TERMINAL_STATES: &'static [Self] =
&[Self::Failed, Self::Destroyed];

/// Returns `true` if this VMM is in a terminal state.
pub fn is_terminal(&self) -> bool {
Self::TERMINAL_STATES.contains(self)
}
}

/// The dynamic runtime properties of an individual VMM process.
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct VmmRuntimeState {
Expand Down
10 changes: 9 additions & 1 deletion dev-tools/omdb/src/bin/omdb/nexus.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1394,6 +1394,9 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
/// number of instances found with destroyed active VMMs
destroyed_active_vmms: usize,

/// number of instances found with failed active VMMs
failed_active_vmms: usize,

/// number of instances found with terminated active migrations
terminated_active_migrations: usize,

Expand All @@ -1419,6 +1422,7 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
),
Ok(UpdaterStatus {
destroyed_active_vmms,
failed_active_vmms,
terminated_active_migrations,
sagas_started,
sagas_completed,
Expand All @@ -1436,9 +1440,13 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
destroyed_active_vmms + terminated_active_migrations
);
println!(
" instances with destroyed active VMMs: {}",
" instances with Destroyed active VMMs: {}",
destroyed_active_vmms,
);
println!(
" instances with Failed active VMMs: {}",
failed_active_vmms,
);
println!(
" instances with terminated active migrations: {}",
terminated_active_migrations,
Expand Down
6 changes: 4 additions & 2 deletions dev-tools/omdb/tests/successes.out
Original file line number Diff line number Diff line change
Expand Up @@ -524,7 +524,8 @@ task: "instance_updater"
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
total instances in need of updates: 0
instances with destroyed active VMMs: 0
instances with Destroyed active VMMs: 0
instances with Failed active VMMs: 0
instances with terminated active migrations: 0
update sagas started: 0
update sagas completed successfully: 0
Expand Down Expand Up @@ -950,7 +951,8 @@ task: "instance_updater"
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
total instances in need of updates: 0
instances with destroyed active VMMs: 0
instances with Destroyed active VMMs: 0
instances with Failed active VMMs: 0
instances with terminated active migrations: 0
update sagas started: 0
update sagas completed successfully: 0
Expand Down
3 changes: 2 additions & 1 deletion nexus/db-model/src/schema_versions.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ use std::collections::BTreeMap;
///
/// This must be updated when you change the database schema. Refer to
/// schema/crdb/README.adoc in the root of this repository for details.
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(93, 0, 0);
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(94, 0, 0);

/// List of all past database schema versions, in *reverse* order
///
Expand All @@ -29,6 +29,7 @@ static KNOWN_VERSIONS: Lazy<Vec<KnownVersion>> = Lazy::new(|| {
// | leaving the first copy as an example for the next person.
// v
// KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
KnownVersion::new(94, "put-back-creating-vmm-state"),
KnownVersion::new(93, "dataset-kinds-zone-and-debug"),
KnownVersion::new(92, "lldp-link-config-nullable"),
KnownVersion::new(91, "add-management-gateway-producer-kind"),
Expand Down
16 changes: 4 additions & 12 deletions nexus/db-model/src/vmm.rs
Original file line number Diff line number Diff line change
Expand Up @@ -60,27 +60,19 @@ pub struct Vmm {
pub runtime: VmmRuntimeState,
}

/// The set of states that a VMM can have when it is created.
pub enum VmmInitialState {
Starting,
Migrating,
}

impl Vmm {
/// Creates a new VMM record.
///
/// The new VMM record will be in [`VmmState::Creating`] until it is
/// registered with a sled-agent.
pub fn new(
id: PropolisUuid,
instance_id: InstanceUuid,
sled_id: SledUuid,
propolis_ip: ipnetwork::IpNetwork,
propolis_port: u16,
initial_state: VmmInitialState,
) -> Self {
let now = Utc::now();
let state = match initial_state {
VmmInitialState::Starting => VmmState::Starting,
VmmInitialState::Migrating => VmmState::Migrating,
};

Self {
id: id.into_untyped_uuid(),
Expand All @@ -91,7 +83,7 @@ impl Vmm {
propolis_ip,
propolis_port: SqlU16(propolis_port),
runtime: VmmRuntimeState {
state,
state: VmmState::Creating,
time_state_updated: now,
gen: Generation::new(),
},
Expand Down
61 changes: 54 additions & 7 deletions nexus/db-model/src/vmm_state.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

use super::impl_enum_type;
use omicron_common::api::internal::nexus::VmmState as ApiState;
use serde::Deserialize;
use serde::Serialize;
use std::fmt;
Expand All @@ -16,6 +17,7 @@ impl_enum_type!(
#[diesel(sql_type = VmmStateEnum)]
pub enum VmmState;

Creating => b"creating"
Starting => b"starting"
Running => b"running"
Stopping => b"stopping"
Expand All @@ -30,6 +32,7 @@ impl_enum_type!(
impl VmmState {
pub fn label(&self) -> &'static str {
match self {
VmmState::Creating => "creating",
VmmState::Starting => "starting",
VmmState::Running => "running",
VmmState::Stopping => "stopping",
Expand All @@ -45,7 +48,26 @@ impl VmmState {
/// States in which it is safe to deallocate a VMM's sled resources and mark
/// it as deleted.
pub const DESTROYABLE_STATES: &'static [Self] =
&[Self::Destroyed, Self::SagaUnwound];
&[Self::Destroyed, Self::Failed, Self::SagaUnwound];

pub const TERMINAL_STATES: &'static [Self] =
&[Self::Destroyed, Self::Failed];

/// States in which a VMM record is present in the database but is not
/// resident on a sled, either because it does not yet exist, was produced
/// by an unwound update saga and will never exist, or has already been
/// destroyed.
pub const NONEXISTENT_STATES: &'static [Self] =
&[Self::Creating, Self::SagaUnwound, Self::Destroyed];

pub fn is_terminal(&self) -> bool {
Self::TERMINAL_STATES.contains(self)
}
/// Returns `true` if the instance is in a state in which it exists on a
/// sled.
pub fn exists_on_sled(&self) -> bool {
!Self::NONEXISTENT_STATES.contains(self)
}
}

impl fmt::Display for VmmState {
Expand All @@ -58,7 +80,9 @@ impl From<VmmState> for omicron_common::api::internal::nexus::VmmState {
fn from(value: VmmState) -> Self {
use omicron_common::api::internal::nexus::VmmState as Output;
match value {
VmmState::Starting => Output::Starting,
// The `Creating` state is internal to Nexus; the outside world
// should treat it as equivalent to `Starting`.
VmmState::Creating | VmmState::Starting => Output::Starting,
VmmState::Running => Output::Running,
VmmState::Stopping => Output::Stopping,
VmmState::Stopped => Output::Stopped,
Expand All @@ -74,7 +98,9 @@ impl From<VmmState> for sled_agent_client::types::VmmState {
fn from(value: VmmState) -> Self {
use sled_agent_client::types::VmmState as Output;
match value {
VmmState::Starting => Output::Starting,
// The `Creating` state is internal to Nexus; the outside world
// should treat it as equivalent to `Starting`.
VmmState::Creating | VmmState::Starting => Output::Starting,
VmmState::Running => Output::Running,
VmmState::Stopping => Output::Stopping,
VmmState::Stopped => Output::Stopped,
Expand All @@ -86,9 +112,8 @@ impl From<VmmState> for sled_agent_client::types::VmmState {
}
}

impl From<omicron_common::api::internal::nexus::VmmState> for VmmState {
fn from(value: omicron_common::api::internal::nexus::VmmState) -> Self {
use omicron_common::api::internal::nexus::VmmState as ApiState;
impl From<ApiState> for VmmState {
fn from(value: ApiState) -> Self {
use VmmState as Output;
match value {
ApiState::Starting => Output::Starting,
Expand All @@ -108,7 +133,12 @@ impl From<VmmState> for omicron_common::api::external::InstanceState {
use omicron_common::api::external::InstanceState as Output;

match value {
VmmState::Starting => Output::Starting,
// An instance with a VMM which is in the `Creating` state maps to
// `InstanceState::Starting`, rather than `InstanceState::Creating`.
// If we are still creating the VMM, this is because we are
// attempting to *start* the instance; instances may be created
// without creating a VMM to run them, and then started later.
VmmState::Creating | VmmState::Starting => Output::Starting,
VmmState::Running => Output::Running,
VmmState::Stopping => Output::Stopping,
// `SagaUnwound` should map to `Stopped` so that an `instance_view`
Expand All @@ -129,3 +159,20 @@ impl diesel::query_builder::QueryId for VmmStateEnum {
type QueryId = ();
const HAS_STATIC_QUERY_ID: bool = false;
}

#[cfg(test)]
mod tests {
use super::*;

#[test]
fn test_all_terminal_api_states_are_terminal_db_states() {
hawkw marked this conversation as resolved.
Show resolved Hide resolved
for &api_state in ApiState::TERMINAL_STATES {
let db_state = VmmState::from(api_state);
assert!(
db_state.is_terminal(),
"API VmmState::{api_state:?} is considered terminal, but its \
corresponding DB state ({db_state:?}) is not!"
);
}
}
}
29 changes: 22 additions & 7 deletions nexus/db-queries/src/db/datastore/instance.rs
Original file line number Diff line number Diff line change
Expand Up @@ -156,6 +156,13 @@ impl InstanceAndActiveVmm {
(InstanceState::Vmm, Some(VmmState::SagaUnwound)) => {
external::InstanceState::Stopped
}
// - An instance with a "failed" VMM should *not* be counted as
// failed until the VMM is unlinked, because a start saga must be
// able to run "failed" instance. Until then, it will continue to
// appear "stopping".
(InstanceState::Vmm, Some(VmmState::Failed)) => {
external::InstanceState::Stopping
}
// - An instance with no VMM is always "stopped" (as long as it's
// not "starting" etc.)
(InstanceState::NoVmm, _vmm_state) => {
Expand Down Expand Up @@ -360,21 +367,21 @@ impl DataStore {
.collect())
}

/// List all instances with active VMMs in the `Destroyed` state that don't
/// have currently-running instance-updater sagas.
/// List all instances with active VMMs in the provided [`VmmState`] which
/// don't have currently-running instance-updater sagas.
///
/// This is used by the `instance_updater` background task to ensure that
/// update sagas are scheduled for these instances.
pub async fn find_instances_with_destroyed_active_vmms(
pub async fn find_instances_by_active_vmm_state(
&self,
opctx: &OpContext,
vmm_state: VmmState,
) -> ListResultVec<Instance> {
use db::model::VmmState;
use db::schema::instance::dsl;
use db::schema::vmm::dsl as vmm_dsl;

vmm_dsl::vmm
.filter(vmm_dsl::state.eq(VmmState::Destroyed))
.filter(vmm_dsl::state.eq(vmm_state))
// If the VMM record has already been deleted, we don't need to do
// anything about it --- someone already has.
.filter(vmm_dsl::time_deleted.is_null())
Expand Down Expand Up @@ -734,7 +741,7 @@ impl DataStore {
format!(
"cannot set migration ID {migration_id} for instance \
{instance_id} (perhaps another migration ID is \
already present): {error:#}"
already present): {error:#?}"
),
),
})
Expand Down Expand Up @@ -818,7 +825,15 @@ impl DataStore {
vmm_dsl::vmm
.on(vmm_dsl::sled_id
.eq(sled_dsl::id)
.and(vmm_dsl::time_deleted.is_null()))
.and(vmm_dsl::time_deleted.is_null())
// Ignore instances which are in states that are not
// known to exist on a sled. Since this query drives
// instance-watcher health checking, it is not necessary
// to perform health checks for VMMs that don't actually
// exist in real life.
.and(
vmm_dsl::state.ne_all(VmmState::NONEXISTENT_STATES),
))
.inner_join(
instance_dsl::instance
.on(instance_dsl::id
Expand Down
49 changes: 45 additions & 4 deletions nexus/db-queries/src/db/datastore/vmm.rs
Original file line number Diff line number Diff line change
Expand Up @@ -82,11 +82,9 @@ impl DataStore {
opctx: &OpContext,
vmm_id: &PropolisUuid,
) -> UpdateResult<bool> {
let valid_states = vec![DbVmmState::Destroyed, DbVmmState::Failed];

let updated = diesel::update(dsl::vmm)
.filter(dsl::id.eq(vmm_id.into_untyped_uuid()))
.filter(dsl::state.eq_any(valid_states))
.filter(dsl::state.eq_any(DbVmmState::DESTROYABLE_STATES))
.filter(dsl::time_deleted.is_null())
.set(dsl::time_deleted.eq(Utc::now()))
.check_if_exists::<Vmm>(vmm_id.into_untyped_uuid())
Expand Down Expand Up @@ -307,6 +305,49 @@ impl DataStore {
})
}

/// Transitions a VMM to the `SagaUnwound` state.
///
/// # Warning
///
/// This may *only* be called by the saga that created a VMM record, as it
/// unconditionally increments the generation number and advances the VMM to
/// the `SagaUnwound` state.
///
/// This is necessary as it is executed in compensating actions for
/// unwinding saga nodes which cannot easily determine whether other
/// actions, which advance the VMM's generation, have executed before the
/// saga unwound.
pub async fn vmm_mark_saga_unwound(
&self,
opctx: &OpContext,
vmm_id: &PropolisUuid,
) -> Result<bool, Error> {
diesel::update(dsl::vmm)
.filter(dsl::time_deleted.is_null())
.filter(dsl::id.eq(vmm_id.into_untyped_uuid()))
.set((
dsl::state.eq(DbVmmState::SagaUnwound),
dsl::time_state_updated.eq(chrono::Utc::now()),
dsl::state_generation.eq(dsl::state_generation + 1),
))
.check_if_exists::<Vmm>(vmm_id.into_untyped_uuid())
.execute_and_check(&*self.pool_connection_authorized(opctx).await?)
.await
.map(|r| match r.status {
UpdateStatus::Updated => true,
UpdateStatus::NotUpdatedButExists => false,
})
.map_err(|e| {
public_error_from_diesel(
e,
ErrorHandler::NotFoundByLookup(
ResourceType::Vmm,
LookupType::ById(vmm_id.into_untyped_uuid()),
),
)
})
}

/// Forcibly overwrites the Propolis IP/Port in the supplied VMM's record with
/// the supplied Propolis IP.
///
Expand Down Expand Up @@ -357,7 +398,7 @@ impl DataStore {

paginated(dsl::vmm, dsl::id, pagparams)
// In order to be considered "abandoned", a VMM must be:
// - in the `Destroyed` or `SagaUnwound` state
// - in the `Destroyed`, `SagaUnwound`, or `Failed` states
.filter(dsl::state.eq_any(DbVmmState::DESTROYABLE_STATES))
// - not deleted yet
.filter(dsl::time_deleted.is_null())
Expand Down
Loading
Loading