Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus] Reincarnate instances with SagaUnwound VMMs #6669

Merged
merged 13 commits into from
Sep 27, 2024
Merged

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Sep 25, 2024

When an instance-start saga unwinds, any VMM it created transitions to
the SagaUnwound state. This causes the instance's effective state to
appear as Failed in the external API. PR #6503 added functionality to
Nexus to automatically restart instances that are in the Failed state
("instance reincarnation"). However, the current instance-reincarnation
task will not automatically restart instances whose instance-start
sagas have unwound, because such instances are not actually in the
Failed state from Nexus' perspective.

This PR implements reincarnation for instances whose instance-start
sagas have failed. This is done by changing the instance_reincarnation
background task to query the database for instances which have
SagaUnwound active VMMs, and then run instance-start sagas for them
identically to how it runs start sagas for Failed instances.

I decided to perform two separate queries to list Failed instances and
to list instances with SagaUnwound VMMs, because the SagaUnwound
query requires a join with the vmm table, and I thought it was a bit
nicer to be able to find Failed instances without having to do the
join, and only do it when looking for SagaUnwound ones. Also, having
two queries makes it easier to distinguish between Failed and
SagaUnwound instances in logging and the OMDB status output. This
ended up being implemented by adding a parameter to the
DataStore::find_reincarnatable_instances method that indicates which
category of instances to select; I had previously considered making the
method on the InstanceReincarnation struct that finds instances and
reincarnates them take the query as a Fn taking the datastore and
DataPageParams and returning an impl Future outputting
Result<Vec<Instance>, ...>,but figuring out generic lifetimes for the
pagination stuff was annoying enough that this felt like the simpler
choice.

Fixes #6638

@hawkw hawkw added the nexus Related to nexus label Sep 25, 2024
@hawkw hawkw requested a review from gjcolombo September 25, 2024 20:25
@hawkw hawkw self-assigned this Sep 25, 2024
@hawkw
Copy link
Member Author

hawkw commented Sep 25, 2024

When merging this, we should also be sure to merge #6658, since otherwise,
SagaUnwound instances will appear as Stopped in the external API even though
we will automatically restart them, which feels weird

since we print these in OMDB, it breaks the success cases expectorate
tests to use unordered hashmaps...
i dont know whats wrong with me
@hawkw hawkw enabled auto-merge (squash) September 27, 2024 17:00
@hawkw
Copy link
Member Author

hawkw commented Sep 27, 2024

Well that's extremely spooky, it looks like this worked fine on commit 0b7f72e but then somehow broke on commit 8f89106: https://buildomat.eng.oxide.computer/wg/0/details/01J8T6F6B4TYVZVGS9NVY6RXJ8/m4ivC9CI7YNrcLE1S1dUTosEDmIl3bfax3fd4qNVIe7XiKua/01J8T6G2PKB9TADGAMV5DAR8R8

@hawkw
Copy link
Member Author

hawkw commented Sep 27, 2024

(also, it occurred to me that we probably want to make unwinding start sagas check if they should immediately kick the reincarnation task...)

@hawkw
Copy link
Member Author

hawkw commented Sep 27, 2024

Aaaand it passes on my machine:

    Finished `test` profile [unoptimized + debuginfo] target(s) in 3m 35s
------------
 Nextest run ID a90e00f1-faac-4db4-82a3-f32ca87dd2bd with nextest profile: default
    Starting 3 tests across 162 binaries (1536 tests and 5 binaries skipped, including 5 binaries via profile.default.default-filter)
       SETUP [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
             [ 00:00:00] [                   ]    0/1539:      
   Compiling nexus-config v0.1.0 (/home/eliza/Code/oxide/omicron/nexus-config)
   Compiling omicron-test-utils v0.1.0 (/home/eliza/Code/oxide/omicron/test-utils)
   Compiling crdb-seed v0.1.0 (/home/eliza/Code/oxide/omicron/dev-tools/crdb-seed)
    Finished `test` profile [unoptimized + debuginfo] target(s) in 3.76s
     Running `target/debug/crdb-seed`
Sep 27 18:54:25.474 INFO Using existing CRDB seed tarball: `/tmp/crdb-base-eliza/7888c2fb782f3500cf5404b9680c8152f4554f551acee683d7a98ec218b57e57.tar`
  SETUP PASS [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
        PASS [  14.789s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_reincarnates_failed_instances
        PASS [  28.533s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_cooldown_on_subsequent_reincarnations
        PASS [  29.511s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_only_reincarnates_eligible_instances
------------
     Summary [  33.341s] 3 tests run: 3 passed, 1536 skipped

I bet this is a race between periodic and explicit activations of the reincarnation task. Cool.

@hawkw hawkw force-pushed the eliza/saga-rewound branch from 0645a37 to 19f9f16 Compare September 27, 2024 20:37
@hawkw hawkw merged commit d7235b8 into main Sep 27, 2024
15 of 16 checks passed
@hawkw hawkw deleted the eliza/saga-rewound branch September 27, 2024 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nexus Related to nexus
Projects
None yet
Development

Successfully merging this pull request may close these issues.

reincarnate instances with SagaUnwound VMMs
2 participants