test failed in CI: test_omdb_success_cases #6505
Bummer -- and thanks for filing this. From the output, it looks to me like the test ran the command
That's not very much to go on. We don't have more because this was a subprocess -- the test ultimately failed only because the output didn't match what it expected. I haven't totally given up yet but I've put up #6516 so that if we hit this again we'll get more information about a panic from the subprocess.
The panic message is coming from here: [...] But I think that's just propagating a panic that happened in the middle of just about anything that async-bb8-diesel was doing. There are a few unwraps in omicron/dev-tools/omdb/src/bin/omdb/db.rs (lines 2837 to 2920 at a77c31b).
But if we panicked in those, I don't think it would show up in async-bb8-diesel. I'm trying to figure out what would show up there. We're not using [...]. I'm also going to file an async-bb8-diesel bug because it seems like it could propagate more information about the panic in this situation.
Actually, I'm not sure this is an async-bb8-diesel bug. Looking more closely at the
One way I could imagine this happening is if the program started an async database operation (like
I hit this one today on PR #6652: https://buildomat.eng.oxide.computer/wg/0/details/01J8JH1GHTE595MAF5YBTA8BS6/3ybl2B1ZPCoj5D0UfV5BgvmfQpYqlpYGkOVlmy3dpaUvVN5L/01J8JH2AVHD7EAR6AT7ZG8WS72#S5595. Wanted to comment here because it looks like this particular flake may result in very different panic messages depending on which OMDB command actually hit the issue, so folks hitting this flake might report new issues for it that are actually duplicates of this.
Appears to have bitten #6698, also in an omdb test: https://buildomat.eng.oxide.computer/wg/0/details/01J8RD8168F85NSREWHQ1SDSRG/CYHvVeadfxQl3Yi4jpbKPAmUTiZePQCq9PVtL9VzofHG4CL6/01J8RD8TN8DS3QGWAZHEPVJYEW5
Could this be a shutdown task ordering issue? At https://docs.rs/crate/async-bb8-diesel/0.2.1/source/src/async_traits.rs#95, if the spawned task is cancelled due to the runtime shutting down, then there's going to be a panic here.
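For illustration, here is a minimal sketch of the pattern being described (hypothetical code, assuming the pool hands blocking Diesel work to tokio and unwraps the join result; this is not async-bb8-diesel's actual implementation):

```rust
use tokio::task;

// Hypothetical stand-in for "run a blocking database operation on a worker
// thread"; names and types here are illustrative only.
async fn run_blocking_query() -> u32 {
    let handle = task::spawn_blocking(|| {
        // Pretend this is the synchronous Diesel work.
        42
    });
    // If the runtime is shutting down, the spawned task can be cancelled
    // before (or while) it runs; `handle.await` then yields Err(JoinError),
    // and this unwrap is what turns that into a panic.
    handle.await.unwrap()
}

#[tokio::main]
async fn main() {
    println!("result: {}", run_blocking_query().await);
}
```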
Yeah, looking at it I'm pretty sure that it is a shutdown ordering issue. The API is generic over arbitrary E so there's sadly no good place to put in "child task got cancelled, runtime shutting down" at the moment. So we'll probably have to make a breaking change to async-bb8-diesel.
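To make that constraint concrete, here is a hedged sketch of the API shape (illustrative signature only, not the library's real one): because the error type E is chosen by the caller, there is no value of type E available to report that the worker task itself was cancelled.

```rust
// Illustrative only: a function generic over the caller's error type E has
// nowhere to report "the spawned task was cancelled because the runtime is
// shutting down", so the JoinError can only be unwrapped (panicking) or
// silently swallowed.
async fn run_blocking<R, E>(
    job: impl FnOnce() -> Result<R, E> + Send + 'static,
) -> Result<R, E>
where
    R: Send + 'static,
    E: Send + 'static,
{
    tokio::task::spawn_blocking(job).await.unwrap()
}
```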
Hmm, but as Dave pointed out this should only happen if the underlying task was cancelled. And the
Ah, according to tokio-rs/tokio#3805 (comment) what's happening is that the runtime is shutting down before |
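As a hedged, self-contained demonstration of that behavior (an assumed reproduction, not the test's actual code): a task that is still pending when its runtime shuts down completes with a cancelled JoinError, which is exactly what an unwrap on the join result turns into a panic.

```rust
use std::time::Duration;
use tokio::runtime::Runtime;

fn main() {
    let rt = Runtime::new().unwrap();
    // Spawn a task that won't finish before we tear the runtime down.
    let handle = rt.spawn(async {
        tokio::time::sleep(Duration::from_secs(10)).await;
        42
    });
    // Dropping the runtime shuts it down and cancels still-pending tasks.
    drop(rt);
    // Awaiting the handle afterwards (here on a fresh runtime, just so we
    // can poll it) observes the cancellation.
    let err = Runtime::new().unwrap().block_on(handle).unwrap_err();
    assert!(err.is_cancelled());
}
```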
I've put up a tentative PR at oxidecomputer/async-bb8-diesel#77. I don't feel great about it, but I also don't see another way sadly. This is going to end up infecting Omicron as well -- we'll no longer be dealing with Diesel errors, but instead with this new wrapper error type. (And here again we'd need to be careful not to panic on all errors -- instead, if the error is a shutdown error, we'd have to silently ignore it somehow.) Ugh.
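For concreteness, a rough sketch of what such a wrapper error type could look like (hypothetical names; the actual type proposed in oxidecomputer/async-bb8-diesel#77 may differ):

```rust
/// Hypothetical wrapper error: callers now have to distinguish "the
/// operation itself failed" from "the worker task was cancelled".
#[derive(Debug)]
pub enum RunError<E> {
    /// The underlying (e.g. Diesel) error returned by the operation.
    User(E),
    /// The spawned task was cancelled, typically because the tokio runtime
    /// is shutting down; callers may want to ignore this rather than panic.
    Cancelled,
}
```

Callers in Omicron would then match on the cancellation variant during shutdown and ignore it, rather than panicking on every error.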
I understand why we think async-bb8-diesel is causing this particular backtrace - the diagnosis of "We are trying to spawn new work while the tokio runtime is shutting down" seems accurate - but this seems like it might be a secondary failure, rather than the primary reason for the test failing. Framed another way: why are we trying to spawn new work amid a runtime shutdown? I think that propagating better error information from async-bb8-diesel would be worthwhile, just want to confirm my understanding here that "it's weird
I'm guessing the reason we are wondering about async-bb8-diesel is that, as I understand it, nothing else in the
Is there some kind of background task that might be hitting the DB periodically? Edit: it's a bit hard to be completely sure, but the stack trace does seem to suggest this is happening within a task.
The error is coming from here:
Yeah, this tracks with the timing when qorb was integrated into Omicron (in dd85331, which landed right before this bug was first reported). On the bright side, this doesn't seem like a bug that would impact prod, but rather a test shutdown-ordering issue. I'll look at how we're terminating the pool. If we can cleanly terminate the qorb pool when the test exits, that should also help resolve this issue.
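For reference, a hedged sketch of what "cleanly terminate the pool before the runtime shuts down" could look like in a test (hypothetical types and names, not qorb's actual API):

```rust
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

// Hypothetical background pool: a task that periodically touches the
// database, plus a handle that lets a test shut it down deterministically.
struct BackgroundPool {
    shutdown_tx: oneshot::Sender<()>,
    worker: JoinHandle<()>,
}

impl BackgroundPool {
    fn start() -> Self {
        let (shutdown_tx, mut shutdown_rx) = oneshot::channel();
        let worker = tokio::spawn(async move {
            loop {
                tokio::select! {
                    _ = &mut shutdown_rx => break,
                    _ = tokio::time::sleep(Duration::from_millis(100)) => {
                        // Periodic work (e.g. connection health checks) that
                        // would otherwise race with runtime shutdown.
                    }
                }
            }
        });
        Self { shutdown_tx, worker }
    }

    // Ask the worker to stop and wait for it, so nothing is left trying to
    // spawn new blocking work while the runtime tears down.
    async fn terminate(self) {
        let _ = self.shutdown_tx.send(());
        let _ = self.worker.await;
    }
}

#[tokio::main]
async fn main() {
    let pool = BackgroundPool::start();
    // ... test body would go here ...
    pool.terminate().await; // clean shutdown before the runtime exits
}
```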
Sorry, I'm a little confused about our hypothesized sequence of events leading to this. Is it something like the following, in the child process:
(I feel like that's not exactly right, but I'm just trying to put together the pieces above.)
Yeah, it's worth clarifying -- there are a lot of moving pieces. This is my hypothesis:
The qorb "termination" code is pretty half-baked right now -- it just calls [...]. My plan is the following:
#6881 is my proposed fix, with an attempt to summarize "what I believe is going wrong" in the PR message.
This test failed on a CI run on pull request 6475:
https://github.com/oxidecomputer/omicron/pull/6475/checks?check_run_id=29546110600
https://buildomat.eng.oxide.computer/wg/0/details/01J6RJ0W9K2R1TX0DVBZ0RS47V/qhyGpI4O40yzHVoFHWrAhRBFaESiU4fFqaOicq5NLEyLHAz2/01J6RJ164N5KYG7G3SJ5PFFX0H
Log showing the specific test failure:
https://buildomat.eng.oxide.computer/wg/0/details/01J6RJ0W9K2R1TX0DVBZ0RS47V/qhyGpI4O40yzHVoFHWrAhRBFaESiU4fFqaOicq5NLEyLHAz2/01J6RJ164N5KYG7G3SJ5PFFX0H#S5276
Excerpt from the log showing the failure: