Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test failed in CI: omicron-cockroach-admin cockroach_cli::tests::test_node_decommission_compatibility #6506

Open
hawkw opened this issue Sep 2, 2024 · 3 comments
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@hawkw
Copy link
Member

hawkw commented Sep 2, 2024

This test failed on a CI run on #6499:

Unfortunately, I rebased the PR and have since lost the GitHub check URL for the failure...

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01J6Q8748MZKW18KSEF4540CQZ/iy5XpuAznwyLAetcf2Mfjh1o9XO11hzknMn8Mr0yhl9xfXK3/01J6Q87HTCRRS09RECVKRHBBSR#S5100

Excerpt from the log showing the failure:

failures:
    cockroach_cli::tests::test_node_decommission_compatibility

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 2 filtered out; finished in 3.47s


--- STDERR:              omicron-cockroach-admin cockroach_cli::tests::test_node_decommission_compatibility ---
log file: /var/tmp/omicron_tmp/omicron_cockroach_admin-e0b1ce4a08f3ea08-test_node_decommission_compatibility.38877.0.log
note: configured to log to "/var/tmp/omicron_tmp/omicron_cockroach_admin-e0b1ce4a08f3ea08-test_node_decommission_compatibility.38877.0.log"
thread 'cockroach_cli::tests::test_node_decommission_compatibility' panicked at cockroach-admin/src/cockroach_cli.rs:236:36:
header line
stack backtrace:
   0: rust_begin_unwind
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
   2: core::panicking::panic_display
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:262:5
   3: core::option::expect_failed
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/option.rs:1995:5
   4: core::option::Option<T>::expect
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/option.rs:898:21
   5: {async_block#0}
             at ./src/cockroach_cli.rs:236:23
   6: poll<&mut dyn core::future::future::Future<Output=()>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   7: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   8: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:57
   9: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:107:5
  10: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:73:5
  11: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:25
  12: tokio::runtime::scheduler::current_thread::Context::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:412:19
  13: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:672:36
  14: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:68
  15: tokio::runtime::context::scoped::Scoped<T>::set
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/scoped.rs:40:9
  16: tokio::runtime::context::set_scheduler::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:26
  17: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:283:12
  18: std::thread::local::LocalKey<T>::with
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:260:9
  19: tokio::runtime::context::set_scheduler
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:9
  20: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:27
  21: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:660:19
  22: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:180:28
  23: tokio::runtime::context::runtime::enter_runtime
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/runtime.rs:65:16
  24: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:168:9
  25: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:361:47
  26: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:335:13
  27: test_node_decommission_compatibility
             at ./src/cockroach_cli.rs:269:9
  28: omicron_cockroach_admin::cockroach_cli::tests::test_node_decommission_compatibility::{{closure}}
             at ./src/cockroach_cli.rs:208:52
  29: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
  30: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: "/var/tmp/omicron_tmp/.tmpQpS5s8"
If you would like to access the database for debugging, run the following:

# Run the database
cargo xtask db-dev run --no-populate --store-dir "/var/tmp/omicron_tmp/.tmpQpS5s8/data"
# Access the database. Note the port may change if you run multiple databases.
cockroach sql --host=localhost:32221 --insecure

It looks like we are waiting for output from a CRDB CLI process that never seems to appear? The test has passed successfully on subsequent runs.

@hawkw hawkw added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label Sep 2, 2024
@jgallagher
Copy link
Contributor

Weird! I think a panic there means stdout was empty, so (presumably?) the command failed. I'll add some logging that should help the next time it flakes, assuming that guess is correct.

jgallagher added a commit that referenced this issue Sep 3, 2024
Hoping to shed light on the test flake in #6506
jgallagher added a commit that referenced this issue Sep 3, 2024
Hoping to shed light on the test flake in #6506
@iliana
Copy link
Contributor

iliana commented Sep 26, 2024

Some logging now! https://buildomat.eng.oxide.computer/wg/0/details/01J8QWRHJZ9MWKRFH6N67ZE6TA/L16FQY2MG9K1N2SdCCur5TZDpyIdjJBGkNDcoewBXHw7U8gH/01J8QWSDBXPAZZ83A3GDJ8JYAM#S5865

Executed [node decommission 1 --wait none --url postgresql://root@[::1]:33723/omicron?sslmode=disable --format csv]
  Status: exit status: 1
  Stdout:
----------------
----------------
  Stderr:
----------------
warning: --url specifies user/password, but command "decommission" does not accept user/password details - details ignored
warning: --url specifies database "omicron", but command "decommission" does not accept a database name - database name ignored
ERROR: connection lost.
rpc error: code = Unavailable desc = node waiting for init; /cockroach.server.serverpb.Status/Node not available
Failed running "node decommission"

@jgallagher
Copy link
Contributor

Hmm, interesting! So it looks like the test thinks cockroach is up and tries to decommission a node, but the node it's talking to isn't actually up yet. I think the logs confirm this. From https://buildomat.eng.oxide.computer/wg/0/artefact/01J8QWRHJZ9MWKRFH6N67ZE6TA/L16FQY2MG9K1N2SdCCur5TZDpyIdjJBGkNDcoewBXHw7U8gH/01J8QWSDBXPAZZ83A3GDJ8JYAM/01J8QZYHP5MQ110BP4NF9JSR6Q/omicron_cockroach_admin-953fa556bc995186-test_node_decommission_compatibility.43654.0.log?format=x-bunyan, we see these log lines (note the timestamps):

2024-09-26T19:42:56.977Z	INFO	test_node_decommission_compatibility: cockroach command line: cockroach start-single-node --insecure --http-addr=:0 --store=path=/var/tmp/omicron_tmp/.tmpCYRwlf/data,ballast-size=0 --listen-addr [::1]:0 --listening-url-file /var/tmp/omicron_tmp/.tmpCYRwlf/listen-url
2024-09-26T19:42:58.581Z	INFO	test_node_decommission_compatibility: cockroach pid: 43940
2024-09-26T19:42:58.581Z	INFO	test_node_decommission_compatibility: cockroach listen URL: postgresql://root@[::1]:33723/omicron?sslmode=disable

The last two are from setup_database:

info!(&log, "cockroach pid: {}", database.pid());
let db_url = database.pg_config();
info!(&log, "cockroach listen URL: {}", db_url);

Immediately following that is this match for populating the db:

// If we populate the storage directory by importing the '.sql'
// file, we must do so after the DB has started.
match &storage_source {
StorageSource::DoNotPopulate | StorageSource::CopyFromSeed { .. } => {}
StorageSource::PopulateLatest { .. } => {
info!(&log, "cockroach: populating");
database.populate().await.expect("failed to populate database");
info!(&log, "cockroach: populated");
}
}

In this test we're using CopyFromSeed, so that does nothing, and then setup_database returns; the next thing the test does is try to decommission a node.

But if we switch over to the cockroach startup logfile (https://buildomat.eng.oxide.computer/wg/0/artefact/01J8QWRHJZ9MWKRFH6N67ZE6TA/L16FQY2MG9K1N2SdCCur5TZDpyIdjJBGkNDcoewBXHw7U8gH/01J8QWSDBXPAZZ83A3GDJ8JYAM/01J8QZYASHFKGGAKVXYBXM1ZNJ/cockroach.log?format=x-bunyan, which I believe is the right one based on the command line args matching what setup_database logged above), a whole bunch more startup stuff happens after 2024-09-26T19:42:58.581Z (the timestamp on when setup_database thought setup was done). I'm not sure when we ought to consider CRDB "up", but probably we should wait for the "setup complete" bits?

19:42:59.493026 49 1@server/server_sql.go:1360 ⋮ [n1] 58  serving sql connections
19:42:59.493168 49 1@cli/start.go:1028 ⋮ [n1] 59  node startup completed:
19:42:59.493168 49 1@cli/start.go:1028 ⋮ [n1] 59 +CockroachDB node starting at 2024-09-26 19:42:59.493063302 +0000 UTC (took 1.7s)

This is nearly a full second later than when the test thought the database was up. My gut feeling is that setup_database should wait for this, somehow; does that seem reasonable?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Projects
None yet
Development

No branches or pull requests

3 participants